FeUdal Networks for Hierarchical Reinforcement Learning (statwiki summary, last edited 2017-12-03)
<hr />
<div>= Introduction =<br />
<br />
Reinforcement learning (RL) is a facet of machine learning, inspired by behaviourist psychology, in which an agent learns to take actions that maximize its cumulative reward. Even though deep reinforcement learning has been hugely successful in a variety of domains, it has struggled in environments with sparse reward signals. Take for instance the infamous<br />
Montezuma’s Revenge ATARI game ([https://www.reddit.com/r/MachineLearning/comments/45fa9o/why_montezuma_revenge_doesnt_work_in_deepmind/ a related discussion on Reddit]) which encounters a major challenge of long-term credit assignment. Essentially, the agent is not able to attribute a reward to an action taken several timesteps back. <br />
<br />
This paper proposes a hierarchical reinforcement learning (HRL) architecture, called FeUdal Networks (FuN), inspired by Feudal Reinforcement Learning (FRL)[3]. One of the main characteristics of FRL is that goals can be generated in a top-down fashion, and goal setting can be decoupled from goal achievement: each level in the hierarchy communicates a goal to, and delegates work to, the level below it, but does not specify how to achieve it. FuN is a fully-differentiable neural network with two levels of hierarchy – a Manager module at the top level and a Worker module below. The Manager sets abstract goals, which are learned, at a lower temporal resolution in a latent state-space. The Worker operates at a higher temporal resolution and produces primitive actions at every tick of the environment, motivated by an intrinsic reward to follow the goals received from the Manager.<br />
<br />
The key contributions of the authors in this paper are: <br />
# A consistent, end-to-end differentiable FRL inspired HRL;<br />
# A novel, approximate transition policy gradient update for training the Manager;<br />
# The use of goals that are directional rather than absolute in nature; <br />
# Dilated LSTM – a novel RNN design for the Manager that allows gradients to flow through large hops in time.<br />
<br />
The experiments conducted on several tasks which involve sparse rewards show that FuN significantly outperforms a strong baseline agent on tasks that involve long-term credit assignment and memorization.<br />
<br />
= Related Work =<br />
<br />
Several hierarchical reinforcement learning models have been proposed to solve this problem. The options framework [4] considers the problem with a two-level hierarchy, with options typically learned using sub-goals and ‘pseudo-rewards’ that are provided explicitly. In contrast, the option-critic architecture[1] uses the policy gradient theorem to learn options in an end-to-end fashion. A problem with learning options end-to-end is that they tend toward one of two trivial solutions: (i) only one option is active, which solves the whole task; or (ii) a policy-over-options changes options at every step, micro-managing the behavior. The authors state that the option-critic architecture is the only other end-to-end trainable system with sub-policies. A key difference between the authors' approach and the options framework is that the top level (Manager) produces a meaningful and explicit goal for the bottom level (Worker) to achieve.<br />
<br />
Non-hierarchical deep RL (non-HRL) methods using auxiliary losses and rewards, such as pseudo-counts for exploration[2], have significantly improved results by stimulating agents to explore new parts of the state space. The UNREAL agent[9] is another non-HRL method that showed a strong improvement using unsupervised auxiliary tasks. Such auxiliary adjustments to policy search improve learning because, in more complex environments, policies that lead to the onset of a reward take a long time to discover and are therefore harder to train. The authors call this reward sparsity, and they use an auxiliary task of reward prediction – predicting the onset of immediate reward given some historical context.<br />
<br />
= Preliminaries = <br />
=== Long short-term memory ===<br />
The Long short-term memory (LSTM) network is a recurrent architecture often used as a building block of a larger recurrent network. An LSTM consists of four main components: a cell, an input gate, an output gate, and a forget gate. The cell remembers values over arbitrary time intervals. The three gates are often interpreted as artificial neurons, as in an MLP, and their parameters are also learnt during training. The network was designed to mimic short-term memory that can last for a long period of time. LSTMs are well suited to classifying, processing, and predicting time series with time lags of unknown size and duration. An LSTM structure is used for the Manager in this paper.<br />
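To make the gate structure above concrete, here is a minimal NumPy sketch of a single LSTM step (not the paper's implementation; the stacked weight layout `W`, `U`, `b` and the gate ordering are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b, n):
    """One LSTM step with input, forget, and output gates plus a candidate cell.
    W, U, b stack the four gate transforms; n is the hidden size."""
    z = W @ x + U @ h + b          # all four pre-activations at once
    i = sigmoid(z[:n])             # input gate: how much new content to write
    f = sigmoid(z[n:2*n])          # forget gate: how much old cell to keep
    o = sigmoid(z[2*n:3*n])        # output gate: how much cell to expose
    c_hat = np.tanh(z[3*n:])       # candidate cell values
    c = f * c + i * c_hat          # cell remembers values over time
    h = o * np.tanh(c)             # hidden output
    return h, c
```

The cell state `c` is what lets the network carry information across long time lags: with the forget gate near 1 and the input gate near 0, the cell passes through unchanged.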
<br />
= Model =<br />
<br />
[[File:feudal_network_model_diagram.png|frame]]<br />
<br />
A high-level explanation of the model is as follows: <br />
<br />
The Manager computes a latent state representation <math>s_t</math> and outputs a goal vector <math>g_t</math>. The Worker outputs actions based on the environment observation, its own state, and the Manager’s goal. A perceptual module computes an intermediate representation <math>z_t</math> of the environment observation <math>x_t</math>, which is shared as input by both the Manager and the Worker. The Manager’s goals <math>g_t</math> are trained using an approximate transition policy gradient. The Worker is then trained via an intrinsic reward which stimulates it to output actions that will achieve the goals set by the Manager.<br />
<br />
<center><br />
[[File:model_definition.png|500px]]<br />
</center><br />
<br />
Manager and Worker are recurrent networks (<math>{h^M}</math> and <math>{h^W}</math> being their internal states). <math>\phi</math> is a linear transform that maps a goal <math>g_t</math> into an embedding vector <math>w_t \in {R^k}</math>, which is then combined with matrix <math>U_t</math> (Worker's output) via a matrix-vector product to produce the policy <math>\pi</math> – a vector of probabilities over primitive actions. The projection <math>\phi</math> is linear, with no biases, and is learnt with gradients coming from the Worker’s actions. Since <math>\phi</math> has no biases, it can never produce a constant non-zero vector – which is the only way the setup could ignore the Manager’s input. This ensures that the goal output by the Manager always influences the final policy. In summary, my understanding of the goal-embedding model is: (i) independently of the state, the Worker defines a set of partitions of the goal sphere via the embedding function, so each dimension responds positively to one half of the sphere and negatively to the other half; (ii) conditioned on the state, the Worker then describes each action as a weighted sum of those partitions. Hence, given the state, the Worker defines regions of the goal sphere where each action is more likely to be taken.<br />
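The goal-embedding mechanism above can be sketched in a few lines of NumPy. All dimensions and the random values are hypothetical; this only illustrates the bias-free projection <math>\phi</math> and the matrix-vector combination that yields the policy:

```python
import numpy as np

rng = np.random.default_rng(0)
k, num_actions, d = 16, 4, 256           # embedding size, action count, goal dim (illustrative)

g_t = rng.normal(size=d)                  # Manager's goal vector
phi = rng.normal(size=(k, d))             # linear projection phi: no bias term
U_t = rng.normal(size=(num_actions, k))   # Worker's output matrix

w_t = phi @ g_t                           # goal embedding w_t = phi(g_t)
logits = U_t @ w_t                        # combine via matrix-vector product
pi = np.exp(logits - logits.max())
pi /= pi.sum()                            # softmax -> probabilities over primitive actions
```

Because `phi` has no bias, `w_t` is zero only if `g_t` lies in its null space; a learned constant output that ignores `g_t` is impossible, which is exactly the property the text argues for.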
<br />
===Learning===<br />
The learning considers a standard reinforcement learning setup where the goal of the agent is to maximize the discounted return <math>R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}</math>, where <math>\gamma \in [0,1]</math> and <math>r_t</math> is the reward from the environment at timestep <math>t</math>. The agent's behavior is defined by its action-selection policy, <math>\pi</math>.<br />
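As a quick sanity check on the return definition, here is a minimal sketch that computes discounted returns for a finite reward sequence (using the common convention that <math>R_t</math> starts accumulating from the reward received at step <math>t</math>; a function name of my own choosing):

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Compute R_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every t."""
    R = np.zeros(len(rewards))
    running = 0.0
    # Accumulate backwards so each R_t includes all discounted future rewards.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        R[t] = running
    return R
```

For example, with rewards [0, 0, 1] and gamma = 0.5 the returns are [0.25, 0.5, 1.0]: the single terminal reward is discounted once per step of distance.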
<br />
Since FuN is fully differentiable, the authors could have trained it end-to-end using a policy gradient algorithm operating on the actions taken by the Worker, such that the outputs <math>g</math> of the Manager would be trained by gradients coming from the Worker. This, however, would deprive the Manager’s goals <math>g</math> of any semantic meaning, making them just internal latent variables of the model. Instead, the Manager is independently trained to predict advantageous directions (transitions) in state space and to intrinsically reward the Worker for following those directions.<br />
<br />
The update rule for the Manager:<br />
<br />
<br />
<center><br />
<math>\nabla g_t = A_t^M \nabla_\theta d_{cos}(s_{t+c} - s_t, g_t(\theta))</math><br />
</center><br />
<br />
<br />
In the above equation, <math>d_{cos}(\alpha, \beta) = \alpha^T \beta/(|\alpha||\beta|)</math> is the cosine similarity between two vectors and <math>A_t^M = R_t - V_t^M(x_t,\theta)</math> is the Manager’s advantage function, computed using a value function estimate <math>V_t^M(x_t,\theta)</math> from the internal critic. Here <math>c</math> is the horizon over which the Manager optimizes its direction; it is a hyperparameter of the model and controls the Manager's temporal resolution. Note that the last c goals of the Manager are first pooled by summation and then embedded into a vector, which makes the conditioning from the Manager vary smoothly. <br />
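The Manager's update can be read as gradient ascent on an objective whose value (before differentiation) looks like the following sketch. The function names are mine, and the `eps` term is a small assumption added to avoid division by zero; the paper's update differentiates this quantity with respect to the goal parameters:

```python
import numpy as np

def d_cos(a, b, eps=1e-8):
    """Cosine similarity d_cos(a, b) = a.b / (|a||b|)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def manager_objective(s_t, s_tc, g_t, advantage):
    """Quantity the Manager ascends: A^M * d_cos(s_{t+c} - s_t, g_t).
    Maximized when the goal points along the realized state-space direction
    and the advantage is positive."""
    return advantage * d_cos(s_tc - s_t, g_t)
```

Intuitively, a positive advantage pulls <math>g_t</math> toward the direction the state actually moved over the horizon c, while a negative advantage pushes it away.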
<br />
The intrinsic reward that encourages the Worker to follow the goals is defined as:<br />
<br />
<br />
<center><br />
<math>R_t^I = 1/c \sum_{i=1}^c d_{cos}(s_t - s_{t-i}, g_{t-i})</math> <br />
</center><br />
Directions (<math>s_t - s_{t-i}</math>) are used because it is more feasible for the Worker to reliably cause directional shifts in the latent state than to reach (potentially) arbitrary new absolute locations. <br />
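The intrinsic reward formula above translates directly into code. This is a straightforward sketch of the stated definition (my own function name; `states` and `goals` are sequences of vectors indexed by timestep):

```python
import numpy as np

def intrinsic_reward(states, goals, t, c):
    """R^I_t = (1/c) * sum_{i=1..c} d_cos(s_t - s_{t-i}, g_{t-i}).
    Rewards the Worker for moving in the directions the Manager asked for."""
    def d_cos(a, b, eps=1e-8):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps)
    total = 0.0
    for i in range(1, c + 1):
        total += d_cos(states[t] - states[t - i], goals[t - i])
    return total / c
```

If the latent state moves exactly along every recent goal direction, the reward approaches its maximum of 1; moving against the goals drives it toward −1.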
<br />
Compared to FRL[3], which advocated concealing the reward from the environment from lower levels of the hierarchy, the Worker in the FuN network is trained using an advantage actor-critic[5] to maximise a weighted sum <math>R_t + \alpha R_t^I</math>, where <math>\alpha</math> is a hyper-parameter that regulates the influence of the intrinsic reward:<br />
<br />
<br />
<center><br />
<math>\nabla {\pi}_t = A_t^D \nabla_\theta \log \pi (a_t|x_t;\theta)</math><br />
</center><br />
<br />
<br />
The advantage function <math>A_t^D = (R_t + \alpha R_t^I - V_t^D(x_t;\theta))</math> is calculated using an internal critic, which estimates the value functions for both rewards.<br />
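The Worker's update combines the two reward signals into one advantage and weights the log-probability gradient by it. A small sketch, assuming a softmax policy over discrete actions (the helper names are mine; the critic value `V` would come from the internal critic network):

```python
import numpy as np

def worker_advantage(R_ext, R_int, V, alpha):
    """A^D_t = R_t + alpha * R^I_t - V^D(x_t): extrinsic return plus
    alpha-weighted intrinsic return, baselined by the critic's estimate."""
    return R_ext + alpha * R_int - V

def logpi_grad(logits, a_t):
    """Gradient of log pi(a_t) w.r.t. the logits of a softmax policy:
    one-hot(a_t) minus the action probabilities."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    onehot = np.zeros_like(p)
    onehot[a_t] = 1.0
    return onehot - p
```

The actual parameter update would scale `logpi_grad` by `worker_advantage` and backpropagate through the network, exactly as in standard advantage actor-critic.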
<br />
The authors also make note of the fact that the Worker and Manager can have different discount factors <math>\gamma</math> for computing return. This allows the Worker to focus on more immediate rewards while the Manager can make decisions over a longer time horizon.<br />
<br />
===Transition Policy Gradient===<br />
The update rule for the Manager given above is a novel form of policy gradient with respect to a ''model'' of the Worker’s behavior. The Worker can follow a complex trajectory, but it is not necessarily required to learn from these samples. If the trajectories can be predicted, by modeling the transitions, then the policy gradient of the predicted transition can be followed instead of the Worker's actual path. FuN assumes a particular form for the transition model: that the direction in state-space, <math>s_{t+c} − s_t</math>, follows a von Mises–Fisher distribution (a probability distribution on the (p−1)-dimensional sphere in R<sup>p</sup>; see [15] for details).<br />
<br />
=Architecture=<br />
The perceptual module <math>f^{percept}</math> is a convolutional network (CNN) followed by a fully connected layer. Each convolutional and fully-connected layer is followed by a rectifier non-linearity. <math>f^{Mspace}</math>, which is another fully connected layer followed by a rectifier non-linearity, is used to compute the state space, which the Manager uses to formulate goals. The Worker’s recurrent network <math>f^{Wrnn}</math> is a standard LSTM[6].<br />
<br />
<br />
The Manager uses a novel architecture called a dilated LSTM (dLSTM), which operates at a lower temporal resolution than the data stream. It is similar to dilated convolutional networks[7] and the clockwork RNN. For a dilation radius r, the network is composed of r separate groups of sub-states or ‘cores’, denoted by <math>h = \{\hat{h}^i\}_{i=1}^r</math>. At time <math>t</math>, the network is governed by the following equation: <math>\hat{h}_t^{t\%r},g_t = LSTM(s_t, \hat{h}_{t-1}^{t\%r};\theta^{LSTM})</math>, where % denotes the modulo operation and indicates which group of cores is currently being updated. At each time step, only the corresponding part of the state is updated and the output is pooled across the previous c outputs. This allows the r groups of cores inside the dLSTM to preserve memories for long periods, yet the dLSTM as a whole is still able to process and learn from every input experience and to update its output at every step.<br />
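The core-cycling and output-pooling mechanics of the dLSTM can be sketched independently of the LSTM internals. This is a schematic of my own construction, not the authors' code: `step_fn` stands in for the shared-parameter LSTM update, and mean pooling over the last c outputs is an illustrative choice:

```python
import numpy as np

class DilatedLSTM:
    """Sketch of a dLSTM: r core state groups updated round-robin,
    with the output pooled over the last c per-step outputs."""
    def __init__(self, r, c, step_fn):
        self.r, self.c = r, c
        self.step_fn = step_fn            # (input, core_state) -> (output, new_core_state)
        self.cores = [None] * r           # r separate groups of sub-states
        self.outputs = []                 # history of per-step outputs for pooling
    def step(self, s_t, t):
        idx = t % self.r                  # only core group t % r is updated this tick
        out, self.cores[idx] = self.step_fn(s_t, self.cores[idx])
        self.outputs.append(out)
        return np.mean(self.outputs[-self.c:], axis=0)  # pool across previous c outputs
```

Because each core group is touched only every r steps, its recurrent gradient effectively spans r-times-longer horizons, while the pooled output still changes at every tick.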
<br />
=Experiments=<br />
The baseline the authors use is a recurrent LSTM[6] network on top of a representation learned by a CNN. The A3C method[5][16] is used for all reinforcement learning experiments. Backpropagation through time (BPTT)[8] is run after K forward passes of the network or when a terminal signal is received. For each method, 100 experiments were run, and a training epoch is defined as one million observations. The authors seem to have ignored Deep Q-Learning in their performance comparisons. Speedy Q-learning [14], a variant of Q-learning, addresses the slow rate of convergence (when the discount factor <math>\gamma</math> is close to 1) and achieves a slightly better rate of convergence than other model-based methods; comparisons with such methods could more fully assess the improvements of FeUdal Networks.<br />
<br />
==Montezuma’s Revenge==<br />
Montezuma’s Revenge is a prime example of an environment with sparse rewards. FuN starts learning much earlier and achieves much higher scores. It takes more than 300 epochs for the LSTM baseline to reach a score of 400, which corresponds to solving the first room (take the key, open a door). FuN solves the first room in less than 200 epochs and immediately moves on to explore further, eventually visiting several other rooms and scoring up to 2600 points.<br />
<br />
<center><br />
[[File:feudal_figure2.png|900px]]<br />
</center><br />
<br />
==ATARI==<br />
The experiment was run on a diverse set of ATARI games, some of which involve long-term credit assignment and some which are more reactive. Enduro stands out as all the LSTM agents completely fail at it. Frostbite is a hard game that requires both long-term credit assignment and good exploration. The best-performing frostbite agent is FuN with 0.95 Manager discount, which outperforms the rest by a factor of 7. The other results can be seen in the figure.<br />
<br />
<center><br />
[[File:feudal_figure4.png|900px]]<br />
</center><br />
<br />
==Comparing the option-critic architecture==<br />
The FuN network was run on the same games as Option-Critic (Asterix, Ms. Pacman, Seaquest, and Zaxxon); after 200 epochs it achieves a similar score on Seaquest, doubles the Option-Critic score on Ms. Pacman, more than triples it on Zaxxon, and achieves more than a 20x improvement on Asterix.<br />
<br />
<center><br />
[[File:feudal_figure7.png]]<br />
</center><br />
<br />
==Memory in Labyrinth==<br />
DeepMind Lab (Beattie et al., 2016) is a first-person 3D game platform extended from OpenArena. The games on which the experiments were run include a Water maze, T-maze, and Non-match (a visual memorization task). FuN consistently outperforms the LSTM baseline – it learns faster and also reaches a higher final reward. Interestingly, the LSTM agent doesn’t appear to use its memory for the water maze task at all, always circling the maze at roughly the same radius.<br />
<br />
<center><br />
[[File:feudal_figure5.png|800px]]<br />
[[File:feudal_figure6.png|800px]]<br />
</center><br />
<br />
==Ablative Analysis==<br />
Empirical evaluation of the main contributions of this paper:<br />
<br />
===Transition policy gradient===<br />
Experiments were run on modified FuN networks in which: 1) the Manager's output g is trained with gradients coming directly from the Worker and no intrinsic reward is used; 2) g is learned using a standard policy gradient approach, with the Manager emitting the mean of a Gaussian distribution from which goals are sampled; 3) g specifies absolute, rather than relative/directional, goals; and 4) a purely feudal version of FuN, in which the Worker is trained from the intrinsic reward alone. The experiments (Figure 8) reveal that, although the alternatives do work to some degree, their performance is significantly inferior.<br />
<br />
<center><br />
[[File:feudal_figure8.png|900px]]<br />
</center><br />
<br />
===Temporal resolution ablations===<br />
To test the effectiveness of the dilated LSTM, FuN was compared with two baselines: 1) the Manager uses a vanilla LSTM with no dilation; 2) FuN with the Manager's prediction horizon c = 1. The non-dilated LSTM fails catastrophically, most likely overwhelmed by the recurrent gradient. Reducing the horizon c to 1 hurt performance, although not dramatically, which suggests that even at high temporal resolution the Manager captures certain properties of the underlying MDP.<br />
<br />
<center><br />
[[File:feudal_figure10.png|900px]]<br />
</center><br />
<br />
===Intrinsic motivation weight===<br />
This evaluates the effect of the weight <math>\alpha</math>, which regulates the relative influence of the intrinsic reward. The figure below shows scatter plots of agents' final scores against the <math>\alpha</math> hyper-parameter; there is a clear improvement in score for high <math>\alpha</math> in some games.<br />
<br />
<center><br />
[[File:feudal_figure11.png|900px]]<br />
</center><br />
<br />
===Dilated LSTM agent baseline===<br />
For this experiment, just the dLSTM is used in an agent on top of a CNN, without the rest of the FuN structure. The figure below plots the learning curves for FuN, LSTM, and dLSTM agents; the dLSTM generally underperforms both the LSTM and FuN.<br />
<br />
<center><br />
[[File:feudal_figure12.png|900px]]<br />
</center><br />
<br />
===ATARI action repeat transfer===<br />
This experiment demonstrates an advantage of FeUdal Networks: separating the transition policy from primitive actions. It implies that the transition policy can be transferred between agents with different embodiments – for example, across agents with different action repeat on ATARI. The figure below shows the corresponding learning curves; the transferred FuN agent (green curve) significantly outperforms every other method.<br />
<br />
<center><br />
[[File:feudal_figure9.png|900px]]<br />
</center><br />
<br />
=Conclusion=<br />
FuN is a fully differentiable neural network with two levels of hierarchy, and among HRL methods it currently holds the state-of-the-art score on the Atari game Montezuma's Revenge. It is a novel approach to hierarchical reinforcement learning which separates goal-setting behavior from the generation of action primitives. Benefits of this architecture include better long-term credit assignment, longer memory, and the emergence of sub-policies as the Manager learns to select latent goals that maximize extrinsic reward, all of which are empirically demonstrated in the paper.<br />
<br />
Deeper hierarchies by setting goals at multiple time scales is an avenue for further research. The modular structure looks promising for transfer and multitask learning as well.<br />
<br />
An implementation of this paper can be found on [https://github.com/dmakian/feudal_networks Github].<br />
<br />
As an additional read, as per [https://web.stanford.edu/class/cs224n/reports/2762090.pdf this report], they have incorporated natural language instructions in the hierarchical RL model to beat the state of the art ATARI systems.<br />
<br />
=References=<br />
#Bacon, Pierre-Luc, Precup, Doina, and Harb, Jean. The option-critic architecture. In AAAI, 2017.<br />
#Bellemare, Marc, Srinivasan, Sriram, Ostrovski, Georg, Schaul, Tom, Saxton, David, and Munos, Remi. Unifying count-based exploration and intrinsic motivation. In NIPS, 2016.<br />
#Dayan, Peter and Hinton, Geoffrey E. Feudal reinforcement learning. In NIPS. Morgan Kaufmann Publishers,1993.<br />
#Sutton, Richard S, Precup, Doina, and Singh, Satinder. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 1999.<br />
#Mnih, Volodymyr, Badia, Adria Puigdomenech, Mirza, Mehdi, Graves, Alex, Lillicrap, Timothy P, Harley, Tim, Silver, David, and Kavukcuoglu, Koray. Asynchronous methods for deep reinforcement learning. ICML, 2016.<br />
#Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural computation, 1997.<br />
#Yu, Fisher and Koltun, Vladlen. Multi-scale context aggregation by dilated convolutions. ICLR, 2016.<br />
#Mozer, Michael C. A focused back-propagation algorithm for temporal pattern recognition. Complex systems, 1989.<br />
#Jaderberg, Max, Mnih, Volodymyr, Czarnecki, Wojciech Marian, Schaul, Tom, Leibo, Joel Z, Silver,David, and Kavukcuoglu, Koray. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.<br />
#A. S. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and K. Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. arXiv preprint arXiv:1703.01161, 2017.<br />
# https://www.quora.com/What-is-hierachical-reinforcement-learning<br />
# Tutorial for Hierarchial Reinforcement Learning: https://www.youtube.com/watch?v=K5MlmO0UJtI<br />
# Videos of FUN agent playing various Atari games can be found in supplementary file accessed through: http://proceedings.mlr.press/v70/vezhnevets17a.html<br />
#Gheshlaghi Azar, Mohammad; Munos, Remi; Ghavamzadeh, Mohammad; Kappen, Hilbert J. (2011). "Speedy Q-Learning". Advances in Neural Information Processing Systems. 24: 2411–2419.<br />
#https://en.wikipedia.org/wiki/Von_Mises%E2%80%93Fisher_distribution<br />
#https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-8-asynchronous-actor-critic-agents-a3c-c88f72a5e9f2</div>Jdenghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=FeUdal_Networks_for_Hierarchical_Reinforcement_Learning&diff=31749FeUdal Networks for Hierarchical Reinforcement Learning2017-12-03T05:54:38Z<p>Jdeng: /* Learning */</p>
<hr />
<div>= Introduction =<br />
<br />
Reinforcement learning (RL) is a facet of machine learning which inspired by behaviourist psychology wherein the algorithm is takes on the required actions to maximize the cumulative reward. Even though deep reinforcement learning has been hugely successful in a variety of domains, it has not been able to succeed in environments which have sparsely spaced reward signals. Take for instance the infamous<br />
Montezuma’s Revenge ATARI game ([https://www.reddit.com/r/MachineLearning/comments/45fa9o/why_montezuma_revenge_doesnt_work_in_deepmind/ a related discussion on Reddit]) which encounters a major challenge of long-term credit assignment. Essentially, the agent is not able to attribute a reward to an action taken several timesteps back. <br />
<br />
This paper proposes a hierarchical reinforcement learning architecture (HRL), called FeUdal Networks (FuN), which has been inspired by Feudal Reinforcement Learning (FRL)[3]. One of the main characteristics of FRL is that the goals can be generated in a top-down fashion, and goal setting can be decoupled from goal achievement. The level in the hierarchy communicates and delegates work to the level below it but doesn't specify how to do so. It is a fully-differentiable neural network with two levels of hierarchy – a Manager module at the top level and a Worker module below. The Manager sets abstract goals, which are learned, at a lower temporal resolution in a latent state-space. The Worker operates at a higher temporal resolution and produces primitive actions at every tick of the environment, motivated to follow the goals received from Manager, by an intrinsic reward.<br />
<br />
The key contributions of the authors in this paper are: <br />
# A consistent, end-to-end differentiable FRL inspired HRL;<br />
# A novel, approximate transition policy gradient update for training the Manager;<br />
# The use of goals that are directional rather than absolute in nature; <br />
# Dilated LSTM – a novel RNN design for the Manager that allows gradients to flow through large hops in time.<br />
<br />
The experiments conducted on several tasks which involve sparse rewards show that FuN significantly outperforms a strong baseline agent on tasks that involve long-term credit assignment and memorization.<br />
<br />
= Related Work =<br />
<br />
Several hierarchical reinforcement learning models were proposed to solve this problem. The options framework [4] considers the problem with a two-level hierarchy, with options being typically learned using sub-goals and ‘pseudo-rewards’ that are provided explicitly. Whereas, the option-critic architecture[1] uses the policy gradient theorem for learning options in an end-to-end fashion. A problem with learning options end-to-end is that they tend to a trivial solution where: (i) only one option is active, which solves the whole task; (ii) a policy-over-options changes options at every step, micro-managing the behavior. The authors state that the option-critic architecture is the only other end-to-end trainable system with sub-policies. A key difference between the authors' approach and the options framework is that the top level (manager) produces a meaningful and explicit goal for the bottom level (worker) to achieve.<br />
<br />
Non-hierarchical deep RL (non-HRL) methods using auxiliary losses and rewards such as pseudo count for exploration[2] have significantly improved results by stimulating agents to explore new parts of the state space. The UNREAL agent[9] is another non-HRL method that showed a strong improvement using unsupervised auxiliary tasks. The reason that such adjustment to conduct policy search is improving the learning results is because in some of the more complex environments, the process of learning policies that lead to onset of a reward is long and therefore harder to train the model. The authors called this sparsity of reward and they utilized auxiliary task of reward prediction, which means predicting the onset of immediate reward given some historical context.<br />
<br />
= Preliminaries = <br />
=== Long short-term memory ===<br />
The Long short-term memory network is a simple RNN which is often used an a building block of a larger recurrent network. An LSTM network consists of four main components: a cell, an input gate, an output gate, and a forget gate. The cell remembers values over arbitrary time intervals. The three gates are often interpreted as artificial neurons as in a MLP neural network, and the parameters related to the gates are also learnt during training. This network was designed to mimic short-term memory which can last for a long period of time. LSTMs are well suited for the classification, processing and prediction of time series given time lags of unknown size and duration. An LSTM structure is used for the Manager in this Reinforcement Learning paper.<br />
<br />
= Model =<br />
<br />
[[File:feudal_network_model_diagram.png|frame]]<br />
<br />
A high-level explanation of the model is as follows: <br />
<br />
The Manager computes a latent state representation <math>s_t</math> and outputs a goal vector <math>g_t</math> . The Worker outputs actions based on the environment observation, its own state, and the Manager’s goal. A perceptual module computes intermediate representation, <math>z_t</math> of the environment observation <math>x_t</math>, and is shared as input by both Manager and Worker. The Manager’s goals <math>g_t</math> are trained using an approximate transition policy gradient. The Worker is then trained via intrinsic reward which stimulates it to output actions that will achieve the goals set by the Manager.<br />
<br />
<center><br />
[[File:model_definition.png|500px]]<br />
</center><br />
<br />
Manager and Worker are recurrent networks (<math>{h^M}</math> and <math>{h^W}</math> being their internal states). <math>\phi</math> is a linear transform that maps a goal <math>g_t</math> into an embedding vector <math>w_t \in {R^k}</math> , which is then combined with matrix <math>U_t</math> (Worker's output) via a matrix-vector product to produce policy <math>\pi</math> – vector of probabilities over primitive actions. The projection <math>\phi</math> is linear, with no biases, and is learnt with gradients coming from the Worker’s actions.Since <math>\phi</math> has no biases it can never produce a constant non-zero vector – which is the only way the setup could ignore the Manager’s input. This makes sure that the goal output by the Manager always influences the final policy. In summary, my understanding of the model of goal embedding is: (i) Looking at this model wherein the state is inndependent of, the worker defines a set of partitions of the goal sphere via the embedding function; so each dimension now responds positively to one half of the sphere and negatively to the other half. (ii) Conditioned on the state, the worker then describes each action as a weighted sum of those partitions. Hence, given the state, the worker defines regions of the goal sphere where each action is more likely to be taken.<br />
<br />
===Learning===<br />
The learning considers a standard reinforcement learning setup where the goal of the agent is to maximize the discounted return <math>R_t = \sum_{k=0}^{&infin;} \gamma^k r_{t+k+1}</math>; where <math>\gamma \in [0,1]; r_t</math> is the reward from environment for action at timestep, <math>t</math>. The agent's behavior is defined by its action-selection policy, <math>\pi</math>.<br />
<br />
Since FuN is fully differentiable, the authors could have trained it end-to-end using a policy gradient algorithm operating on the actions taken by the Worker such the outputs <math>g</math> of the Manager would be trained by gradients coming from the Worker. This, however, would deprive Manager’s goals <math>g</math> of any semantic meaning, making them just internal latent variables of the model. So instead, Manager is independently trained to predict advantageous directions (transitions) in state space and to intrinsically reward the Worker to follow these directions.<br />
<br />
Update rule for manager:<br />
<br />
<br />
<center><br />
<math>\nabla g_t = A_t^M \nabla_\theta d_{cos}(s_{t+c} - s_t, g_t(\theta))</math><br />
</center><br />
<br />
<br />
In above equation, <math>d_{cos}(\alpha, \beta) = \alpha^T \beta/(|\alpha||\beta|)</math> is the cosine similarity between two vectors and <math>A_t^M = R_t - V_t^M(x_t,\theta)</math> is the Manager’s advantage function, computed using a value function estimate <math>V_t^M(x_t,\theta)</math> from the internal critic. Here c is an event horizon for the Manager to optimize its direction on. It must be treated as a hyperparameter of the model. It controls the temporal resolution of the Manager. Notice that the last c goals of manager are also first pooled by summation and then embedded into a vector. That makes the conditioning from the manager vary smoothly. <br />
<br />
The intrinsic reward that encourages the Worker to follow the goals are defined as:<br />
<br />
<br />
<center><br />
<math>R_t^I = 1/c \sum_{i=1}^c d_{cos}(s_t - s_{t-i}, g_{t-i})</math> <br />
</center><br />
<br />
<br />
Compared to FRL[3], which advocated concealing the reward from the environment from lower levels of the hierarchy, the Worker in FuN network is trained using an advantage actor-critic[5] to maximise a weighted sum <math>R_t + &alpha; R_t^I</math> , where <math>&alpha;</math> is a hyper-parameter that regulates the influence of the intrinsic reward:<br />
<br />
<br />
<center><br />
<math>\nabla {\pi}_t = A_t^D \nabla_\theta \log \pi (a_t|x_t;\theta)</math><br />
</center><br />
<br />
<br />
The advantage function <math>A_t^D = (R_t + \alpha R_t^I - V_t^D(x_t;\theta))</math> is calculated using an internal critic, which estimates the value functions for both rewards.<br />
<br />
The authors also note that the Worker and Manager can have different discount factors <math>\gamma</math> for computing returns. This allows the Worker to focus on more immediate rewards while the Manager makes decisions over a longer time horizon.<br />
<br />
===Transition Policy Gradient===<br />
The update rule for the Manager given above is a novel form of policy gradient with respect to a ''model'' of the Worker’s behavior. The Worker may follow a complex trajectory, but it is not necessary to learn from these samples directly: if the trajectories can be predicted by modeling the transitions, then the policy gradient of the predicted transition can be followed instead of the Worker's actual path. FuN assumes a particular form for the transition model: that the direction in state-space, <math>s_{t+c} - s_t</math>, follows a von Mises-Fisher distribution (a probability distribution on the (p-1)-dimensional sphere in R<sup>p</sup>; see [15] for more information).<br />
<br />
=Architecture=<br />
The perceptual module <math>f^{percept}</math> is a convolutional network (CNN) followed by a fully connected layer. Each convolutional and fully-connected layer is followed by a rectifier non-linearity. <math>f^{Mspace}</math>, which is another fully connected layer followed by a rectifier non-linearity, is used to compute the state space, which the Manager uses to formulate goals. The Worker’s recurrent network <math>f^{Wrnn}</math> is a standard LSTM[6].<br />
<br />
<br />
The Manager uses a novel architecture called a dilated LSTM (dLSTM), which operates at a lower temporal resolution than the data stream. It is similar to dilated convolutional networks[7] and the clockwork RNN. For a dilation radius r, the network is composed of r separate groups of sub-states or ‘cores’, denoted by <math>h = \{\hat{h}^i\}_{i=1}^r</math>. At time <math>t</math>, the network is governed by the following equations: <math>\hat{h}_t^{t\%r},g_t = LSTM(s_t, \hat{h}_{t-1}^{t\%r};\theta^{LSTM})</math>, where % denotes the modulo operation and indicates which group of cores is currently being updated. At each time step, only the corresponding part of the state is updated and the output is pooled across the previous c outputs. This allows the r groups of cores inside the dLSTM to preserve their memories for long periods, while the dLSTM as a whole is still able to process and learn from every input experience and to update its output at every step.<br />
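The core-scheduling idea of the dLSTM can be illustrated with a toy sketch. Here a simple tanh recurrence stands in for the actual LSTM cell, and summation is used for pooling; the class and variable names are illustrative assumptions, not the paper's code:

```python
import numpy as np

class DilatedRNN:
    """Toy sketch of dLSTM scheduling: r separate recurrent 'cores';
    at step t only core t % r is updated, and the output is pooled
    (here: summed) over the cores. A simple tanh cell stands in for
    the LSTM used in the paper."""

    def __init__(self, r, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.r = r
        self.W = rng.standard_normal((dim, dim)) * 0.1
        self.U = rng.standard_normal((dim, dim)) * 0.1
        self.cores = [np.zeros(dim) for _ in range(r)]  # h-hat^1 .. h-hat^r

    def step(self, s_t, t):
        i = t % self.r  # which core is updated at this tick
        self.cores[i] = np.tanh(self.W @ s_t + self.U @ self.cores[i])
        return sum(self.cores)  # pooled output g_t
```

Each core only sees every r-th input, so its memory persists over long horizons, yet the pooled output can still change at every step.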
<br />
=Experiments=<br />
The baseline the authors use is a recurrent LSTM[6] network on top of a representation learned by a CNN. The A3C method[5][16] is used for all reinforcement learning experiments. Backpropagation through time (BPTT)[8] is run after K forward passes of the network or when a terminal signal is received. For each method, 100 experiments were run. A training epoch is defined as one million observations. The authors seem to have ignored Deep Q-Learning when comparing performance results. Speedy Q-learning [14], a variant of the Q-learning algorithm, addresses the slow rate of convergence (when the discount factor <math>\gamma</math> is close to 1) and achieves a slightly better rate of convergence than other model-based methods. Comparisons with such methods could more fully assess the improvements of FeUdal Networks.<br />
<br />
==Montezuma’s Revenge==<br />
Montezuma’s Revenge is a prime example of an environment with sparse rewards. FuN starts learning much earlier and achieves much higher scores. It takes more than 300 epochs for the LSTM baseline to reach a score of 400, which corresponds to solving the first room (take the key, open a door). FuN solves the first room in less than 200 epochs and immediately moves on to explore further, eventually visiting several other rooms and scoring up to 2600 points.<br />
<br />
<center><br />
[[File:feudal_figure2.png|900px]]<br />
</center><br />
<br />
==ATARI==<br />
The experiment was run on a diverse set of ATARI games, some of which involve long-term credit assignment and some of which are more reactive. Enduro stands out, as all the LSTM agents completely fail at it. Frostbite is a hard game that requires both long-term credit assignment and good exploration. The best-performing Frostbite agent is FuN with a 0.95 Manager discount, which outperforms the rest by a factor of 7. The other results can be seen in the figure.<br />
<br />
<center><br />
[[File:feudal_figure4.png|900px]]<br />
</center><br />
<br />
==Comparing the option-critic architecture==<br />
The FuN network was run on the same games as the option-critic architecture (Asterix, Ms. Pacman, Seaquest, and Zaxxon); after 200 epochs it achieves a similar score on Seaquest, doubles the score on Ms. Pacman, more than triples it on Zaxxon, and obtains a more than 20x improvement on Asterix.<br />
<br />
<center><br />
[[File:feudal_figure7.png]]<br />
</center><br />
<br />
==Memory in Labyrinth==<br />
DeepMind Lab (Beattie et al., 2016) is a first-person 3D game platform extended from OpenArena. The experiments were run on a water maze, a T-maze, and Non-match (a visual memorization task). FuN consistently outperforms the LSTM baseline – it learns faster and also reaches a higher final reward. Interestingly, the LSTM agent doesn’t appear to use its memory for the water maze task at all, always circling the maze at roughly the same radius.<br />
<br />
<center><br />
[[File:feudal_figure5.png|800px]]<br />
[[File:feudal_figure6.png|800px]]<br />
</center><br />
<br />
==Ablative Analysis==<br />
Empirical evaluation of the main contributions of this paper:<br />
<br />
===Transition policy gradient===<br />
Experiments were run on modified FuN networks in which: 1) the Manager's output <math>g</math> is trained with gradients coming directly from the Worker and no intrinsic reward is used; 2) <math>g</math> is learned using a standard policy gradient approach, with the Manager emitting the mean of a Gaussian distribution from which goals are sampled; 3) <math>g</math> specifies absolute, rather than relative/directional, goals; and 4) a purely feudal version of FuN, in which the Worker is trained from the intrinsic reward alone. The experiments (Figure 8) reveal that, although the alternatives do work to some degree, their performance is significantly inferior.<br />
<br />
<center><br />
[[File:feudal_figure8.png|900px]]<br />
</center><br />
<br />
===Temporal resolution ablations===<br />
To test the effectiveness of the dilated LSTM, FuN was compared with two baselines: 1) the Manager uses a vanilla LSTM with no dilation; 2) FuN with the Manager's prediction horizon c = 1. The non-dilated LSTM fails catastrophically, most likely overwhelmed by the recurrent gradient. Reducing the horizon c to 1 hurt the performance, although not by much, which suggests that even at high temporal resolution the Manager captures certain properties of the underlying MDP.<br />
<br />
<center><br />
[[File:feudal_figure10.png|900px]]<br />
</center><br />
<br />
===Intrinsic motivation weight===<br />
Evaluates the effect of the weight <math>\alpha</math>, which regulates the relative influence of the intrinsic reward. The figure below shows scatter plots of the agents' final scores vs. the <math>\alpha</math> hyper-parameter; there is a clear improvement in score for high <math>\alpha</math> in some games.<br />
<br />
<center><br />
[[File:feudal_figure11.png|900px]]<br />
</center><br />
<br />
===Dilated LSTM agent baseline===<br />
For this experiment, just the dLSTM is used in an agent on top of a CNN, without the rest of FuN structures. Figure below plots the learning curves for FuN, LSTM, and dLSTM agents. dLSTM generally underperforms both LSTM and FuN.<br />
<br />
<center><br />
[[File:feudal_figure12.png|900px]]<br />
</center><br />
<br />
===ATARI action repeat transfer===<br />
This experiment demonstrates an advantage of the FeUdal Network: the separation of the transition policy from primitive actions. It implies that the transition policy can be transferred between agents with different embodiments, for example, across agents with different action repeats on ATARI. The figure below shows the corresponding learning curves. The transferred FuN agent (green curve) significantly outperforms every other method.<br />
<br />
<center><br />
[[File:feudal_figure9.png|900px]]<br />
</center><br />
<br />
=Conclusion=<br />
FuN is a fully differentiable neural network with two levels of hierarchy, and it currently holds the state-of-the-art score among HRL methods on the Atari game Montezuma's Revenge. It is a novel approach to hierarchical reinforcement learning which separates goal-setting behaviour from the generation of action primitives. Benefits of this architecture include better long-term credit assignment, longer memory, and the emergence of sub-policies as the Manager learns to select latent goals that maximise extrinsic reward, all of which have been empirically shown in the paper.<br />
<br />
Setting goals at multiple time scales to build deeper hierarchies is an avenue for further research. The modular structure also looks promising for transfer and multitask learning.<br />
<br />
An implementation of this paper can be found on [https://github.com/dmakian/feudal_networks Github].<br />
<br />
As an additional read, [https://web.stanford.edu/class/cs224n/reports/2762090.pdf this report] incorporates natural language instructions into the hierarchical RL model to beat state-of-the-art ATARI systems.<br />
<br />
=References=<br />
#Bacon, Pierre-Luc, Precup, Doina, and Harb, Jean. The option-critic architecture. In AAAI, 2017.<br />
#Bellemare, Marc, Srinivasan, Sriram, Ostrovski, Georg, Schaul, Tom, Saxton, David, and Munos, Remi. Unifying count-based exploration and intrinsic motivation.In NIPS, 2016a.<br />
#Dayan, Peter and Hinton, Geoffrey E. Feudal reinforcement learning. In NIPS. Morgan Kaufmann Publishers, 1993.<br />
#Sutton, Richard S, Precup, Doina, and Singh, Satinder. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 1999.<br />
#Mnih, Volodymyr, Badia, Adria Puigdomenech, Mirza, Mehdi, Graves, Alex, Lillicrap, Timothy P, Harley, Tim, Silver, David, and Kavukcuoglu, Koray. Asynchronous methods for deep reinforcement learning. ICML, 2016.<br />
#Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural computation, 1997.<br />
#Yu, Fisher and Koltun, Vladlen. Multi-scale context aggregation by dilated convolutions. ICLR, 2016.<br />
#Mozer, Michael C. A focused back-propagation algorithm for temporal pattern recognition. Complex systems, 1989.<br />
#Jaderberg, Max, Mnih, Volodymyr, Czarnecki, Wojciech Marian, Schaul, Tom, Leibo, Joel Z, Silver, David, and Kavukcuoglu, Koray. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.<br />
#A. S. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and K. Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. arXiv preprint arXiv:1703.01161, 2017.<br />
# https://www.quora.com/What-is-hierachical-reinforcement-learning<br />
# Tutorial for Hierarchical Reinforcement Learning: https://www.youtube.com/watch?v=K5MlmO0UJtI<br />
# Videos of FUN agent playing various Atari games can be found in supplementary file accessed through: http://proceedings.mlr.press/v70/vezhnevets17a.html<br />
#Gheshlaghi Azar, Mohammad; Munos, Remi; Ghavamzadeh, Mohammad; Kappen, Hilbert J. (2011). "Speedy Q-Learning". Advances in Neural Information Processing Systems. 24: 2411–2419.<br />
#https://en.wikipedia.org/wiki/Von_Mises%E2%80%93Fisher_distribution<br />
#https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-8-asynchronous-actor-critic-agents-a3c-c88f72a5e9f2</div>Jdeng

http://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_the_Number_of_Neurons_in_Deep_Networks&diff=31748 Learning the Number of Neurons in Deep Networks 2017-12-03T05:20:31Z <p>Jdeng: /* References */</p>
<hr />
<div>='''Introduction'''=<br />
<br />
Due to the availability of massive datasets and powerful computational infrastructure, '''Deep Learning''' has made huge breakthroughs in many areas, such as language modelling and computer vision. In essence, deep learning algorithms are a re-branding of neural networks from the 1950s, with multiple processing layers whose computation has now become feasible thanks to GPU power. Each processing layer (i.e. hidden layer) learns one level of abstraction of the data. This does not mean that we need numerous layers; the goal is to find the right number of layers such that the model generalizes without over-fitting. In deep neural networks, we need to determine the number of layers and the number of neurons in each layer, i.e., the number of parameters, or complexity, of the model. Currently, this is mostly achieved by manually tuning these hyper-parameters using validation data or by building very deep networks. However, building a very deep model is challenging, especially for very large datasets, as it incurs high memory costs and reduces speed.<br />
<br />
In this paper, the authors used an approach to automatically select the number of neurons in each layer while the network is learned; a task which has mostly been done through trial and error so far. Their approach introduces a '''group sparsity regularizer''' on the parameters of the network, where each group acts on the parameters of one neuron, rather than training an initial network as a pre-processing step (training shallow or thin networks to mimic the behaviour of deep ones [Hinton et al., 2014, Romero et al., 2015]) and reducing neurons later as a post-processing step. Parameters that prove useless are set to zero, which cancels out the effect of the corresponding neuron. Therefore, the approach does not need to first train a redundant network and then reduce its parameters; instead, it learns the number of relevant neurons in each layer and the parameters of those neurons simultaneously.<br />
<br />
In experiments on several image recognition datasets, the authors showed the effectiveness of this approach, which reduces the number of parameters by up to 80% compared to the complete model with no loss in recognition accuracy. The resulting networks are also faster and occupy less memory.<br />
<br />
='''Related Work'''=<br />
<br />
Recent research tends to produce very deep networks. Building very deep networks means learning more parameters, which leads to significant memory costs as well as a reduction in training speed. Even though automatic model selection has developed in recent years via constructive and destructive approaches, both have drawbacks. A '''constructive method''' starts with a very shallow architecture and then adds parameters [Bello, 1992]; similar work that adds new layers to an initially shallow network during learning was successfully employed in [Simonyan and Zisserman, 2014]. However, shallow networks have fewer parameters and cannot handle non-linearities as effectively as deep networks [Montufar et al., 2014], so they may easily get stuck in bad optima. The drawback of this method is therefore that these networks may produce poor initializations for the later stages (the authors make this claim without providing evidence for it). A '''destructive method''' starts with a deep network and then removes a significant number of redundant parameters [Denil et al., 2013, Cheng et al., 2015] while keeping its behaviour unchanged. Although this technique has been shown to remove redundant parameters [LeCun et al., 1990, Hassibi et al., 1993] or neurons [Mozer and Smolensky, 1988, Ji et al., 1990, Reed, 1993] that have little influence on the output, it requires analyzing each parameter or neuron via the network Hessian, which is very computationally expensive for large architectures. The main motivation of these works was to build a more compact network. Recent destructive approaches focus on learning a shallower or thinner network that mimics the behaviour of an initial deeper network.<br />
<br />
In particular, building a compact network is a research focus for '''Convolutional Neural Networks''' (CNNs). Some works have proposed to decompose the filters of a pre-trained network into low-rank filters, which reduces the number of parameters [Jaderberg et al., 2014b, Denton et al., 2014, Gong et al., 2014]. The issue with this proposal is that an initial deep network must first be trained successfully, since the decomposition acts as a post-processing step. [Weigend et al., 1991] and [Collins and Kohli, 2014] used direct training to develop regularizers that eliminate some of the parameters of the network; the problem is that the number of layers and neurons in each layer is still determined manually. A very similar work using the group lasso method for CNNs was previously done in [Liu et al., 2015]. The big-picture idea appears to be very similar, but the methods differ in detail: [Liu et al., 2015] involves computing the network Hessian, repeated multiple times over the learning process. This is computationally expensive when dealing with large-scale datasets, and as a consequence such techniques are no longer pursued in the current large-scale era.<br />
<br />
='''Model Training and Model Selection'''=<br />
<br />
In general, a deep network has L layers containing linear operations on their inputs, intertwined with non-linear activation functions such as '''Rectified Linear Units (ReLU) or sigmoids''' and,<br />
potentially, pooling operations. Suppose each layer l has $N_{l}$ neurons, and each of them has parameters $\Theta=(\theta_{l})_{1\leqslant{l}\leqslant{L}}$, where $\theta_{l}=({\theta^n _{l}})_{1\leqslant{n}\leqslant{N_{l}}}$ and where $\theta^n _{l}=[w_{l}^{n},b_{l}^{n}]$. Here, $w_{l}^{n}$ is a linear operator acting on the layer’s input and $b_{l}^{n}$ is a bias. Given an input $x$, under the linear, non-linear and pooling operations, we obtain the output $\hat{y}=f(x,\Theta)$, where $f(*)$ encodes the succession of linear, non-linear and pooling operations.<br />
<br />
At the step of training, we have N input-output pairs ${(x_{i},y_{i})}_{1\leqslant{i}\leqslant{N}}$, and the loss function is given by $\ell(y_{i},f(x_{i},\Theta))$, which compares the predicted output with the ground-truth output. Generally, we choose logistic loss for classification and the square loss for regression. Therefore, learning the parameters of the network is equivalent to solving the optimization of the following:<br />
$$\displaystyle \min_{\Theta}\frac{1}{N}\sum_{i=1}^{N}\ell(y_{i},f(x_{i},\Theta))+\gamma(\Theta),$$ where $\gamma(\Theta)$ represents a regularizer on the network parameters. Choices for such a regularizer include weight-decay, i.e., $\gamma(\cdot)$ is the (squared) $\ell_{2}$-norm, or sparsity-inducing norms, e.g., the $\ell_{1}$-norm. The goal of this paper is to automatically determine the number of neurons of each layer, but neither of the above techniques achieves this goal. Here, we make use of '''group sparsity''' (GS) [Yuan and Lin, 2007] (starting from an overcomplete network and canceling the influence of some neurons). The regularizer, therefore, can be written as $$\gamma(\Theta)=\sum_{l=1}^{L}\beta_{l}\sqrt{P_{l}}\sum_{n=1}^{N_{l}}||\theta_{l}^{n}||_{2},$$ where $P_{l}$ is the size of the vector containing the parameters of each neuron in layer $l$, and $\beta_{l}$ balances the influence of the penalty. In practice, the most effective choice of $\beta_{l}$ was found to be a relatively small weight for the first few layers and a larger weight for the remaining layers. The small weight prevents deleting too many neurons in the first few layers, so that enough information is retained for learning the remaining parameters. The original premise of this paper seemed to suggest a new method that was different from both the constructive and destructive methods described above. However, this approach of starting with an overcomplete network and training with group sparsity appears to be no different from destructive methods. The main contribution here is then the regularization function acting on entire neurons, which is in fairness an interesting approach.<br />
<br />
Group sparsity helps us effectively remove some of the neurons, while standard regularizers on the individual parameters are effective for generalization [Bartlett, 1996, Krogh and Hertz, 1992, Theodoridis, 2015, Collins and Kohli, 2014]. Following this idea, the authors introduce the '''sparse group Lasso''' (SGL), a more general penalty that merges the L1 norm of the Lasso with the group Lasso (i.e. "two-norm") penalty. This yields a penalty whose solutions are sparse at both the individual-feature and group levels [1]. The regularizer can be written as $$\gamma(\Theta)=\sum_{l=1}^{L}((1-\alpha)\beta_{l}\sqrt{P_{l}}\sum_{n=1}^{N_{l}}||\theta_{l}^{n}||_{2}+\alpha\beta_{l}||\theta_{l}||_{1})$$ where $\alpha\in[0,1]$. If $\alpha=0$, we recover the group sparsity regularizer. In practice, both $\alpha=0$ and $\alpha=0.5$ are used in the experiments.<br />
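As a rough illustration of the penalty, here is a minimal numpy sketch of the per-layer regularizer; representing one layer's parameters as a list of per-neuron vectors is an assumption made for exposition:

```python
import numpy as np

def sparse_group_lasso(theta_groups, beta, alpha):
    """gamma for one layer: a weighted sum of a group-sparsity term
    (l2 norm of each neuron's parameter vector, scaled by sqrt(P_l))
    and a plain l1 term. alpha = 0 recovers the pure group-sparsity
    regularizer; theta_groups is a list of per-neuron vectors."""
    P = len(theta_groups[0])  # parameters per neuron, P_l
    group_term = np.sqrt(P) * sum(np.linalg.norm(g) for g in theta_groups)
    l1_term = sum(np.abs(g).sum() for g in theta_groups)
    return (1 - alpha) * beta * group_term + alpha * beta * l1_term
```

Summing this quantity over all layers gives the full regularizer $\gamma(\Theta)$ added to the loss.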
<br />
This reminds me of the relationships among Lasso regression, Ridge regression and Elastic Net regression (explained in Hastie et al., [https://web.stanford.edu/~hastie/Papers/ESLII.pdf The Elements of Statistical Learning], section 3.4). In Lasso regression, the penalized residual sum of squares is composed of the regular residual sum of squares plus an L1 regularizer. In Ridge regression, the penalized residual sum of squares is composed of the regular residual sum of squares plus an L2 regularizer. Finally, Elastic Net regression is a combination of the Lasso and Ridge regularizers, where the objective function optimizes parameters by including both L1 and L2 norms. <br />
<br />
To solve this optimization problem, the paper uses proximal gradient descent [Parikh and Boyd, 2014]. This approach iteratively takes a gradient step of size t with respect to the loss. The algorithm is as follows: <br />
<br />
We define the proximal operator of $f$ with step size $t$ as $$prox_{tf}(v)=\displaystyle \arg\min_{x}(\frac{1}{2t}||x-v||_{2}^{2}+f(x))$$ <br />
<br />
<br />
Suppose we want to minimize $f(x)+g(x)$, and the proximal gradient method is given by $$x^{(k+1)}=prox_{t^{k}g}(x^{k}-t^{k}\nabla{f}(x^{k})), k=1,2,3...$$ <br />
<br />
Therefore, we can update the parameters by the above method as $$\tilde{\theta}_{l}^{n}=\displaystyle \arg\min_{\theta_{l}^{n}}\frac{1}{2t}||\theta_{l}^{n}-\hat{\theta}_{l}^{n}||_{2}^{2}+\gamma(\Theta),$$<br />
where $\hat{\theta}_{l}^{n}$ is the solution obtained from the loss-gradient step. Following the derivation of [Simon et al., 2013], this problem has the closed-form solution <br />
$$\tilde{\theta}_{l}^{n}=\Big(1-\frac{t(1-\alpha)\beta_{l}\sqrt{P_{l}}}{||S(\hat{\theta}_{l}^{n},t\alpha\beta_{l})||_{2}}\Big)_{+}S(\hat{\theta}_{l}^{n},t\alpha\beta_{l}),$$<br />
where $(\cdot)_{+}$ refers to taking the maximum between the argument and 0, and $S(\cdot)$ is the soft-thresholding operator $$S(a,b)=sign(a)(|a|-b)_{+}$$<br />
In practice, stochastic gradient descent with mini-batches is used, updating the variables of all the groups according to the closed form of $\tilde{\theta}_{l}^{n}$. When training terminates, the neurons whose parameters have gone to zero are removed. Additionally, in fully-connected layers, the neurons acting on the output of zeroed-out neurons of the previous layer also become useless and are removed accordingly.<br />
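The closed-form proximal step can be sketched directly from the formula above. A minimal numpy version for a single neuron's parameter vector (function names are illustrative, not the authors' code):

```python
import numpy as np

def soft_threshold(a, b):
    """S(a, b) = sign(a) * (|a| - b)_+, applied elementwise."""
    return np.sign(a) * np.maximum(np.abs(a) - b, 0.0)

def prox_update(theta_hat, t, alpha, beta):
    """Closed-form proximal step for one neuron's parameter vector
    under the sparse-group-lasso penalty [Simon et al., 2013].
    theta_hat is the vector after the loss-gradient step."""
    P = len(theta_hat)
    s = soft_threshold(theta_hat, t * alpha * beta)
    norm = np.linalg.norm(s)
    if norm == 0.0:  # the whole neuron is zeroed out
        return np.zeros_like(theta_hat)
    scale = max(1.0 - t * (1 - alpha) * beta * np.sqrt(P) / norm, 0.0)
    return scale * s
```

When the group-shrinkage factor reaches zero, the entire neuron's parameter vector vanishes, which is exactly how whole neurons are cancelled.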
<br />
='''Experiment'''=<br />
<br />
==='''Set Up'''===<br />
<br />
They use two large-scale image classification datasets, '''ImageNet''' [Russakovsky et al., 2015] and '''Places2-401''' [Zhou et al., 2015]. They also conducted additional experiments on the '''ICDAR''' character recognition dataset of [Jaderberg et al., 2014a]. <br />
<br />
For ImageNet, they used the subset containing 1000 categories, with 1.2 million training images and 50,000 validation images. Places2-401 has more than 10 million images in 401 unique scene categories, with 5,000 to 30,000 images per category. The architectures for both datasets are based on the VGG-B network (BNet) [Simonyan and Zisserman, 2014] and on DecomposeMe8 ($Dec_{8}$) [Alvarez and Petersson, 2016]. BNet has 10 convolutional layers followed by 3 fully-connected layers. In the experiments, the first 2 fully-connected layers are removed, yielding $BNet^{C}$ (this is shown to reduce the number of parameters while maintaining the accuracy of the original network). $Dec_{8}$ contains 16 convolutional layers with 1D kernels, which model 8 2D convolutional layers. Both models were trained for a total of 55 epochs with 12,000 batches per epoch and a batch size of 48 and 180 for BNet and $Dec_{8}$, respectively. The learning rate was initialized to 0.01 and then multiplied by 0.1. They set $\beta_{l}$=0.102 for the first three layers and $\beta_{l}$=0.255 for the remaining ones.<br />
<br />
The ICDAR dataset consists of 185,639 training and 5,198 test samples split into 36 categories. The architecture here starts with 6 1D convolutional layers with max-pooling, rather than 3 convolutional layers with a maxout layer [Goodfellow et al., 2013] after each convolution, and is followed by one fully-connected layer. They call this architecture $Dec_{3}$. The model was trained for a total of 45 epochs with a batch size of 256 and 1,000 iterations per epoch. The learning rate was initialized to 0.1 and multiplied by 0.1 at the second, seventh and fifteenth epochs. They set $\beta_{l}$=5.1 for the first layer and $\beta_{l}$=10.2 for the remaining ones.<br />
<br />
==='''Results'''===<br />
<br />
[[File:imageNet.png]]<br />
<br />
The above table shows the accuracy comparison between the original architectures and the proposed ones. For $Dec_{8}$ on the ImageNet dataset, two additional models were evaluated: $Dec_{8}-640$ with 640 neurons per layer and $Dec_{8}-768$ with 768 neurons per layer. $Dec_{8}-640_{SGL}$ denotes the sparse group Lasso regularizer with $\alpha=0.5$, and $Dec_{8}-640_{GS}$ the group sparsity regularizer. Note that all the architectures yield an improvement over the original network except $Dec_{8}-768$. For instance, Ours-$BNet_{GS}^{C}$ improves performance by 1.6% compared to $BNet^{C}$. <br />
<br />
[[File:44.png]]<br />
<br />
[[File:2.png]]<br />
<br />
The above figures report the percentage of neurons/parameters removed by the approach for $BNet^{C}$ and $Dec_{8}$. For example, in the first figure, the approach reduces the number of neurons by over 12% and the number of parameters by around 14%, while improving generalization by 1.6% (as indicated by the accuracy gap). The left image in the first figure also shows that the reduction in the number of neurons is spread across all the layers, with the largest difference in L10. For $Dec_{8}$, in the second figure, the benefits of the approach become more significant as the number of neurons in each layer increases. For instance, $Dec_{8}-640$ with the group sparsity regularizer reduces the number of neurons by 10% and the number of parameters by 12.48%. The left image in the second figure likewise shows that the reduction in the number of neurons is spread across all the layers. <br />
<br />
[[File:ICDA.png]]<br />
<br />
Finally, the above figure shows the experimental results for the ICDAR dataset. Here, the $Dec_{3}$ architecture was used, where the last two layers initially contain 512 neurons. The accuracy for $MaxPool_{2Dneurons}$ is 83.8% and for $Dec_{3}$ it is 89.3%, which means 1D filters perform better than a network with 2D kernels. On this dataset, the model removes 38.64% of the neurons and up to 80% of the parameters with the group sparsity regularizer.<br />
<br />
All the above results show that the algorithm effectively reduces the number of parameters while increasing model accuracy; the automatic model selection performs effectively on classification tasks.<br />
<br />
='''Analysis on Testing'''=<br />
<br />
The algorithm does not remove neurons during training; instead, the zeroed-out neurons are removed after training, which yields a smaller network at test time. This not only reduces the number of parameters of the network but also decreases the computational memory cost and increases speed. <br />
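Removing zeroed-out neurons amounts to dropping rows of a layer's weight matrix (and bias entries) and the matching input columns of the next layer. A hypothetical numpy sketch for a pair of fully-connected layers (the function and variable names are assumptions for illustration):

```python
import numpy as np

def prune_layer(W, b, W_next, tol=1e-12):
    """Drop neurons (rows of W and entries of b) whose parameters have
    gone to zero, and remove the corresponding input columns of the
    next layer's weight matrix W_next. Toy fully-connected sketch."""
    keep = np.array([np.linalg.norm(np.append(W[n], b[n])) > tol
                     for n in range(W.shape[0])])
    return W[keep], b[keep], W_next[:, keep]
```

The pruned network computes exactly the same function at test time, since the removed neurons contributed nothing to the output.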
<br />
[[File:table2.png]]<br />
<br />
The above table reports the runtime and memory, as well as the percentage of parameters removed after deleting the zeroed-out neurons. BNet and $Dec_{8}$ were tested on ImageNet, while $Dec_{3-GS}$ was tested on ICDAR. From the table, all the models for ImageNet and ICDAR show faster runtimes; for example, $Dec_{8}-768_{GS}$ on ImageNet speeds up the runtime by nearly 16% at a batch size of 8, and $Dec_{3}$ on ICDAR speeds up by nearly 50% at a batch size of 16. For the percentage of parameters removed, BNet, $Dec_{8}-640_{GS}$ and $Dec_{8}-768_{GS}$ remove 12.06%, 26.51%, and 46.73% respectively. More significantly, $Dec_{3-GS}$ removes 82.35% of the parameters. All of these changes show the benefits at test time. The runtimes were obtained using a single Tesla K20m, and memory estimations used RGB images of size 224 × 224 for Ours-BNet, Ours-Dec8-640_GS and Ours-Dec8-768_GS, and grey-level images of size 32 × 32 for Ours-Dec3-GS.<br />
<br />
='''Conclusion'''=<br />
<br />
In this paper, the authors have introduced an approach that relies on a group sparsity regularizer. This approach automatically determines the number of neurons in each layer of a deep network. From the experiments, they found the approach not only reduces the number of parameters of the model but also saves computational memory and increases speed at test time. However, a limitation of the approach is that the number of layers in the network remains fixed.<br />
<br />
='''Critique'''=<br />
The authors of the paper state that "...we assume that the parameters of each neuron in layer $l$ are grouped in a vector of size $P_{l}$ and where $\lambda_{l}$ sets the influence of the penalty. Note that, in the general case, this weight can be different for each layer $l$. In practice, however, we found most effective to have two different weights: a relatively small one for the first few layers, and a larger weight for the remaining ones. This effectively prevents killing too many neurons in the first few layers, and thus retains enough information for the remaining ones." However, the authors fail to present any guidance as to what counts as "the first few layers" and what the relative sizes of the two weights should be even after the "first few layers" have been chosen. Indeed, such a choice seems to be an unaccounted component of tuning the model, but this receives scant attention in the current paper. Several numerical comparisons should be carried out to allow further discussion of this question.<br />
<br />
The parameter $\beta_l$ is important for the performance of the network, but the authors do not provide enough details on how to tune it, e.g., via cross-validation or otherwise. The performance of the model under various settings of $\beta_l$ would be interesting and important for understanding the robustness of this method. <br />
<br />
The experiments could have included better baseline models to compare against. For example, how do we know the original model was not overly complex to begin with? It might have been a good idea for the authors to compare their group sparse lasso method against the naive method of (blindly) reducing the number of neurons in each layer by 10-20% just for a very preliminary check. On top of that, authors could have compared to conventional L1 and L2 regularization which can reduce the number of non-zero parameters, as well as other techniques such as making setting small weight values to zero and performing fine tuning as done in https://www.microsoft.com/en-us/research/publication/exploiting-sparseness-in-deep-neural-networks-for-large-vocabulary-speech-recognition/. Also, the author could have applied the theory of ridge and Lasso regression to analyze the effect of the regularization mathematically.<br />
<br />
A rather reliable method of experimentation to compare the performance and accuracy has been left out. The authors have not stated any comparisons of this method with the Dropout method [Srivastava,2014], which is similar in terms of the physical effects on the network. The authors state that: "[...] Note that none of the standard regularizers mentioned above achieve this goal: The former favors small parameter values, and the latter tends to cancel out individual parameters, but not complete neurons." This draws a direct comparison to regularizers, ignoring that dropout methods exactly remove complete neurons.<br />
<br />
It would have been interesting to see the performance gain on real time applications such as YOLO or SSD object detectors that are being used in self-driving cars by incorporating the approach presented by the paper into its convolution neural nets. Meanwhile, as an interesting extension, it would be better if the authors could test this group sparse regularization in deep reinforcement learning, where a convolution neural network is used to predict the reward.<br />
<br />
As an important property of regularizer, the influence of the group sparse regularization on avoiding overfitting is yet unknown. The number of epochs increases or decreases after applying this regularization to achieve the same accuracy can be further studied.<br />
<br />
It seems as though the authors' claim that their approach "automatically determines the number of neurons" is overstated at best. In reality, this approach can find redundancy is an overspecified model, which provides the benefit of size reduction as outlined. This provides non-trivial benefits, but it has no way of addressing the (albeit less likely) issue of an underspecified model. In conjunction with the fact that the number of layers must remain fixed makes, this method has a feel of smart regularization, as opposed to size learning. Coupled together with the lack of dropout comparison leaves doubts regarding the efficacy of this technique for model specification. If a model must be intentionally over-specified to learn the parameters, then it is hard to claim memory reduction benefits vis-a-vis any technique stemming from an underspecified model. In any case, this may serve as an efficient technique for many of the networks used practically today which are designed to be extraordinarily massive, but labelling it a means of sizing a network is erroneous.<br />
<br />
It is very strange that the parameters of each neuron in layer are grouped in the group sparsity. According to (Friedman et al. 2010), the group sparsity acts like the lasso at the group level : an entire group of predictors may drop out of the model. It means that the whole neuron in a layer may drop out, and the connection of layers in the neural network is broken.<br />
<br />
='''References'''=<br />
<br />
P. L. Bartlett. For valid generalization the size of the weights is more important than the size of the network. In NIPS, 1996.<br />
<br />
M. G. Bello. Enhanced training algorithms, and integrated training/architecture selection for multilayer perceptron networks. IEEE Transactions on Neural Networks, 3(6):864–875, Nov 1992.<br />
<br />
Yu Cheng, Felix X. Yu, Rogério Schmidt Feris, Sanjiv Kumar, Alok N. Choudhary, and Shih-Fu Chang. An exploration of parameter redundancy in deep networks with circulant projections. In ICCV, 2015.<br />
<br />
I. J. Goodfellow, D. Warde-farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. In ICML, 2013.<br />
<br />
G. E. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. In arXiv, 2014.<br />
<br />
M. Jaderberg, A. Vedaldi, and A. Zisserman. Deep features for text spotting. In ECCV, 2014a.<br />
<br />
M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. In BMVC, 2014b.<br />
<br />
N. Simon, J. Friedman, T. Hastie, and R. Tibshirani. A sparse-group lasso. Journal of Computational and Graphical Statistics, 2013.<br />
<br />
H. Zhou, J. M. Alvarez, and F. Porikli. Less is more: Towards compact CNNs. In ECCV, 2016.<br />
<br />
Group LASSO - https://pdfs.semanticscholar.org/f677/a011b2a912e3c5c604f6872b9716cc0b8aa0.pdf<br />
<br />
Liu, Baoyuan, et al. "Sparse convolutional neural networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.<br />
<br />
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1 (January 2014), 1929-1958.<br />
<br />
Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. "A note on the group lasso and a sparse group lasso." arXiv preprint arXiv:1001.0736 (2010).<br />
<br />
Derivation & Motivation of the Soft Thresholding Operator (Proximal Operator):<br />
# http://www.onmyphd.com/?p=proximal.operator<br />
# https://math.stackexchange.com/questions/471339/derivation-of-soft-thresholding-operator</div>Jdenghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_the_Number_of_Neurons_in_Deep_Networks&diff=31747Learning the Number of Neurons in Deep Networks2017-12-03T05:19:37Z<p>Jdeng: /* Critique */</p>
<hr />
<div>='''Introduction'''=<br />
<br />
Thanks to the availability of massive datasets and powerful computational infrastructure, '''Deep Learning''' has made huge breakthroughs in many areas, such as language modelling and computer vision. In essence, deep learning algorithms revive the neural networks of the 1950s, stacking many processing layers whose training has only recently become feasible thanks to GPU computing. Each hidden layer learns one level of abstraction of the data; this does not mean that more layers are always better. The goal is to find a model capacity at which the network generalizes without over-fitting. In deep neural networks, we therefore need to determine the number of layers and the number of neurons in each layer, i.e., the number of parameters, or complexity, of the model. Currently, this is mostly achieved by manually tuning these hyper-parameters on validation data, or simply by building very deep networks. However, building a very deep model is challenging, especially for very large datasets, as it incurs a high memory cost and a reduction in speed.<br />
<br />
In this paper, the authors propose an approach that automatically selects the number of neurons in each layer while the network is being learned; a task which has mostly been done through trial and error so far. Their approach introduces a '''group sparsity regularizer''' on the parameters of the network, where each group acts on the parameters of one neuron. This contrasts with methods that train an initial network as a pre-processing step (e.g., training shallow or thin networks to mimic the behaviour of deep ones [Hinton et al., 2014, Romero et al., 2015]) and prune neurons afterwards. The regularizer drives the parameters of superfluous neurons to zero, which cancels out the effect of those neurons. Therefore, the approach never needs to successfully train a redundant network and then reduce it; instead, it learns the number of relevant neurons in each layer and the parameters of those neurons simultaneously.<br />
<br />
In experiments on several image recognition datasets, the authors show the effectiveness of this approach, which reduces the number of parameters by up to 80% compared to the complete model, with no loss in recognition accuracy. The resulting networks are in fact smaller, faster at test time, and occupy less memory.<br />
<br />
='''Related Work'''=<br />
<br />
Recent research tends to produce very deep networks. Building very deep networks means learning more parameters, which leads to significant memory costs as well as a reduction in training speed. Although automatic model selection has been developed over the past years through constructive and destructive approaches, both have drawbacks. A '''constructive method''' starts from a very shallow architecture and then adds parameters [Bello, 1992]; similarly, new layers can be added to an initially shallow network during learning [Simonyan and Zisserman, 2014]. However, shallow networks have fewer parameters and cannot handle non-linearities as effectively as deep networks [Montufar et al., 2014], so they may easily get stuck in bad optima. The drawback of this method is therefore that the shallow networks may produce poor initializations for the later stages (a claim the authors make without providing any evidence for it). A '''destructive method''' starts with a deep network and then removes a significant number of redundant parameters [Denil et al., 2013, Cheng et al., 2015] while keeping its behaviour unchanged. Although such techniques can remove the redundant parameters [LeCun et al., 1990, Hassibi et al., 1993] or neurons [Mozer and Smolensky, 1988, Ji et al., 1990, Reed, 1993] that have little influence on the output, they require analyzing each parameter and neuron via the network Hessian, which is computationally prohibitive for large architectures. The main motivation of these works was to build a more compact network. Recent destructive approaches instead focus on learning a shallower or thinner network that mimics the behaviour of an initial deeper network.<br />
<br />
Building a compact network is a particular research focus for '''Convolutional Neural Networks''' (CNNs). Some works have proposed to decompose the filters of a pre-trained network into low-rank filters, which reduces the number of parameters [Jaderberg et al., 2014b, Denton et al., 2014, Gong et al., 2014]. The issue with this proposal is that an initial deep network must first be trained successfully, since the decomposition acts as a post-processing step. [Weigend et al., 1991] and [Collins and Kohli, 2014] instead developed regularizers, applied during training, that eliminate some of the parameters of the network; the problem is that the number of layers and the number of neurons in each layer are still determined manually. A very similar group-Lasso approach for CNNs was previously proposed in [Liu et al., 2015]. The big-picture idea appears very similar, but the methods differ in detail: [Liu et al., 2015] involves computing the network Hessian, repeated multiple times over the learning process, which is computationally expensive on large-scale datasets; as a consequence, such techniques are no longer pursued in the current large-scale era.<br />
<br />
='''Model Training and Model Selection'''=<br />
<br />
In general, a deep network has $L$ layers, each applying a linear operation to its input, intertwined with non-linear activation functions such as '''Rectified Linear Units (ReLU) or sigmoids''' and, potentially, pooling operations. Suppose layer $l$ has $N_{l}$ neurons. The network parameters are $\Theta=(\theta_{l})_{1\leqslant{l}\leqslant{L}}$, where $\theta_{l}=({\theta^n _{l}})_{1\leqslant{n}\leqslant{N_{l}}}$ and $\theta^n _{l}=[w_{l}^{n},b_{l}^{n}]$. Here, $w_{l}^{n}$ is a linear operator acting on the layer's input and $b_{l}^{n}$ is a bias. Given an input $x$, the succession of linear, non-linear and pooling operations produces the output $\hat{y}=f(x,\Theta)$, where $f(\cdot)$ encodes that succession of operations.<br />
<br />
At the training stage, we have $N$ input-output pairs ${(x_{i},y_{i})}_{1\leqslant{i}\leqslant{N}}$, and the loss function $\ell(y_{i},f(x_{i},\Theta))$ compares the predicted output with the ground-truth output. Typically, one chooses the logistic loss for classification and the square loss for regression. Learning the parameters of the network is then equivalent to solving the following optimization problem:<br />
$$\displaystyle \min_{\Theta}\frac{1}{N}\sum_{i=1}^{N}\ell(y_{i},f(x_{i},\Theta))+\gamma(\Theta),$$ where $\gamma(\Theta)$ represents a regularizer on the network parameters. Choices for such a regularizer include weight decay, i.e., $\gamma(\cdot)$ is the (squared) $\ell_{2}$-norm, or sparsity-inducing norms, e.g., the $\ell_{1}$-norm. The goal of this paper is to automatically determine the number of neurons in each layer, and neither of the above techniques achieves this goal. The authors instead make use of '''group sparsity''' (GS) [Yuan and Lin, 2007], starting from an overcomplete network and cancelling the influence of some neurons. The regularizer can then be written as $$\gamma(\Theta)=\sum_{l=1}^{L}\beta_{l}\sqrt{P_{l}}\sum_{n=1}^{N_{l}}||\theta_{l}^{n}||_{2},$$ where $P_{l}$ is the size of the vector grouping the parameters of each neuron in layer $l$, and $\beta_{l}$ balances the influence of the penalty. In practice, the authors found it most effective to use two different weights: a relatively small $\beta_{l}$ for the first few layers and a larger one for the remaining layers. The small weight prevents deleting too many neurons in the first few layers, so that enough information is retained for learning the remaining parameters. The original premise of this paper seemed to suggest a new method that differs from both the constructive and destructive methods described above; however, starting with an overcomplete network and training with group sparsity appears to be no different from destructive methods. The main contribution is then the regularization function acting on entire neurons, which is, in fairness, an interesting approach.<br />
<br />
The group sparsity term effectively removes entire neurons, while standard regularizers on the individual parameters help generalization [Bartlett, 1996, Krogh and Hertz, 1992, Theodoridis, 2015, Collins and Kohli, 2014]. Combining the two ideas yields the '''sparse group Lasso''' (SGL), a more general penalty that merges the $\ell_{1}$ norm of the Lasso with the group Lasso ("two-norm") penalty. This produces solutions that are sparse at both the individual-feature and group levels [1]. The regularizer can be written as $$\gamma(\Theta)=\sum_{l=1}^{L}((1-\alpha)\beta_{l}\sqrt{P_{l}}\sum_{n=1}^{N_{l}}||\theta_{l}^{n}||_{2}+\alpha\beta_{l}||\theta_{l}||_{1})$$ where $\alpha\in[0,1]$. Setting $\alpha=0$ recovers the group sparsity regularizer. In practice, both $\alpha=0$ and $\alpha=0.5$ are used in the experiments.<br />
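Both penalties are cheap to evaluate. A minimal NumPy sketch (function and variable names are ours, not the paper's), assuming each layer's parameters are stored as an $N_{l}\times P_{l}$ matrix with one row per neuron:<br />

```python
import numpy as np

def sparse_group_lasso_penalty(theta, beta, alpha=0.5):
    """Compute the sparse-group-Lasso regularizer for one network.

    theta : list of arrays; theta[l] has shape (N_l, P_l), one row per neuron
    beta  : list of per-layer weights beta_l
    alpha : 0 gives pure group sparsity; alpha > 0 adds an l1 term
    """
    penalty = 0.0
    for theta_l, beta_l in zip(theta, beta):
        P_l = theta_l.shape[1]  # parameters per neuron in layer l
        group_term = np.sqrt(P_l) * np.linalg.norm(theta_l, axis=1).sum()
        l1_term = np.abs(theta_l).sum()
        penalty += (1 - alpha) * beta_l * group_term + alpha * beta_l * l1_term
    return penalty
```

With $\alpha=0$ this reduces to the group sparsity regularizer above; with $\alpha=0.5$ it matches the sparse group Lasso setting used in the experiments.<br />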
<br />
This is reminiscent of the relationship among Lasso, ridge, and elastic net regression (explained in Hastie et al., [https://web.stanford.edu/~hastie/Papers/ESLII.pdf The Elements of Statistical Learning], section 3.4). In Lasso regression, the penalized residual sum of squares consists of the ordinary residual sum of squares plus an L1 regularizer; in ridge regression, it is the residual sum of squares plus an L2 regularizer. Elastic net regression combines the two, optimizing an objective that includes both the L1 and L2 norms. <br />
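Written out in that notation (standard textbook forms, not formulas from the paper under review), the three penalized least-squares objectives are:<br />

```latex
\begin{aligned}
\hat\beta^{\text{ridge}}   &= \arg\min_{\beta}\ \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2,\\
\hat\beta^{\text{lasso}}   &= \arg\min_{\beta}\ \|y - X\beta\|_2^2 + \lambda \|\beta\|_1,\\
\hat\beta^{\text{elastic}} &= \arg\min_{\beta}\ \|y - X\beta\|_2^2
      + \lambda\bigl(\alpha \|\beta\|_1 + (1-\alpha)\|\beta\|_2^2\bigr),
\end{aligned}
```

mirroring how the sparse group Lasso above interpolates between the group penalty and the $\ell_{1}$ penalty via $\alpha$.<br />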
<br />
To perform the optimization, the paper uses proximal gradient descent [Parikh and Boyd, 2014]. This approach iteratively takes a gradient step of size $t$ with respect to the loss, followed by a proximal step. The algorithm is as follows: <br />
<br />
We define the proximal operator of $f$, with step size $t$, as $$prox_{tf}(v)=\displaystyle \arg\min_{x}(\frac{1}{2t}||x-v||_{2}^{2}+f(x))$$ <br />
<br />
<br />
Suppose we want to minimize $f(x)+g(x)$, where $f$ is smooth and $g$ possibly is not. The proximal gradient method then iterates $$x^{(k+1)}=prox_{t^{k}g}(x^{(k)}-t^{k}\nabla{f}(x^{(k)})), \quad k=1,2,3,...$$ <br />
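As a concrete toy instance (our own illustration, not from the paper), this iteration solves the scalar lasso problem $\min_x \frac{1}{2}(x-v)^2+\lambda|x|$, whose proximal step is plain soft-thresholding:<br />

```python
import numpy as np

def soft_threshold(a, b):
    # S(a, b) = sign(a) * max(|a| - b, 0)
    return np.sign(a) * np.maximum(np.abs(a) - b, 0.0)

def proximal_gradient(v, lam, step=0.5, iters=200):
    """Minimize f(x) + g(x) with smooth f(x) = 0.5 * (x - v)**2 and
    non-smooth g(x) = lam * |x|, via x <- prox_{t g}(x - t * f'(x))."""
    x = 0.0
    for _ in range(iters):
        grad = x - v                      # gradient of the smooth part f
        x = soft_threshold(x - step * grad, step * lam)
    return x
```

For this objective the minimizer is known in closed form, $S(v,\lambda)$, so for instance `proximal_gradient(3.0, 1.0)` converges to 2.0 and `proximal_gradient(0.5, 1.0)` to 0.<br />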
<br />
Therefore, the parameters of each neuron are updated as $$\tilde{\theta}_{l}^{n}=\displaystyle \arg\min_{\theta_{l}^{n}}\frac{1}{2t}||\theta_{l}^{n}-\hat{\theta}_{l}^{n}||_{2}^{2}+\gamma(\Theta),$$<br />
where $\hat{\theta}_{l}^{n}$ is the result of the gradient step on the loss. Following the derivation of [Simon et al., 2013], this problem has the closed-form solution <br />
$$\tilde{\theta}_{l}^{n}=\Big(1-\frac{t(1-\alpha)\beta_{l}\sqrt{P_{l}}}{||S(\hat{\theta}_{l}^{n},t\alpha\beta_{l})||_{2}}\Big)_{+}S(\hat{\theta}_{l}^{n},t\alpha\beta_{l}),$$<br />
where $(\cdot)_{+}$ takes the maximum of the argument and 0, and $S(\cdot,\cdot)$ is the soft-thresholding operator $$S(a,b)=\mathrm{sign}(a)(|a|-b)_{+}.$$<br />
In practice, stochastic gradient descent with mini-batches is used, and the variables of all groups are updated according to the closed form of $\tilde{\theta}_{l}^{n}$. When learning terminates, the neurons whose parameters have gone to zero are removed. Additionally, in fully-connected layers, the neurons acting on the output of zeroed-out neurons of the previous layer become useless and are removed accordingly.<br />
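The per-neuron update is easy to vectorize. A hedged NumPy sketch of the closed form above (function and variable names are ours):<br />

```python
import numpy as np

def soft_threshold(a, b):
    # elementwise S(a, b) = sign(a) * (|a| - b)_+
    return np.sign(a) * np.maximum(np.abs(a) - b, 0.0)

def group_prox_update(theta_hat, t, alpha, beta_l):
    """Closed-form proximal update for one neuron's parameter vector
    (following Simon et al., 2013): shrink elementwise, then scale the
    whole group toward zero; groups with small norm are zeroed entirely."""
    P_l = theta_hat.size
    s = soft_threshold(theta_hat, t * alpha * beta_l)
    norm = np.linalg.norm(s)
    if norm == 0.0:
        return np.zeros_like(theta_hat)
    scale = max(0.0, 1.0 - t * (1 - alpha) * beta_l * np.sqrt(P_l) / norm)
    return scale * s
```

With $\alpha=0$ there is no elementwise shrinkage, and a whole neuron survives or dies depending on whether its $\ell_{2}$ norm exceeds $t\beta_{l}\sqrt{P_{l}}$.<br />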
<br />
='''Experiment'''=<br />
<br />
==='''Set Up'''===<br />
<br />
They use two large-scale image classification datasets, '''ImageNet''' [Russakovsky et al., 2015] and '''Places2-401''' [Zhou et al., 2015]. They also conducted additional experiments on the '''ICDAR''' character recognition dataset of [Jaderberg et al., 2014a]. <br />
<br />
For ImageNet, they used the subset containing 1000 categories, with 1.2 million training images and 50,000 validation images. Places2-401 has more than 10 million images in 401 unique scene categories, with 5,000 to 30,000 images per category. For both datasets, the architectures are based on the VGG-B network (BNet) [Simonyan and Zisserman, 2014] and on DecomposeMe8 ($Dec_{8}$) [Alvarez and Petersson, 2016]. BNet has 10 convolutional layers followed by 3 fully-connected layers; in the experiments, the first 2 fully-connected layers are removed, yielding $BNet^{C}$ (this is shown to reduce the number of parameters while maintaining the accuracy of the original network). $Dec_{8}$ contains 16 convolutional layers with 1D kernels, which model 8 2D convolutional layers. Both models were trained for a total of 55 epochs with 12,000 batches per epoch and a batch size of 48 and 180 for BNet and $Dec_{8}$, respectively. The learning rate was initialized to 0.01 and later multiplied by 0.1. They set $\beta_{l}$=0.102 for the first three layers and $\beta_{l}$=0.255 for the remaining ones.<br />
<br />
The ICDAR dataset consists of 185,639 training and 5,198 test samples split into 36 categories. The architecture here starts with 6 1D convolutional layers with max-pooling (rather than 3 convolutional layers each followed by a maxout layer [Goodfellow et al., 2013]), followed by one fully-connected layer; they call this architecture $Dec_{3}$. The model was trained for a total of 45 epochs with a batch size of 256 and 1,000 iterations per epoch. The learning rate was initialized to 0.1 and multiplied by 0.1 in the second, seventh and fifteenth epochs. They set $\beta_{l}$=5.1 for the first layer and $\beta_{l}$=10.2 for the remaining ones.<br />
<br />
==='''Results'''===<br />
<br />
[[File:imageNet.png]]<br />
<br />
The above table compares the accuracy of the original architectures and the proposed ones. For $Dec_{8}$ on the ImageNet dataset, two additional models were evaluated: $Dec_{8}-640$ with 640 neurons per layer and $Dec_{8}-768$ with 768 neurons per layer. $Dec_{8}-640_{SGL}$ denotes the sparse group Lasso regularizer with $\alpha=0.5$, and $Dec_{8}-640_{GS}$ the group sparsity regularizer. Note that all the proposed architectures yield an improvement over the original network except $Dec_{8}-768$. For instance, Ours-$BNet_{GS}^{C}$ improves performance by 1.6% compared to $BNet^{C}$. <br />
<br />
[[File:44.png]]<br />
<br />
[[File:2.png]]<br />
<br />
The above figures report the percentage of neurons/parameters removed by the approach for $BNet^{C}$ and $Dec_{8}$. For example, in the first figure, the approach reduces the number of neurons by over 12% and the number of parameters by around 14%, while improving generalization by 1.6% (as indicated by the accuracy gap). The left image in the first figure also shows that the reduction in neurons is spread across all the layers, with the largest difference in L10. For $Dec_{8}$, the second figure shows that the benefits of the approach become more significant as the number of neurons per layer increases. For instance, $Dec_{8}-640$ with the group sparsity regularizer reduces the number of neurons by 10% and the number of parameters by 12.48%. The left image in the second figure again shows that the reduction in neurons is spread across all the layers. <br />
<br />
[[File:ICDA.png]]<br />
<br />
Finally, the above figure shows the experimental results on the ICDAR dataset, using the $Dec_{3}$ architecture, whose last two layers initially contain 512 neurons. The accuracy of the 2D max-pooling baseline is 83.8%, while $Dec_{3}$ reaches 89.3%, meaning that the 1D filters outperform a network with 2D kernels. On this dataset, the model removes 38.64% of the neurons and up to 80% of the parameters with the group sparsity regularizer.<br />
<br />
All the above results show that the algorithm effectively reduces the number of parameters while maintaining or even increasing model accuracy; that is, the automatic model selection performs well on these classification tasks.<br />
<br />
='''Analysis on Testing'''=<br />
<br />
The algorithm does not remove neurons during training; instead, the zeroed-out neurons are removed afterwards, which yields a smaller network at test time. This not only reduces the number of parameters of the network but also decreases the computational memory cost and increases speed. <br />
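For fully-connected layers, this post-training pruning amounts to deleting the all-zero rows of one layer's weight matrix together with the matching columns of the next layer's matrix. A small NumPy sketch under that assumption (names are ours; convolutional layers need the analogous channel-wise surgery):<br />

```python
import numpy as np

def prune_zero_neurons(W1, b1, W2):
    """Drop fully-connected neurons whose parameters are all zero.

    W1 : (n_neurons, n_inputs) weights of the layer being pruned
    b1 : (n_neurons,) biases of that layer
    W2 : (n_next, n_neurons) weights of the following layer; the
         columns fed by dead neurons are dropped as well
    """
    alive = ~((W1 == 0).all(axis=1) & (b1 == 0))
    return W1[alive], b1[alive], W2[:, alive]
```

With a ReLU activation, a zeroed neuron always outputs 0, so removing it together with its outgoing column leaves the network function unchanged.<br />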
<br />
[[File:table2.png]]<br />
<br />
The above table reports the runtime, memory, and percentage of parameters removed after discarding the zeroed-out neurons. BNet and $Dec_{8}$ were tested on ImageNet, while $Dec_{3-GS}$ was tested on ICDAR. All the models speed up the runtime: for example, $Dec_{8}-768_{GS}$ on ImageNet speeds up runtime by nearly 16% at a batch size of 8, and $Dec_{3}$ on ICDAR by nearly 50% at a batch size of 16. In terms of parameters removed, BNet, $Dec_{8}-640_{GS}$ and $Dec_{8}-768_{GS}$ reduce 12.06%, 26.51%, and 46.73% respectively; more significantly, $Dec_{3-GS}$ reduces 82.35% of the parameters. All of these changes demonstrate the benefits at test time. The runtimes were obtained using a single Tesla K20m, and the memory estimates use RGB images of size 224 × 224 for Ours-BNet, Ours-Dec8-640_GS and Ours-Dec8-768_GS, and gray-level images of size 32 × 32 for Ours-Dec3-GS.<br />
<br />
='''Conclusion'''=<br />
<br />
In this paper, the authors introduced an approach relying on a group sparsity regularizer that automatically determines the number of neurons in each layer of a deep network. The experiments show that the approach not only reduces the number of parameters of the model but also saves computational memory and increases speed at test time. A limitation of the approach is that the number of layers in the network remains fixed.<br />
<br />
='''Critique'''=<br />
The authors of the paper state that "...we assume that the parameters of each neuron in layer $l$ are grouped in a vector of size $P_{l}$ and where $\lambda_{l}$ sets the influence of the penalty. Note that, in the general case, this weight can be different for each layer $l$. In practice, however, we found most effective to have<br />
two different weights: a relatively small one for the first few layers, and a larger weight for the<br />
remaining ones. This effectively prevents killing too many neurons in the first few layers, and thus<br />
retains enough information for the remaining ones." However, the authors fail to give any guidance as to what counts as "the first few layers", or what the relative sizes of the two weights should be once the "first few layers" have been chosen. Indeed, this choice appears to be an unaccounted-for component of tuning the model, yet it receives scant attention in the paper. Numerical comparisons should be carried out to allow further discussion of this question.<br />
<br />
The parameters $\beta_l$ are important for the performance of the network, but the authors do not provide enough detail on how to tune them, e.g., via cross-validation or otherwise. Studying the performance of the model under various settings of $\beta_l$ would be interesting and important for understanding the robustness of the method. <br />
<br />
The experiments could have included better baseline models to compare against. For example, how do we know the original model was not overly complex to begin with? It might have been a good idea for the authors to compare their group sparse Lasso method against the naive method of (blindly) reducing the number of neurons in each layer by 10-20%, just as a preliminary check. On top of that, the authors could have compared against conventional L1 and L2 regularization, which can reduce the number of non-zero parameters, as well as other techniques such as setting small weight values to zero and then fine-tuning, as done in https://www.microsoft.com/en-us/research/publication/exploiting-sparseness-in-deep-neural-networks-for-large-vocabulary-speech-recognition/. The authors could also have applied the theory of ridge and Lasso regression to analyze the effect of the regularization mathematically.<br />
<br />
A rather obvious point of comparison has also been left out: the authors do not compare this method with Dropout [Srivastava, 2014], which is similar in terms of its physical effect on the network. The authors state: "[...] Note that none of the standard regularizers mentioned above achieve this goal: The former favors small parameter values, and the latter tends to cancel out individual parameters, but not complete neurons." This draws a direct comparison to regularizers while ignoring that dropout methods remove exactly complete neurons.<br />
<br />
It would have been interesting to see the performance gain on real time applications such as YOLO or SSD object detectors that are being used in self-driving cars by incorporating the approach presented by the paper into its convolution neural nets. Meanwhile, as an interesting extension, it would be better if the authors could test this group sparse regularization in deep reinforcement learning, where a convolution neural network is used to predict the reward.<br />
<br />
As an important property of regularizer, the influence of the group sparse regularization on avoiding overfitting is yet unknown. The number of epochs increases or decreases after applying this regularization to achieve the same accuracy can be further studied.<br />
<br />
The authors' claim that their approach "automatically determines the number of neurons" seems overstated at best. In reality, the approach finds redundancy in an overspecified model, which provides the benefit of size reduction as outlined. This is a non-trivial benefit, but it cannot address the (albeit less likely) issue of an underspecified model. In conjunction with the fact that the number of layers must remain fixed, the method feels like smart regularization rather than size learning. Coupled with the lack of a dropout comparison, this leaves doubts regarding the efficacy of the technique for model specification. If a model must be intentionally over-specified to learn its parameters, then it is hard to claim memory-reduction benefits vis-a-vis any technique stemming from an underspecified model. In any case, this may serve as an efficient technique for many networks used in practice today, which are designed to be extraordinarily massive, but labelling it a means of sizing a network is erroneous.<br />
<br />
It seems strange that the parameters of each neuron in a layer are grouped this way in the group sparsity term. According to (Friedman et al., 2010), group sparsity acts like the Lasso at the group level: an entire group of predictors may drop out of the model. Here this means whole neurons of a layer may drop out; if all of them do, the connection between layers of the network is broken.<br />
<br />
='''References'''=<br />
<br />
P. L. Bartlett. For valid generalization the size of the weights is more important than the size of the network. In NIPS, 1996.<br />
<br />
M. G. Bello. Enhanced training algorithms, and integrated training/architecture selection for multilayer perceptron networks. IEEE Transactions on Neural Networks, 3(6):864–875, Nov 1992.<br />
<br />
Yu Cheng, Felix X. Yu, Rogério Schmidt Feris, Sanjiv Kumar, Alok N. Choudhary, and Shih-Fu Chang. An exploration of parameter redundancy in deep networks with circulant projections. In ICCV, 2015.<br />
<br />
I. J. Goodfellow, D. Warde-farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. In ICML, 2013.<br />
<br />
G. E. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. In arXiv, 2014.<br />
<br />
M. Jaderberg, A. Vedaldi, and A. Zisserman. Deep features for text spotting. In ECCV, 2014a.<br />
<br />
M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. In BMVC, 2014b.<br />
<br />
N. Simon, J. Friedman, T. Hastie, and R. Tibshirani. A sparse-group lasso. Journal of Computational and Graphical Statistics, 2013.<br />
<br />
H. Zhou, J. M. Alvarez, and F. Porikli. Less is more: Towards compact CNNs. In ECCV, 2016.<br />
<br />
Group LASSO - https://pdfs.semanticscholar.org/f677/a011b2a912e3c5c604f6872b9716cc0b8aa0.pdf<br />
<br />
Liu, Baoyuan, et al. "Sparse convolutional neural networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.<br />
<br />
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1 (January 2014), 1929-1958.<br />
<br />
Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. "A note on the group lasso and a sparse group lasso." arXiv preprint arXiv:1001.0736 (2010).<br />
<br />
<br />
Derivation & Motivation of the Soft Thresholding Operator (Proximal Operator):<br />
# http://www.onmyphd.com/?p=proximal.operator<br />
# https://math.stackexchange.com/questions/471339/derivation-of-soft-thresholding-operator</div>Jdenghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_What_and_Where_to_Draw&diff=31746Learning What and Where to Draw2017-12-03T04:42:46Z<p>Jdeng: /* Discussion */</p>
<hr />
<div><br />
== Introduction ==<br />
<br />
Generative Adversarial Networks (GANs) have been successfully used to synthesize compelling real-world images. In what follows we outline an enhanced GAN called the Generative Adversarial What-Where Network (GAWWN). In addition to accepting a noise vector as input, this network also accepts instructions describing what content to draw and in which location to draw it. Traditionally, these models use simple conditioning variables such as a class label or a non-localized caption. The authors of 'Learning What and Where to Draw' believe that image synthesis will be drastically enhanced by incorporating a notion of localized objects. <br />
<br />
The main goal in constructing the GAWWN is to separate the questions of 'what' and 'where' to modify the image at each step of the computational process. At a high level, the purpose of the GAWWN is to give the generative model location variables to condition on in addition to text. One method is to draw a bounding box that outlines the location of the subject, so that generated images place the subject in approximately the location depicted by the bounding box. Another method is to use keypoints that locate features of the subject, so that the generative model produces images whose features respect those keypoint locations. Prior to elaborating on the experimental results of the GAWWN, the authors note that this model benefits from greater parameter efficiency and produces more interpretable sample images, since it is known from the input description what each image is intended to depict. The proposed model learns to perform location- and content-controllable image synthesis on the Caltech-UCSD Birds (CUB) data set and the MPII Human Pose (MHP) data set.<br />
<br />
A highlight of this work is that the authors demonstrate two ways to encode spatial constraints into the GAN. First, the authors provide an implementation showing how to condition on the coarse location of a bird by incorporating spatial masking and cropping modules into a text-conditional Generative Adversarial Network ('''Bounding-box-conditional text-to-image model'''). This technique is implemented using spatial transformers. Second, the authors demonstrate how they are able to condition on part locations of birds and humans in the form of a set of normalized (x,y) coordinates ('''Keypoint-conditional text-to-image model''').<br />
<br />
== Related Work == <br />
<br />
This is not the first paper to show how Deep convolutional networks can be used to generate synthetic images. Other notable works include:<br />
* Dosovitskiy et al. (2015) trained a deconvolutional network to generate 3D chair renderings conditioned on a set of graphics codes indicating shape, position and lighting<br />
* Yang et al. (2015) followed with a recurrent convolutional encoder-decoder that learned to apply incremental 3D rotations to generate sequences of rotated chair and face images<br />
* Reed et al. (2015) trained a network to generate images that solved visual analogy problems<br />
* Gregor et al. (2015) used a recurrent variational autoencoder with attention mechanisms for reading and writing different portions of the image canvas.<br />
The authors note that the above models are all deterministic, and discuss how other recent work attempts to learn a probabilistic model with convolutional variational autoencoders (Kingma and Welling, 2014; Rezende et al., 2014) in which the latent space is divided into separate blocks corresponding to graphics codes. In discussing current work in this area it is stated that all of the above formulations could benefit from the principle of separating what and where conditioning variables. The authors also cite the simple and popular Generative Adversarial Network (Goodfellow et al., 2014), which produces sharper synthetic images than those generated by VAEs.<br />
The current paper's work builds on Reed et al., "Generative Adversarial Text to Image Synthesis" (ICML 2016), where the authors proposed an end-to-end deep neural architecture based on the conditional GAN framework, which successfully generated realistic images (64 × 64) from natural language descriptions. Also, Xiang et al. (2017) proposed a method for MR-to-CT synthesis via a novel deep embedding convolutional neural network (DECNN). Specifically, they generated feature maps from MR images, and then transformed these feature maps using embedding convolutional layers in the network.<br />
<br />
== Background Knowledge ==<br />
=== Generative Adversarial Networks === <br />
<br />
Before outlining the GAWWN we briefly review GANs. A GAN consists of a generator G that generates a synthetic image given a noise vector drawn from either a Gaussian or Uniform distribution. The discriminator is tasked with classifying images generated by the generator as either real or synthetic. The two networks compete in the following minimax game: <br />
<br />
<br />
$\displaystyle \min_{G} \max_{D} V(D,G) = \mathop{\mathbb{E}}_{x \sim p_{data}(x)}[\log D(x)] + \mathop{\mathbb{E}}_{z \sim p_{z}(z)}[\log(1-D(G(z)))] $<br />
<br />
where z is a noise vector. In the context of GAWWN networks we play the above minimax game with G(z,c) and D(x,c), where c is the additional what-and-where information supplied to the network. For the input tuple (x,c) to be interpreted as "real", the image x must not only look real but also match the context information in c.<br />
More details about GANs, the fundamental algorithm that GAWWN is built on, are explained below. The main goal of a GAN is to provide a generative model for data (e.g., images). Learning in GANs proceeds via comparison of simulated data with some real input data. There are two component networks within a GAN: a generator network and a discriminator network.<br />
<br />
Within the generator network, objects (in this case images) are generated from noise drawn from some distribution <math>p_z(z)</math>. Data generated by the generator network are then passed through the discriminator network along with the real input dataset. The discriminator network is trained to differentiate real input from simulated input. The goals of this structure are to train the generator network to simulate images that can "fool" the discriminator network when compared to real input data, and to train the discriminator network to distinguish "fake" input from real input data. Mathematically, this optimization problem is summarized by the minimax game demonstrated above. According to this blog post (Introductory guide to Generative Adversarial Networks (GANs) [https://www.analyticsvidhya.com/blog/2017/06/introductory-generative-adversarial-networks-gans/]), the generator network and the discriminator network are trained separately (Shaikh, 2017). First, the discriminator network is trained on real input data and simulated data from the generator network to learn what real data look like. Then, using losses propagated through the discriminator network, the generator network is trained to simulate fake data that the discriminator predicts to be real (Shaikh, 2017). This process is repeated iteratively, with each component network adversarially learning to outperform the other. <br />
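The two expectation terms in the minimax objective above can be made concrete with a toy Monte-Carlo computation. The sketch below uses hypothetical stand-ins for D and G (a fixed sigmoid and a fixed affine map, not trained networks) purely to show how V(D, G) is estimated from samples.<br />

```python
import numpy as np

rng = np.random.default_rng(0)

def D(x):
    # Hypothetical discriminator: squashes its input to a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def G(z):
    # Hypothetical generator: maps noise to a synthetic sample.
    return 0.5 * z - 1.0

def estimate_V(real_samples, noise_samples):
    """Monte-Carlo estimate of V(D, G) = E_x[log D(x)] + E_z[log(1 - D(G(z)))]."""
    real_term = np.mean(np.log(D(real_samples)))
    fake_term = np.mean(np.log(1.0 - D(G(noise_samples))))
    return real_term + fake_term

x = rng.normal(loc=2.0, scale=1.0, size=10_000)  # "real" data x ~ p_data
z = rng.normal(size=10_000)                      # noise z ~ p_z
v = estimate_V(x, z)
```

During training, the discriminator ascends this objective (both log terms toward 0) while the generator descends the fake term, which is exactly the alternating procedure described above.<br />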
<br />
=== Structured Joint Embedding of Visual Descriptions and Images ===<br />
In order to encode visual content from text descriptions the authors use a convolutional and recurrent text encoder to establish a correspondence function between images and text features. This approach is not new, the authors rely on the previous work of Reed et al. (2016) to implement this procedure. To learn sentence embeddings the following function is optimized:<br />
<br />
<br />
$\frac{1}{N}\sum_{n=1}^{N} \Delta (y_{n}, f_{v}(v_{n})) + \Delta (y_{n}, f_{t}(t_{n})) $<br />
<br />
where $\{(v_{n}, t_{n}, y_{n}): n=1,...,N\}$ is the training data, $\Delta$ is the 0-1 loss, $v_{n}$ are the images, and $t_{n}$ are the text descriptions of class $y_{n}$. The functions $f_{v}$ and $f_{t}$ are defined as follows:<br />
<br />
<br />
$ f_{v}(v) = \displaystyle \arg\max_{y \in Y} \mathop{\mathbb{E}}_{t \sim T(y)}[\phi(v)^{T}\varphi(t)], \quad f_{t}(t) = \displaystyle \arg\max_{y \in Y} \mathop{\mathbb{E}}_{v \sim V(y)}[\phi(v)^{T}\varphi(t)]$<br />
<br />
where $\phi$ is the image encoder and $\varphi$ is the text encoder. The intuition behind the encoder is relatively simple. The encoder learns to produce a larger score with images of the correct class compared to the other classes, and works similarly going in the other direction.<br />
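A minimal sketch of these two classifiers, using identity functions in place of the learned encoders $\phi$ and $\varphi$ (which in the paper are convolutional/recurrent networks); the toy classes and embeddings below are illustrative assumptions.<br />

```python
import numpy as np

def f_v(v, text_sets, phi, varphi):
    """Image-side classifier f_v(v) = argmax_y E_{t ~ T(y)}[phi(v)^T varphi(t)]:
    pick the class whose text descriptions are, on average, most compatible
    with the image embedding."""
    scores = [np.mean([phi(v) @ varphi(t) for t in texts]) for texts in text_sets]
    return int(np.argmax(scores))

def f_t(t, image_sets, phi, varphi):
    """Text-side classifier, symmetric to f_v, averaging over images V(y)."""
    scores = [np.mean([phi(v) @ varphi(t) for v in images]) for images in image_sets]
    return int(np.argmax(scores))

# Toy demo with identity "encoders": class 0 descriptions point along [1, 0],
# class 1 descriptions along [0, 1].
identity = lambda x: np.asarray(x, dtype=float)
text_sets = [[[1.0, 0.0], [0.9, 0.1]],   # T(0): descriptions of class 0
             [[0.0, 1.0], [0.1, 0.9]]]   # T(1): descriptions of class 1
pred = f_v([0.95, 0.05], text_sets, identity, identity)
```

As stated above, training pushes the compatibility score $\phi(v)^T\varphi(t)$ to be largest for matching image/text pairs, which is what makes these argmax classifiers accurate.<br />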
<br />
== GAWWN Visualization and Description == <br />
===Bounding-box-conditional text-to-image model=== <br />
[[File:bounding box.PNG]]<br />
==== Generator Network ====<br />
* Step 1: Start with input noise and text embedding<br />
* Step 2: Replicate the text embedding to form a $M \times M \times T$ feature map, then warp it spatially to fit into the unit-interval bounding box coordinates<br />
* Step 3: Apply convolution, pooling to reduce spatial dimension to $1 \times 1$<br />
* Step 4: Concatenate feature vector with the noise vector z<br />
* Step 5: Generator branching into local and global processing stages<br />
* Step 6: Global pathway: stride-2 deconvolutions; local pathway: apply a masking operation to set regions outside the object bounding box to 0<br />
* Step 7: Merge local and global pathways<br />
* Step 8: Apply a series of deconvolutional layers and in the final layer apply tanh activation to restrict output to [-1,1]<br />
<br />
====Discriminator Network====<br />
* Step 1: Replicate text as in Step 2 above<br />
* Step 2: Process image in local and global pathways<br />
* Step 3: In the local pathway, stride-2 convolutional layers; in the global pathway, convolutions down to a vector<br />
* Step 4: Local and global pathway output vectors are merged <br />
* Step 5: Produce discriminator score<br />
In the initial stages of this process, the researchers average the feature maps in the presence of multiple localized captions. The reliability of this heuristic is unknown, and comparisons are not drawn with alternative heuristics such as max-pooling or min-pooling. Alternative heuristics could improve results.<br />
<br />
=== Keypoint-conditional text-to-image === <br />
<br />
[[File:key point text.PNG]]<br />
<br />
==== Generator Network ====<br />
* Step 1: Keypoint locations are encoded into a $M \times M \times K$ spatial feature map<br />
* Step 2: Keypoint tensor progresses through several stages of the network<br />
* Step 3: Concatenate keypoint vector with noise vector<br />
* Step 4: Keypoint tensor is flattened into a binary matrix, then replicated into a tensor<br />
* Step 5: Noise-text-keypoint vector is fed to global and local pathways<br />
* Step 6: The original keypoint tensor is concatenated with the local and global tensors, followed by additional deconvolutions<br />
* Step 7: Apply tanh activation function<br />
<br />
==== Discriminator Network ====<br />
* Step 1: Feed text-embedding into discriminator in two stages<br />
* Step 2: Combine text embedding additively with global pathway for convolutional image processing<br />
* Step 3: Spatially replicated text-embedding and concatenate with feature map <br />
* Step 4: The local pathway applies stride-2 convolutions, producing an output vector<br />
* Step 5: Combine local and global pathways and produce discriminator score <br />
<br />
=== Conditional keypoint generation model ===<br />
In creating this application the researchers discuss how it is not feasible to ask the user to input all of the keypoints for a given image. In order to remedy this issue, a method is developed to access the conditional distributions of unobserved keypoints given a subset of observed keypoints and the image caption. In order to solve this problem, a generic GAN is used. <br />
<br />
The authors formulate the generator network $G_{k}$ for keypoints s,k as follows: <br />
<br />
$G_{k}(z,t,k,s) := s \odot k + (1-s) \odot f(z,t,k)$<br />
<br />
where $\odot$ denotes pointwise multiplication and $f: \Re^{Z+T+3K} \to \Re^{3K}$ is an MLP. As usual, the discriminator learns to distinguish real keypoints from synthetic keypoints.<br />
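The switch-gating formula above can be sketched directly: observed keypoints (switch s = 1) pass through unchanged, and the MLP only fills in the unobserved ones. The constant-output `f_stub` below is a hypothetical stand-in for the learned MLP f.<br />

```python
import numpy as np

def G_k(z, t, k, s, f):
    """Conditional keypoint generator
        G_k(z, t, k, s) = s ⊙ k + (1 - s) ⊙ f(z, t, k):
    keep observed keypoints where s = 1 and synthesize the rest."""
    s = np.asarray(s, dtype=float)
    return s * np.asarray(k, dtype=float) + (1.0 - s) * f(z, t, k)

# Hypothetical stand-in for the MLP f: R^{Z+T+3K} -> R^{3K}; it simply
# proposes 0.5 for every coordinate of every keypoint.
f_stub = lambda z, t, k: np.full_like(np.asarray(k, dtype=float), 0.5)

k = np.array([0.2, 0.8, 1.0,    # keypoint 1: (x, y, visible) — observed
              0.0, 0.0, 0.0])   # keypoint 2: unobserved
s = np.array([1, 1, 1, 0, 0, 0])  # switch vector marking observed entries
out = G_k(None, None, k, s, f_stub)
```

This gating is what lets a user specify only a subset of keypoints: whatever is given is respected exactly, and the generator is only ever trained to produce the missing entries.<br />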
<br />
In both the bounding-box-conditional and the keypoint-conditional text-to-image models, the noise vector z plays an important role in image generation and keypoint generation. This effect is strongly seen in the poor human image generations. Perhaps a denoising autoencoder should be included in the architecture; this would make the GAN invariant to environmental factors. It could also improve feature learning, whereby keypoints would more accurately be linked to the image poses during training.<br />
<br />
== Experiments == <br />
In this section of the wiki we examine the synthetic images generated by the GAWWN conditioned on different model inputs. The experiments are conducted with the Caltech-UCSD Birds (CUB) and MPII Human Pose (MHP) data sets. CUB has 11,788 images of birds, each belonging to one of 200 different species. The authors also include an additional data set from Reed et al. [2016]. Each image contains the bird location via a bounding box and keypoint coordinates for 15 bird parts. MHP contains 25K images of individuals participating in 410 different common activities. Mechanical Turk was used to collect three single-sentence descriptions for each image. For MHP, each image contains multiple sets of keypoints. During training, the text embedding for a given image was taken to be the average of a random sample of the encodings for that image. Caption information was encoded using a pre-trained char-CNN-GRU, and the Adam solver was used to train the GAWWN with a batch size of 16 and a learning rate of 0.0002. <br />
<br />
=== Controlling via Bounding Boxes === <br />
[[File:bird loc bb.PNG]]<br />
<br />
*Observations: Similar background across different images, but not perfectly invariant; changing the bounding box coordinates does not change the direction the bird is facing. The authors take this to mean that the noise vector encodes information about the pose within the bounding box.<br />
*Note: Noise vector is fixed<br />
<br />
=== Controlling individual part locations via keypoints ===<br />
[[File:keypt 1.PNG]]<br />
*Observations: Bird pose respects keypoints and is invariant across samples, background is invariant with changes in noise<br />
*Notes: Noise vector is not fixed <br />
<br />
[[File:keypt 2.PNG]]<br />
*Observations: Keypoints can be used to shrink, translate and stretch objects; comparing with the bounding-box figure, keypoints can also control orientation<br />
<br />
=== Generating both bird keypoints and images from text alone ===<br />
[[File:im 53.PNG]]<br />
* Observations: There is no major difference in image quality when comparing synthetic images created using generated and ground truth keypoints<br />
=== Beyond birds: generating images of humans === <br />
[[File:bb2.PNG]]<br />
* Observations: The GAWWN network generates much blurrier human images compared to generated bird images, simple captions seem to work while complex descriptions still present challenges, strong relationship between image caption and image<br />
<br />
== Summary of Contributions == <br />
* Novel architecture for text- and location-controllable image synthesis, which yields more realistic and high-resolution Caltech-UCSD bird samples <br />
* A text-conditional object part completion model enabling a streamlined user interface for specifying part locations <br />
* Exploratory results and a new dataset for pose-conditional text to human image synthesis<br />
<br />
== Resources==<br />
In this section, we enumerate any additional supplementary material provided by the authors to augment our understanding of the techniques discussed in the paper.<br />
<br />
Implementations of the GAWWNs can be found in the following repository: https://github.com/reedscot/nips2016.<br />
<br />
== Discussion ==<br />
The GAWWN does an excellent job of generating images conditioned on both informal text descriptions and object locations. Image location can be controlled using both a bounding box and a set of keypoints. A major achievement is the synthesis of compelling 128 × 128 images, whereas previous models could only generate 64 × 64. Another strength of GAWWN is that it is not constrained at test time by the location conditioning, as the authors are able to learn a generative model of part locations and generate them at test time. <br />
<br />
The ideas presented in this paper are in accord with other areas of applied mathematics. In quantitative finance one is always looking to condition on additional information when pricing derivative securities, and variance reduction techniques provide ways of conditioning on additional information to improve the efficiency of estimation procedures.<br />
<br />
This idea of learning "what" and "where" to draw can be applied to a number of fields including:<br />
# Using the generator to create musical sounds coming from a particular instrument (what) and insert it into another musical piece (where) using text<br />
# Using the "what" and "where" model to train the discriminator to identify doctored images/videos in crime<br />
# Identifying large-scale stock market movements/patterns by adding RNN layers to the GAN architecture<br />
<br />
The authors used Generative Adversarial Networks as the neural architecture to synthesize compelling real-world images, but it would be interesting to compare the performance of this network with that of another architecture, the variational autoencoder.<br />
<br />
== References ==<br />
# S. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee. Learning What and Where to Draw. In NIPS, 2016.<br />
# Z. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele. Evaluation of Output Embeddings for Fine-Grained Image Classification. In CVPR, 2015.<br />
# A. Dosovitskiy, J. Tobias Springenberg, and T. Brox. Learning to generate chairs with convolutional neural networks. In CVPR, 2015.<br />
# I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.<br />
# D. P. Kingma and M. Welling. Auto-encoding variational bayes. In ICLR, 2014.<br />
# S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative Adversarial Text to Image Synthesis. In ICML, 2016.<br />
# Xiang, Lei, et al. "Deep Embedding Convolutional Neural Network for Synthesizing CT Image from T1-Weighted MR Image." arXiv preprint arXiv:1709.02073 (2017).<br />
# Shaikh, F. (2017, June 15). Introductory guide to Generative Adversarial Networks (GANs) : https://www.analyticsvidhya.com/blog/2017/06/introductory-generative-adversarial-networks-gans/</div>Jdenghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F17/Decoding_with_Value_Networks_for_Neural_Machine_Translation&diff=31745STAT946F17/Decoding with Value Networks for Neural Machine Translation2017-12-03T04:09:18Z<p>Jdeng: /* Training and Learning */</p>
<hr />
<div>=Introduction=<br />
<br />
==Background Knowledge==<br />
*'''NMT'''<br />
Neural Machine Translation (NMT), which is based on deep neural networks and provides an end-to-end solution to machine translation, uses an '''RNN-based encoder-decoder architecture''' to model the entire translation process. Specifically, an NMT system first reads the source sentence using an encoder to build a "thought" vector, a sequence of numbers that represents the sentence meaning; a decoder, then, processes the "meaning" vector to emit a translation. (Figure 1)<sup>[[#References|[1]]]</sup><br />
[[File:VNFigure1.png|thumb|600px|center|Figure 1: Encoder-decoder architecture – example of a general approach for NMT.]]<br />
<br />
<br />
<br />
*'''Generalization: Sequence-to-Sequence(Seq2Seq) Model'''<br />
[[File:VNFigure4.png|thumb|450px|center|Figure 2: Seq2Seq Model: an example of text generation]]<br />
- Two RNNs: an encoder RNN, and a decoder RNN<br />
<br />
#In the seq2seq model, we need to use embeddings, so we first compile a vocabulary list containing all the words that the model is able to use or read. The inputs are tensors containing the IDs of the words in the sequence. The input is passed through the encoder, and its final hidden state, the "thought vector", is passed to the decoder as its initial hidden state. <br />
#The decoder is given the start-of-sequence token, <SOS>, and iteratively produces output until it emits the end-of-sequence token, <EOS><br />
<br />
- Commonly used in text generation, machine translation, speech recognition and related problems<br />
<br />
<br />
<br />
*'''Beam Search'''<br />
Decoding process:<br />
[[File:VNFigure2.png|thumb|600px|center|Figure 3]]<br />
Problem: choosing the word with the highest score at each time step t does not necessarily give the sentence with the highest probability (Figure 3). Beam search mitigates this problem (Figure 4). Beam search is a heuristic search algorithm: at each time step t, it keeps the top m proposals and continues decoding with each one of them. In the end, it returns the generated sentence with the highest overall probability, rather than greedily committing to the best word at each step. The algorithm terminates when all sentences are completely generated, i.e., all sentences end with the <EOS> token.<br />
[[File:VNFigure3.png|thumb|600px|center|Figure 4]]<br />
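The decoding procedure in Figures 3 and 4 can be sketched in a few lines. The toy conditional-probability table below is a hypothetical two-step language, constructed so that greedy decoding (beam width 1) misses the most probable sentence while beam search with width 2 finds it.<br />

```python
import math

def beam_search(cond_table, max_len, beam_width):
    """Toy beam search: keep the top `beam_width` partial sentences at each
    step, ranked by cumulative log-probability. cond_table[prefix] plays the
    role of the decoder distribution P(w | y_<t, x) at that prefix."""
    beams = [((), 0.0)]  # (partial sentence, cumulative log-probability)
    for _ in range(max_len):
        candidates = [(prefix + (w,), score + math.log(p))
                      for prefix, score in beams
                      for w, p in cond_table[prefix].items()]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]  # prune to the m best proposals
    return beams[0][0]

# Greedy decoding commits to "a" first (0.6 > 0.4), but the best full
# sentence is ("b", "y") with probability 0.4 * 0.9 = 0.36 > 0.6 * 0.5 = 0.30.
cond_table = {(): {"a": 0.6, "b": 0.4},
              ("a",): {"x": 0.5, "y": 0.5},
              ("b",): {"x": 0.1, "y": 0.9}}
```

This is exactly the myopic behaviour discussed later: the first-step winner "a" blocks the globally best sentence unless enough alternatives are kept in the beam.<br />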
<br />
*'''BLEU Score'''<br />
The BLEU score is an automatic method for evaluating the success of machine translation [9]. It is language independent. The basic method is to compare the n-grams (words, short phrases) of the reference translation to the output/candidate translation. The naive approach is to rank the success of the translation by counting the number of words that match and then divide by the total number of words of the candidate translation. However, this rewards translations that overuse common words such as "a" and "the". This problem is solved by imposing a limit on how many times a word can be used to increase the score of the translation. Additional modifications of the algorithm are implemented to handle scoring for sentence length, and for when a source sentence might translate to multiple candidate sentences.<br />
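The "limit on how many times a word can be used" mentioned above is the clipped (modified) n-gram precision at the core of BLEU; the small sketch below implements just that component (not the full BLEU score with brevity penalty and geometric mean).<br />

```python
from collections import Counter

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it appears in the reference, so overusing common words
    like "the" stops being rewarded."""
    cand = [tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1)]
    if not cand:
        return 0.0
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    counts = Counter(cand)
    clipped = sum(min(c, ref[g]) for g, c in counts.items())
    return clipped / len(cand)

cand = "the the the the".split()
ref = "the cat sat on the mat".split()
p1 = modified_precision(cand, ref, 1)  # "the" appears twice in the reference
```

Without clipping, the degenerate candidate above would score a perfect unigram precision of 1.0; clipping caps its credit at the reference count.<br />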
<br />
==Value Network==<br />
<br />
Despite the success of NMT with beam search, beam search tends to focus on short-term reward, an effect called myopic bias. At time t, a word $w$ may be appended to the candidate $y_{<t} = y_1,...,y_{t-1}$ if $P(y_{<t}+w|x) > P(y_{<t}+w'|x)$, even if the word $w'$ is the ground-truth translation at step t or can offer a better score in future decoding steps. By applying the concept of a value function from Reinforcement Learning (RL), the authors develop a neural network-based prediction model, '''the value network for NMT''', which estimates the long-term reward of appending $w$ to $y_{<t}$, to address the myopic bias.<br />
<br />
The value network takes the source sentence and any partial target sequence as input, and outputs a predicted value that estimates the expected total reward (e.g. BLEU<sup>[[#References|[2]]]</sup>), so that the selection of the best candidates in the decoding step is based on both the conditional probability of the partial sequence output by the NMT model and the estimated long-term reward output by the value network.<br />
<br />
==Contributions==<br />
1) Developing a decoding scheme that considers long-term reward while generating words one by one for machine translation. At each step, the new decoding scheme not only considers the probability of the word sequence conditioned on the source sentence but also relies on the predicted future reward. Ideally, such two aspects can lead to better final translation.<br />
<br />
2) Building another two modules for the value network, a semantic matching module, and a context-coverage module. The semantic matching module estimates the similarity between the source and target sentences. The context-coverage module measures the coverage of context used in the encoder-decoder layer observing the fact that better translation is generated when using more context in the attention mechanism.<br />
<br />
Results of translation experiments demonstrate the effectiveness and robustness of the new decoding mechanism compared to several baseline algorithms.<br />
<br />
=Neural Machine Translation=<br />
<br />
NMT systems are implemented with an RNN-based encoder-decoder framework, which directly models the probability $P(y|x)$ of a target sentence $y = (y_1,...,y_{T_y})$ conditioned on the source sentence $x = (x_1,...,x_{T_x})$, where $T_x$ and $T_y$ are the lengths of sentences x and y.<br />
<br />
The encoder of NMT reads the source sentence x word by word and generates a hidden representation for each word $x_i$:<br />
$$<br />
h_i = f(h_{i-1},x_i)<br />
$$<br />
where f is a recurrent unit such as an LSTM unit or a GRU.<br />
<br />
Then the decoder of NMT computes the conditional probability of each target word $y_t$ conditioned on its preceding words $y_{<t}$ and source sentence:<br />
$$<br />
\begin{align*}<br />
c_t = q(r_{t-1};h_1,...,h_{T_x})\\<br />
r_t = g(r_{t-1},y_{t-1},c_t)\\<br />
P(y_t|y_{<t},x)\propto exp(y_t;r_t,c_t)<br />
\end{align*}<br />
$$<br />
where,<br />
<br />
$c_t$: weighted contextual information summarizing the source sentence x using some attention mechanism.<br />
<br />
$r_t$: decoder RNN hidden representation at step t, computed by an LSTM or GRU<br />
<br />
Notations:<br />
# $Θ$: All learned parameters<br />
# $π_{Θ}$: Translation model with parameter $Θ$<br />
# D: training dataset that contains source-target sentence pairs.<br />
The training process seeks the optimal parameters $Θ^*$ for encoding the source sentence and decoding it into the target sentence, using maximum likelihood estimation (MLE):<br />
<br />
\begin{align*}<br />
Θ^* &= \displaystyle\arg\max_{Θ} \prod\limits_{(x,y)∈D}P(y|x;Θ) \\<br />
&= \displaystyle\arg\max_{Θ}\prod\limits_{(x,y)∈D}\prod\limits^{T_y}_{t=1}P(y_t|y_{<t},x;Θ) <br />
\end{align*}<br />
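The product inside this objective factorizes over time steps, so in log-space the sentence likelihood is just a sum of per-step conditional log-probabilities. A minimal sketch, with a hypothetical `cond_prob` standing in for the decoder's softmax output:<br />

```python
import math

def sentence_log_likelihood(target, cond_prob):
    """log P(y|x; Theta) = sum_t log P(y_t | y_<t, x; Theta):
    MLE training maximizes this sum over all sentence pairs in D.
    cond_prob(prefix, word) stands in for the decoder softmax."""
    return sum(math.log(cond_prob(tuple(target[:t]), target[t]))
               for t in range(len(target)))

# Hypothetical decoder that assigns probability 0.5 to every word:
uniform = lambda prefix, word: 0.5
ll = sentence_log_likelihood(["je", "suis"], uniform)  # log(0.5 * 0.5)
```

Working in log-space is also the standard numerical trick: products of many small conditional probabilities underflow, while sums of their logs do not.<br />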
<br />
=Value Network for NMT=<br />
'''Motivation''':<br />
Beam search has limited ability to predict a high-quality sentence due to myopic bias. Locally good words may not be the best words for the complete sentence, and beam search can erroneously choose such words; these errors constitute the myopic bias (short-sightedness). To reduce myopic bias, the long-term value of each action needs to be predicted, and this value should be used in decoding.<br />
<br />
<br />
[3]<sup>[[#References|[3]]]</sup> introduced the scheduled sampling approach, which uses both the model's own generated outputs and the ground-truth sentence during training to help the model learn from its own errors, but this cannot avoid the myopic bias of beam search during testing. [4]<sup>[[#References|[4]]]</sup> learns a predictor of the ranking score of a certain word at step t, and uses this score in place of the conditional probability output by the NMT model for beam search during testing. Unfortunately, this work still looks only one step forward and cannot address the problem. This motivates the authors to estimate the expected performance of any sequence during decoding, based on the concept of a value function in reinforcement learning.<br />
<br />
==Value Network Structure==<br />
In conventional reinforcement learning, a value function describes how much cumulative reward can be collected from state s by following a certain policy $π$. In NMT, consider $x$ and $y_{<t}$ as the state and $π_{Θ}$ as the policy, which can generate a word (action) given any state. The value function characterizes the expected translation performance (e.g. BLEU score) when using $π_{Θ}$ to translate $x$ given the first $t-1$ words $y_{<t}$. The value function is defined as:<br />
<br />
<math>v(x,y_{<t}) = \sum\limits_{y'∈Y: y'_{<t}=y_{<t}}BLEU(y^*(x),y')P(y'|x;Θ)</math><br />
<br />
where $y^*(x)$: ground truth translation, $Y$: the space of complete sentence.<br />
<br />
<br />
Two modules are developed to fully exploit the information in the encoder-decoder framework.<br />
[[File:VNFigure5.png|thumb|600px|center|Figure 5: Architecture of Value Network]]<br />
'''Semantic Matching (SM) Module''': at time step t, use mean pooling over the decoder RNN hidden states and over the context states as summarizations of the partial target sequence and of the context in the source language. Then use a feed-forward network to evaluate the semantic similarity between the source sentence and the target sentence,<br />
i.e.,<br />
# $\bar{r_{t}} = \frac{1}{t}\sum\limits^{t}_{l=1}r_l$, $\bar{c_{t}} = \frac{1}{t}\sum\limits^{t}_{l=1}c_l$<br />
# $𝜇_{SM} = f_{SM}([\bar{r_{t}},\bar{c_{t}}])$<br />
<br />
'''Context-Coverage (CC) Module''': it is often observed that more context covered in the attention model leads to better translation. The CC module is built to measure the coverage of information in the network. Similarly to the SM module, the process is defined as:<br />
# $\bar{h} = \frac{1}{T_x}\sum\limits^{T_x}_{l=1}h_l$<br />
# $𝜇_{CC} = f_{CC}([\bar{c_{t}},\bar{h}])$<br />
<br />
In the end, the authors concatenate $𝜇_{SM}$ and $𝜇_{CC}$ and then use another fully connected layer with sigmoid activation function to output a scalar as the value prediction. (Figure 5)<br />
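A shape-level sketch of this head (mean pooling, SM and CC modules, concatenation, sigmoid output): the dimensions, random weights, and single tanh layers below are illustrative assumptions standing in for the learned $f_{SM}$ and $f_{CC}$.<br />

```python
import numpy as np

rng = np.random.default_rng(2)

def mean_pool(states):
    # The bar-r_t, bar-c_t, bar-h summarizations: average over time steps.
    return np.mean(states, axis=0)

def feed_forward(x, W, b):
    # One hypothetical tanh layer standing in for f_SM / f_CC.
    return np.tanh(W @ x + b)

def value_network(r_states, c_states, h_states, params):
    """Mean-pool decoder states r_t, contexts c_t, and encoder states h_l;
    run SM on [bar r, bar c] and CC on [bar c, bar h]; concatenate and
    squash to a scalar value prediction with a sigmoid."""
    r_bar, c_bar, h_bar = map(mean_pool, (r_states, c_states, h_states))
    mu_sm = feed_forward(np.concatenate([r_bar, c_bar]), *params["sm"])
    mu_cc = feed_forward(np.concatenate([c_bar, h_bar]), *params["cc"])
    logit = params["w_out"] @ np.concatenate([mu_sm, mu_cc]) + params["b_out"]
    return 1.0 / (1.0 + np.exp(-logit))  # sigmoid -> scalar in (0, 1)

d, hdim = 4, 3  # hidden size and module width (illustrative)
params = {"sm": (rng.normal(size=(hdim, 2 * d)), rng.normal(size=hdim)),
          "cc": (rng.normal(size=(hdim, 2 * d)), rng.normal(size=hdim)),
          "w_out": rng.normal(size=2 * hdim), "b_out": 0.0}
v = value_network(rng.normal(size=(5, d)), rng.normal(size=(5, d)),
                  rng.normal(size=(7, d)), params)
```

Mean pooling makes the prediction length-independent: the same head scores a 5-word prefix and a 50-word prefix, which is what the decoding algorithm needs.<br />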
<br />
==Training and Learning==<br />
The authors adopt the Monte-Carlo method to learn the value function. The training of the value network for NMT model $π_{Θ}$ is shown in Algorithm 1. For randomly picked source sentence $x$ in the training corpus, they generate a partial target sentence $y_p$ using $\pi_{\Theta}$ with random early stop, i.e., they randomly terminate the decoding process before its end. Then for the pair $(x, y_p)$, they use $\pi_{\Theta}$ to finish the translation starting from $y_p$ and obtain a set $S(y_p)$ of K complete target sentences, e.g., using beam search. In the end, they compute the BLEU score of each complete target sentence and calculate the averaged BLEU score of $(x, y_p)$.<br />
[[File:VNFigure6.png|thumb|600px|center]]<br />
[[File:VNFigure7.png|thumb|600px|center]]<br />
<br />
In the conventional Monte-Carlo method for value function estimation, one usually uses a regression model to approximate the value function, i.e., learns a mapping $(x, y_p) \rightarrow avg\_bleu(x, y_p)$ by minimizing the mean squared error (MSE). In this paper, the authors take an alternative objective function which is shown to be more effective in experiments.<br />
[[File:VNFigure8.png|thumb|600px|center]]<br />
The authors hope the predicted score of $(x, y_{p,1})$ is larger than that of $(x, y_{p,2})$ by a certain margin whenever $avg\_bleu(x, y_{p,1}) > avg\_bleu(x, y_{p,2})$. The reason for using this loss function is to penalize bad examples (those with $v_w(x,y_{p,2}) > v_w(x,y_{p,1})$) exponentially. <br />
<br />
Notes:<br />
* 4-8 for training; 9 for learning<br />
* 9: Eqn. (7): for value function estimation, the authors minimize a pairwise ranking loss to learn the mapping from $(x,y_p)$ to avg_bleu$(x,y_p)$, since they want the value network to be useful in differentiating good and bad examples.<br />
* 11: $w$ is the learned weights of the value network<br />
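As an illustration of the pairwise ranking idea in note 9, the sketch below uses a simple exponential margin loss on a (better, worse) prefix pair. This is an assumed form consistent with the description (violations of the desired ordering are penalized exponentially), not necessarily the paper's exact Eqn. (7).<br />

```python
import math

def pairwise_rank_loss(v_good, v_bad):
    """Exponential pairwise ranking loss on a pair of value predictions,
    where the first prefix has the higher avg_bleu: the loss is small when
    v_good exceeds v_bad by a margin, and grows exponentially when the
    value network ranks the worse prefix above the better one."""
    return math.exp(v_bad - v_good)
```

Training on such pairs only requires the value network to order prefixes correctly, which is a weaker (and often easier) target than regressing the exact avg_bleu value with MSE.<br />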
<br />
==Decoding Process==<br />
<br />
The authors linearly combine the conditional probability $P(y|x)$, which is the output of the NMT model, and the value network's prediction of the future reward, motivated by the success of AlphaGo<sup>[[#References|[5]]]</sup>. They believe this generates better results by considering both past and future information. Mathematically, given a translation model $P(y|x)$, a value network $v(x,y)$ and a hyperparameter $\alpha \in (0,1)$, the '''score''' of a partial sequence y for x is:<br />
<br />
$𝛼×\frac{1}{|y|}\,log\,P(y|x) + (1-𝛼)×log\,v(x,y)$<br />
<br />
where $|y|$ is the length of y. The details of the decoding process are presented in Algorithm 2. This neural network based decoding algorithm is called NMT-VNN for short.<br />
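The scoring rule can be sketched directly; `nmt_vnn_score` is a hypothetical name, and `log_p_y` is assumed to be the cumulative log-probability of $y$ under the NMT model:

```python
import math

def nmt_vnn_score(log_p_y, v_xy, length, alpha):
    """Score of a partial sequence y for source x:
    alpha * (1/|y|) * log P(y|x) + (1 - alpha) * log v(x, y)."""
    return alpha * log_p_y / length + (1.0 - alpha) * math.log(v_xy)
```

Setting alpha = 1 recovers the length-normalized log-probability used by plain beam search, matching the first note below.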
[[File:VNFigure9.png|thumb|600px|center]]<br />
Notes:<br />
* If $\alpha=1$, the algorithm reduces to the original beam search (i.e., the value network is ignored)<br />
* $U_{expand}$: {each partial sentence $y_i$ from the last time step, extended with every word $w$ from the vocabulary}<br />
* $U_{complete}$: {all sentences that become complete at the current time step, i.e., end with an <EOS> token}<br />
* $S$: {all complete sentences found by beam search from time step 1 up to time step t-1}<br />
* Step 11: Output sequence y with the highest score of the K sequences<br />
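Putting the notes together, a toy version of Algorithm 2 might look as follows; `step_logprobs` and `value_fn` are assumed interfaces standing in for the NMT model and the value network, and the real algorithm operates on full decoder states rather than plain tuples of words:

```python
import math

def nmt_vnn_beam_search(step_logprobs, value_fn, K=4, alpha=0.8,
                        max_len=10, eos="<eos>"):
    """Toy sketch of NMT-VNN decoding (Algorithm 2). Assumed interfaces:
    step_logprobs(prefix) -> {word: log P(word | x, prefix)};
    value_fn(prefix) -> estimated future BLEU in (0, 1)."""
    def score(y, log_p):
        # alpha * (1/|y|) log P(y|x) + (1 - alpha) * log v(x, y)
        return alpha * log_p / len(y) + (1 - alpha) * math.log(value_fn(y))

    beams, finished = [((), 0.0)], []
    for _ in range(max_len):
        # U_expand: append every vocabulary word to every partial sentence
        expanded = [(y + (w,), lp + wlp)
                    for y, lp in beams
                    for w, wlp in step_logprobs(y).items()]
        expanded.sort(key=lambda c: score(*c), reverse=True)
        beams = []
        for y, lp in expanded[:K]:
            # U_complete / S: sentences ending in <EOS> are set aside as complete
            (finished if y[-1] == eos else beams).append((y, lp))
        if not beams:
            break
    candidates = finished + beams
    # Step 11: output the sequence with the highest score
    return max(candidates, key=lambda c: score(*c))[0]
```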
<br />
=Experiments=<br />
==Settings==<br />
The authors compare the proposed NMT-VNN with two baselines: the classic NMT with beam search (NMT-BS)<sup>[[#References|[6]]]</sup> and beam search optimization (NMT-BSO)<sup>[[#References|[4]]]</sup>, which trains a predictor to evaluate the quality of any partial sequence and then uses the predictor, instead of the conditional probability, to select words. <br />
The difference between NMT-BSO and NMT-VNN is that NMT-BSO predicts the local improvement of BLEU for any single word, while NMT-VNN predicts the '''final''' BLEU score and uses the predicted score to '''select words'''.<br />
<br />
The models are tested on three language pairs: English→French (En→Fr), English→German (En→De), and Chinese→English (Zh→En). They used the same bilingual corpora from WMT’14 as in [6], which contain 12M, 4.5M and 10M training sentence pairs for the three tasks, respectively.<sup>[[#References|[6]]]</sup><br />
* For En→Fr and En→De: validation set: newstest2012 and newstest2013; test set: newstest2014<br />
* For Zh→En: validation set: NIST 2004; test set: NIST 2006 and NIST 2008<br />
For the Chinese datasets, they use a public tool for Chinese word segmentation<sup>[[#References|[7]]]</sup>. In all experiments, validation sets were used only for early stopping and hyperparameter tuning.<br />
<br />
For NMT-VNN and NMT-BS, they first set experimental parameters to train an NMT model following <sup>[[#References|[6]]]</sup>. The vocabulary for each language consists of the 30K most common words in the parallel corpora, and words not in the vocabulary (i.e., unknown words) were replaced with a special token “UNK". Each word was embedded into a vector space of 620 dimensions, and the recurrent units have a dimension of 1000. Only sentences with length ≤ 50 were kept in the training set. Batch size was set to 80, with 20 batches pre-fetched and sorted by sentence length.<br />
<br />
The NMT model was trained with asynchronous SGD on four K40m GPUs for about seven days. For NMT-BSO, they implemented the algorithm themselves and trained the model in the same environment.<br />
For the value network used in NMT-VNN, they set the same parameters for the encoder and decoder layers as in the NMT model. Additionally, in the SM and CC modules, they set the functions $μ_{SM}$ and $μ_{CC}$ to be single-layer feed-forward networks with 1000 output nodes. In Algorithm 1, they set the hyperparameter K = 20 to estimate the value of any partial sequence. They used mini-batch training with batch size 80, and the value network was trained with AdaDelta <sup>[[#References|[8]]]</sup> on one K40m GPU for about three days.<br />
<br />
<br />
During testing, the hyperparameter $\alpha$ for NMT-VNN was set by cross validation; the optimal $\alpha$ for En→Fr, En→De and Zh→En was 0.85, 0.9 and 0.8, respectively. They used the BLEU score <sup>[[#References|[9]]]</sup> as the evaluation metric, computed by the multi-bleu.perl script<sup>[[#References|[10]]]</sup>. Beam search size was set to 12 for all algorithms, following common practice <sup>[[#References|[11]]]</sup>.<br />
<br />
==Results==<br />
[[File:VNFigure10.png|thumb|400px|right]]<br />
[[File:VNFigure11.png|thumb|400px|right|Figure 6: BLEU Score vs. Source Sentence Length]]<br />
Table 1 shows that the NMT-VNN algorithm outperforms the baseline algorithms on all tasks, especially on the harder translation task (Zh→En). <br />
<br />
For En→Fr and En→De tasks, NMT-VNN outperforms NMT-BS by 1.03/1.3 points due to the additional use of value network, which suggests the additional knowledge provides useful information to help the NMT model. NMT-VNN outperforms NMT-BSO by about 0.31/0.33 points since NMT-BSO only uses a local BLEU predictor to estimate the partial BLEU score while NMT-VNN predicts the future performance, which shows the advantage of considering long-term rewards. For Zh→En task, NMT-VNN outperforms NMT-BS by 1.4/1.82 points on NIST 2006 and NIST 2008, and outperforms NMT-BSO by 1.01/0.72 points.<br />
<br />
The plots of BLEU score vs. source sentence length (Figure 6) show that the NMT-VNN algorithm outperforms the baseline algorithms for almost every sentence length.<br />
<br />
Furthermore, they also test the value network on a deep NMT model in which the encoder and decoder both have 4-layer LSTMs. The result (Table 1) shows 0.33 points improvement on the En→Fr task. These results demonstrate the effectiveness and robustness of the NMT-VNN algorithm.<br />
<br />
==Analysis on Value Network==<br />
[[File:VNFigure12.png|thumb|600px|center|Figure 7: (a)BLEU scores of En→De task w.r.t different beam size (b)BLEU scores of En→De<br />
task w.r.t different hyperparameter 𝛼]]<br />
First, the additional component in decoding affects the efficiency of the translation process. The value network is very similar to the basic NMT model in terms of architecture and computational complexity. As an advantage, the two models can be trained in parallel.<br />
<br />
Second, the accuracy of NMT sometimes drops dramatically as the beam size grows on certain tasks, because the training of NMT favors short but inadequate translation candidates<sup>[[#References|[12]]]</sup>. This happens for the En→De translation; however, such degradation can be largely avoided by using the value network. Figure 7(a) shows the accuracy (BLEU score) for different beam sizes for the NMT-BS and NMT-VNN models. NMT-VNN is much more stable than the original NMT: its accuracy differs only slightly across beam sizes, while NMT-BS drops more than 0.5 points when the beam size is large.<br />
<br />
Third, the performance of NMT-VNN for different values of the hyperparameter $\alpha$ during decoding on the En→De task is shown in Figure 7(b): it is stable for $\alpha$ ranging from 0.7 to 0.95, and drops slightly for smaller $\alpha$. The authors conclude that the proposed algorithm is robust to this hyperparameter.<br />
<br />
=Conclusions and Future Work=<br />
This paper introduces a new decoding scheme that incorporates a value network into NMT. The new scheme considers not only the local conditional probability of a candidate word but also its long-term reward for future decoding. Experiments on three translation tasks verify the effectiveness of the new scheme. For future work, the authors propose:<br />
*designing better structures for the value network.<br />
*extending it to other sequence-to-sequence learning tasks, such as image captioning and dialog systems.<br />
<br />
<br />
There is a follow-up paper[14] from the same authors, which not only modifies the decoder objective as explained in this paper but also changes the structure of the framework by introducing a second-pass decoder. That approach achieves a new single-model state-of-the-art result on WMT’14 English to French translation.<br />
<br />
==Critiques==<br />
# It is a good idea to use the value network to account for future reward in addition to beam search, which is based only on past information. Intuitively, NMT-VNN should improve on NMT by using both past and future information.<br />
# The paper does not give much information about, or any quantitative evaluation of, the two modules they built; for example, how much does each module contribute to the accuracy of the NMT-VNN model?<br />
# In Algorithm 1 step 5, it is reasonable to generate partial target sentences $y_{p,1}$, $y_{p,2}$ using $π_{Θ}$ with random early stop, but if the stop times of the two sentences are too far apart, beam search may run into problems, since the beam size is related to the number of words of the sentence left to generate.<br />
# In the experiment results section, they only test the value network on a deep NMT model on the En→Fr task and show the improvement, is that true for all other translations?<br />
# In the decoder section of the network, the authors could have experimented with fast-forward linear connections while stacking LSTMs. This technique has produced some of the best empirical accuracies in machine translation (En→Fr BLEU = 37.7) [13].<br />
<br />
=References=<br />
<br />
[1] https://github.com/tensorflow/nmt<br />
<br />
[2] https://en.wikipedia.org/wiki/BLEU<br />
<br />
[3] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179, 2015.<br />
<br />
[4] S. Wiseman and A. M. Rush. Sequence-to-sequence learning as beam-search optimization. In EMNLP, 2016.<br />
<br />
[5] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.<br />
<br />
[6] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. ICLR, 2015.<br />
<br />
[7] [https://baike.baidu.com/item/%E4%B8%AD%E6%96%87%E5%88%87%E8%AF%8D Chinese word segmentation]<br />
<br />
[8] M. D. Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.<br />
<br />
[9] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311–318. Association for Computational Linguistics, 2002.<br />
<br />
[10] https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl.<br />
<br />
[11] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014.<br />
<br />
[12] Z. Tu, Y. Liu, L. Shang, X. Liu, and H. Li. Neural machine translation with reconstruction. In AAAI, pages 3097–3103, 2017.<br />
<br />
[13] Jie Zhou, Ying Cao, Xuguang Wang, Peng Li, Wei Xu ; "Deep Recurrent Models with Fast-Forward Connections for Neural Machine Translation". arXiv:1606.04199 [cs.CL]<br />
<br />
[14] Yingce Xia et al. Deliberation Networks: Sequence Generation Beyond One-Pass Decoding, NIPS 2017<br />
<br />
[15] https://towardsdatascience.com/sequence-to-sequence-model-introduction-and-concepts-44d9b41cd42d</div>
<hr />
Modular Multitask Reinforcement Learning with Policy Sketches (2017-11-28)
<div>='''Introduction & Background'''=<br />
[[File:MRL0.png|border|right|400px]]<br />
[[File:MRL_diagram.jpg|thumb|right|400px| Figure 1b: the diagram for policy sketches]]<br />
[[File:MRL_encode.jpg|thumb|right|600px| Figure 1c: All sub tasks are encoded without any semantic meanings]]<br />
This paper describes a framework for learning composable deep subpolicies in a multitask setting. These policies are guided only by abstract sketches which are representative of the high-level behavior in the environment. Sketches annotate tasks with sequences of named subtasks, providing information about high-level structural relationships among tasks but not how to implement them; in particular, sketches do not include the detailed guidance used by much previous work on learning policy abstractions for RL (e.g. intermediate rewards, subtask completion signals, or intrinsic motivations). General reinforcement learning algorithms allow agents to solve tasks in complex environments, but vanilla policies struggle with tasks featuring extremely delayed rewards, and most approaches require in-depth supervision in the form of explicitly specified high-level actions, subgoals, or behavioral primitives. The proposed methodology is particularly suitable where rewards are difficult to engineer by hand: it is enough to tell the learner about the abstract policy structure, without indicating how high-level behaviors should use primitive percepts or actions.<br />
<br />
This paper explores a multitask reinforcement learning setting where the learner is presented with policy sketches. Policy sketches are short, ungrounded, symbolic representations of a task that describe its component subtasks, as shown in Figure 1. Symbols may be shared across different tasks (the predicate "get wood" appears in the sketches for both "make planks" and "make sticks"). The learner is not shown or told anything about what these symbols mean, either in terms of observations or intermediate rewards. The sketches are initially meaningless, but eventually they get mapped onto real policies. As shown in Figure 1c, the tasks are divided into human-readable subtasks; in the actual setting, however, the learner only has access to encoded symbols.<br />
<br />
The agent learns from policy sketches by associating each high-level action with a parameterization of a low-level subpolicy. It jointly optimizes over concatenated task-specific policies by tying/sharing parameters across common subpolicies. They find that this architecture uses the high-level guidance provided by sketches to drastically accelerate learning of complex multi-stage behaviors. The experiments show that most benefits of learning from very detailed low-level supervision (e.g. from subgoal rewards) can also be obtained from fairly coarse high-level policy sketches. Most importantly, sketches are much easier to construct. They require no additions or modifications to the environment dynamics or reward function and can be easily provided by non-experts (third party mechanical turk providers). This makes it possible to extend the benefits of hierarchical RL to challenging environments where it may not be possible to specify by hand the details of relevant subtasks. This paper shows that their approach drastically outperforms purely unsupervised methods that do not provide the learner with any task-specific guidance. The specific use of sketches to parameterize modular subpolicies makes better use of sketches than conditioning on them directly.<br />
<br />
The modular structure of this approach, which associates every high-level action symbol with a discrete subpolicy, naturally leads to a library of interpretable policy fragments which can easily be recombined. The authors evaluate the approach in a variety of data conditions: <br />
# Learning the full collection of tasks jointly via reinforcement learning <br />
# In a zero-shot setting where a policy sketch is available for a held-out task<br />
# In an adaptation setting, where sketches are hidden and the agent must learn to use and adapt a pretrained policy to reuse high-level actions in a new task.<br />
<br />
The code has been released at http://github.com/jacobandreas/psketch.<br />
<br />
='''Related Work'''=<br />
The approach in this paper is a specific case of the options framework developed by Sutton et al., 1999. In that work, options are introduced as "closed-loop policies for taking action over a period of time". They show that options enable temporally abstract information to be included in reinforcement learning algorithms, though it was published before the large-scale popularity of neural networks for reinforcement learning.<br />
<br />
The authors in this paper focus on learning in an interactive environment. However, they have done similar work in other applications [14]. There, they developed models for question answering tasks. However, in the present work there is no longer direct supervision of the learning process. The authors claim that the two problems are complementary and propose that in the future natural language hints are used instead of semi-structured sketches.<br />
<br />
Other authors have recently explored techniques for learning policies which require less prior knowledge of the environment than the method presented in this paper. For example, in Vezhnevets et al. (2016), the authors propose an RNN architecture to build "implicit plans" only through interacting with the environment as in the classic reinforcement learning problem formulation.<br />
<br />
One closely related line of work is the Hierarchical Abstract Machines (HAM) framework introduced by Parr & Russell, 1998 [11]. Like the approach which the Modular Multitask Reinforcement Learning with Policy Sketches uses, HAMs begin with a representation of a high-level policy as an automaton (or a more general computer program; Andre & Russell,<br />
2001 [7]; Marthi et al., 2004 [12]) and use reinforcement learning to fill in low-level details.<br />
<br />
='''Learning Modular Policies from Sketches'''=<br />
The paper considers a multitask reinforcement learning problem arising from a family of infinite-horizon discounted Markov decision processes in a shared environment. This environment is specified by a tuple $(S, A, P, \gamma )$, with <br />
* $S$ a set of states<br />
* $A$ a set of low-level actions <br />
* $P : S \times A \times S \to R$ a transition probability distribution<br />
* $\gamma$ a discount factor<br />
<br />
Each task $t \in T$ is then specified by a pair $(R_t, \rho_t)$, with $R_t : S \to R$ a task-specific reward function and $\rho_t: S \to R$, an initial distribution over states. For a fixed sequence ${(s_i, a_i)}$ of states and actions obtained from a rollout of a given policy, we will denote the empirical return starting in state $s_i$ as $q_i = \sum_{j=i+1}^\infty \gamma^{j-i-1}R(s_j)$. In addition to the components of a standard multitask RL problem, we assume that tasks are annotated with sketches $K_t$ , each consisting of a sequence $(b_{t1},b_{t2},...)$ of high-level symbolic labels drawn from a fixed vocabulary $B$.<br />
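For concreteness, the empirical return defined above can be computed over a finite rollout as follows (a straightforward transcription of the formula, truncating the infinite sum at the end of the rollout; note the sum starts at $j = i+1$, so the reward at the current state is excluded):

```python
def empirical_return(rewards, i, gamma):
    """q_i = sum_{j=i+1}^T gamma^(j-i-1) * R(s_j), where rewards[j] = R(s_j)
    for a finite rollout of length T+1."""
    return sum(gamma ** (j - i - 1) * rewards[j]
               for j in range(i + 1, len(rewards)))
```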
<br />
==Model==<br />
The authors exploit the structural information provided by sketches by constructing, for each symbol $b$, a corresponding subpolicy $\pi_b$. By sharing each subpolicy across all tasks annotated with the corresponding symbol, the approach naturally learns a shared abstraction for the corresponding subtask. Symbolic sketches can be thought of as sequences of high-level symbolic labels drawn from some fixed vocabulary, initially devoid of any meaning. Eventually the sketches get mapped onto real policies, enabling policy transfer and temporal abstraction. Learning occurs through a variant of the standard actor-critic architecture that is explained in later sections.<br />
<br />
[[File:Algorithm_MRL2.png|center|frame|Pseudo Algorithms for Modular Multitask Reinforcement Learning with Policy Sketches]]<br />
<br />
At every timestep, a subpolicy selects either a low-level action $a \in A$ or a special STOP action. The augmented action space is denoted $A^+ := A \cup \{STOP\}$. At a high level, this framework is agnostic to the implementation of subpolicies: any function mapping a representation of the current state to a distribution over $A^+$ works with the approach.<br />
<br />
In this paper, $\pi_b$ is represented as a neural network. These subpolicies may be viewed as options of the kind described by [2], with the key distinction that they have no initiation semantics, but are instead invokable everywhere, and have no explicit representation as a function from an initial state to a distribution over final states (instead this paper uses the STOP action to terminate).<br />
<br />
Given a fixed sketch $(b_1, b_2,....)$, a task-specific policy $\Pi_r$ is formed by concatenating its associated subpolicies in sequence. In particular, the high-level policy maintains a subpolicy index $i$ (initially 0), and executes actions from $\pi_{b_i}$ until the STOP symbol is emitted, at which point control is passed to $\pi_{b_{i+1}}$. We may thus think of $\Pi_r$ as inducing a Markov chain over the state space $S \times B$, with transitions:<br />
[[File:MRL1.png|center|border|]]<br />
<br />
Note that $\Pi_r$ is semi-Markov with respect to the projection of the augmented state space $S \times B$ onto the underlying state space $S$. The complete family of task-specific policies is denoted $\Pi := \bigcup_r \{ \Pi_r \}$. Let each $\pi_b$ be an arbitrary function of the current environment state parameterized by some weight vector $\theta_b$. The learning problem is to optimize over all $\theta_b$ to maximize expected discounted reward<br />
[[File:MRL2.png|center|border|]]<br />
across all tasks $t \in T$.<br />
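The control flow implied by this construction, running each subpolicy until it emits STOP and then advancing to the next sketch symbol, can be sketched as follows; the counter environment and subpolicies are purely hypothetical illustrations:

```python
def run_sketch(reset, step, subpolicies, sketch, max_steps=100):
    """Execute a task policy: run the subpolicy for each sketch symbol
    until it emits STOP, then hand control to the next symbol."""
    s, i, t = reset(), 0, 0
    while i < len(sketch) and t < max_steps:
        a = subpolicies[sketch[i]](s)   # action drawn from A+ = A u {STOP}
        if a == "STOP":
            i += 1                      # pass control to the next subpolicy
        else:
            s = step(s, a)
        t += 1
    return s

# Hypothetical counter environment: the state is an integer, actions add to it.
# Each toy subpolicy acts (+1) until the state reaches its target, then STOPs.
def make_counter_policy(target):
    return lambda s: "STOP" if s >= target else 1

final_state = run_sketch(lambda: 0, lambda s, a: s + a,
                         {"b1": make_counter_policy(2),
                          "b2": make_counter_policy(3)},
                         ["b1", "b2"])
```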
<br />
==Policy Optimization==<br />
<br />
The control policy is parameterized by a parameter vector $\theta$: $$\displaystyle \max_{\theta}E\Big[\sum_{t=0}^{H}R(s_{t})\,\Big|\,\pi_{\theta}\Big]$$ where $\pi_{\theta}(u|s)$ is the probability of action $u$ in state $s$. Further details on policy optimization can be found here: https://people.eecs.berkeley.edu/~pabbeel/nips-tutorial-policy-optimization-Schulman-Abbeel.pdf<br />
<br />
Here that optimization is accomplished through a simple decoupled actor–critic method. In a standard policy gradient approach, with a single policy $\pi$ with parameters $\theta$, the gradient steps are of the form:<br />
[[File:MRL3.png|center|border|]]<br />
<br />
where the baseline or “critic” c can be chosen independently of the future without introducing bias into the gradient. Recalling the previous definition of $q_i$ as the empirical return starting from $s_i$, this form of the gradient corresponds to a generalized advantage estimator with $\lambda = 1$. Here ''c'' achieves close to the optimal variance[6] when it is set exactly equal to the state-value function $V_{\pi} (s_i) = E_{\pi} q_i$ for the target policy $\pi$ starting in state $s_i$.<br />
[[File:MRL4.png|frame|]]<br />
<br />
In generalizing to modular policies built by sequencing subpolicies, the authors propose one subpolicy per symbol but one critic per task. This is because a subpolicy $\pi_b$ might participate in many compound policies $\Pi_r$, each associated with its own reward function $R_r$; individual subpolicies are therefore not uniquely identified with value functions. The actor-critic method is extended to decouple policies from value functions by allowing the critic to vary per sample (per task and timestep), based on the reward function with which that particular sample is associated. Noting that <br />
[[File:MRL5.png|center|border|]]<br />
i.e. the sum of gradients of expected rewards across all tasks in which $\pi_b$ participates, we have:<br />
[[File:MRL6.png|center|border|]]<br />
where each state-action pair $(s_{t_i}, a_{t_i})$ was selected by the subpolicy $\pi_b$ in the context of the task ''t''.<br />
<br />
Now minimization of the gradient variance requires that each $c_t$ actually depend on the task identity. (This follows immediately by applying the corresponding argument in [6] individually to each term in the sum over ''t'' in Equation 2.) Because the value function is itself unknown, an approximation must be estimated from data. Here these $c_t$ are allowed to be implemented with an arbitrary function approximator with parameters $\eta_t$, trained to minimize a squared-error criterion, with gradients given by<br />
[[File:MRL7.png|center|border|]]<br />
Alternative forms of the advantage estimator (e.g. the TD residual $R_t(s_i) + \gamma V_t(s_{i+1}) - V_t(s_i)$, or any other member of the generalized advantage estimator family) can be substituted by simply maintaining one such estimator per task. Experiments show that conditioning on both the state and the task identity results in dramatic performance improvements, suggesting that the variance reduction given by this objective is important for efficient joint learning of modular policies.<br />
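As a minimal illustration of the per-task critics (assuming, as in the paper's experiments, that each critic is a linear function of the state features), one squared-error SGD step might look like:

```python
def update_task_critic(w_t, feats, q, lr=0.1):
    """One SGD step on the squared error (v_t(s) - q)^2 for a linear
    per-task critic v_t(s) = w_t . feats. Each task t keeps its own
    weight vector eta_t = w_t, even though subpolicies are shared."""
    v = sum(w * f for w, f in zip(w_t, feats))
    err = v - q                 # d/dw (1/2)(v - q)^2 = (v - q) * feats
    return [w - lr * err * f for w, f in zip(w_t, feats)]
```

In the full algorithm this update is applied per task to the samples collected under that task's reward function, while the subpolicy gradients are accumulated across all tasks in which each $\pi_b$ participates.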
<br />
The complete algorithm for computing a single gradient step is given in Algorithm 1. (The outer training loop over these steps, which is driven by a curriculum learning procedure, is shown in Algorithm 2.) Note that this is an on-policy algorithm. In every step, the agent samples tasks from a task distribution provided by a curriculum (described in the following subsection). The current family of policies '''$\Pi$''' is used to perform rollouts for every sampled task, accumulating the resulting tuples of (states, low-level actions, high-level symbols, rewards, and task identities) into a dataset ''$D$''. Once ''$D$'' reaches a maximum size, it is used to compute gradients with respect to both policy and critic parameters, and the parameter vectors are updated accordingly. The step sizes $\alpha$ and $\beta$ in Algorithm 1 can be chosen adaptively using any first-order method.<br />
<br />
==Curriculum Learning==<br />
<br />
For complex tasks, like the one depicted in Figure 3b, it is difficult for the agent to discover any states with positive reward until many subpolicy behaviors have already been learned. It is thus a better use of the learner's time (and computational resources) to focus on "easy" tasks, where many rollouts result in high reward and relevant subpolicy behavior can be obtained. But there is a fundamental tradeoff involved here: if the learner spends too much time on easy tasks before being told of the existence of harder ones, it may overfit and learn subpolicies that no longer generalize or exhibit the desired structural properties.<br />
<br />
To resolve these issues, a curriculum learning scheme (Bengio et al., 2009) is used that allows the model to smoothly scale up from easy tasks to more difficult ones without overfitting. Initially, the model is presented with tasks associated with short sketches. Once average reward on all these tasks reaches a certain threshold, the length limit is incremented. It is assumed that rewards across tasks are normalized so that the maximum achievable reward satisfies $0 < q_i < 1$. Let $Er_t$ denote the empirical estimate of the expected reward of the current policy on task $t$. Then at each timestep, tasks are sampled in proportion to $1-Er_t$, which by assumption must be positive.<br />
<br />
Intuitively, the tasks that provide the strongest learning signal are those in which <br />
# The agent does not on average achieve reward close to the upper bound<br />
# Many episodes result in a high reward.<br />
<br />
The expected reward component of the curriculum solves condition (1) by making sure that time is not spent on nearly solved tasks, while the length bound component of the curriculum addresses condition (2) by ensuring that tasks are not attempted until high-reward episodes are likely to be encountered. The experiments performed show that both components of this curriculum learning scheme improve the rate at which the model converges to a good policy.<br />
<br />
The complete curriculum-based training algorithm is written as Algorithm 2 above. Initially, the maximum sketch length $l_{max}$ is set to 1, and the curriculum initialized to sample length-1 tasks uniformly. For each setting of $l_{max}$, the algorithm uses the current collection of task policies to compute and apply the gradient step described in Algorithm 1. The rollouts obtained from the call to TRAIN-STEP can also be used to compute reward estimates $Er_t$ ; these estimates determine a new task distribution for the curriculum. The inner loop is repeated until the reward threshold $r_{good}$ is exceeded, at which point $l_{max}$ is incremented and the process repeated over a (now-expanded) collection of tasks.<br />
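The task-sampling rule of the curriculum can be sketched as follows (function and variable names are ours; `reward_est` plays the role of the estimates $Er_t$):

```python
import random

def sample_task(reward_est, sketch_len, l_max):
    """Curriculum sampling sketch: restrict to tasks whose sketch length
    is at most l_max, and sample task t with probability proportional
    to 1 - E[r_t] (rewards are assumed normalized to (0, 1))."""
    eligible = [t for t in reward_est if sketch_len[t] <= l_max]
    weights = [1.0 - reward_est[t] for t in eligible]
    return random.choices(eligible, weights=weights, k=1)[0]
```

Nearly solved tasks (estimated reward close to 1) are sampled rarely, and tasks with long sketches are not attempted at all until $l_{max}$ has grown.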
<br />
='''Experiments'''=<br />
[[File:MRL8.png|border|right|400px]]<br />
This paper considers three families of tasks: a 2-D Minecraft-inspired crafting game (Figure 3a), in which the agent must acquire particular resources by finding raw ingredients, combining them together in the correct order, and in some cases building intermediate tools that enable the agent to alter the environment itself; a 2-D maze navigation task that requires the agent to collect keys and open doors, and a 3-D locomotion task (Figure 3b) in which a quadrupedal robot must actuate its joints to traverse a narrow winding cliff.<br />
<br />
In all tasks, the agent receives a reward only after the final goal is accomplished. For the most challenging tasks, involving sequences of four or five high-level actions, a task-specific agent initially following a random policy essentially never discovers the reward signal, so these tasks cannot be solved without considering their hierarchical structure. These environments involve various kinds of challenging low-level control: agents must learn to avoid obstacles, interact with various kinds of objects, and relate fine-grained joint activation to high-level locomotion goals.<br />
<br />
==Implementation==<br />
In all of the experiments, each subpolicy is implemented as a neural network with ReLU nonlinearities and a hidden layer with 128 hidden units. Each critic is a linear function of the current state. Each subpolicy network receives as input a set of features describing the current state of the environment and outputs a distribution over actions. The agent acts at every timestep by sampling from this distribution. The gradient steps given in lines 8 and 9 of Algorithm 1 are implemented using RMSPROP with a step size of 0.001 and gradient clipping to a unit norm. They take the batch size D in Algorithm 1 to be 2000 and set $\gamma$= 0.9 in both environments. For curriculum learning, the improvement threshold $r_{good}$ is 0.8.<br />
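A plain-Python sketch of one subpolicy's forward pass as described above (one ReLU hidden layer, then a softmax over the augmented action set; the weights and shapes here are illustrative, and the real implementation would use a deep learning framework with 128 hidden units):

```python
import math

def subpolicy_forward(W1, b1, W2, b2, x):
    """Forward pass of one subpolicy: ReLU hidden layer followed by a
    softmax producing a distribution over A+ (low-level actions + STOP)."""
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    logits = [sum(w * hi for w, hi in zip(row, h)) + b
              for row, b in zip(W2, b2)]
    m = max(logits)                     # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]
```

The agent would then act by sampling an action from the returned distribution at every timestep.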
<br />
==Environments==<br />
<br />
The environment in Figure 3a is inspired by the popular game Minecraft, but is implemented in a discrete 2-D world. The agent interacts with objects in the environment by executing a special USE action when it faces them. Picking up raw materials initially scattered randomly around the environment adds to an inventory. Interacting with different crafting stations causes objects in the agent’s inventory to be combined or transformed. Each task in this game corresponds to some crafted object the agent must produce; the most complicated goals require the agent to also craft intermediate ingredients, and in some cases build tools (like a pickaxe and a bridge) to reach ingredients located in initially inaccessible regions of the world.<br />
<br />
[[File:MRL_maze.png|border|right|400px]]<br />
<br />
The maze environment is very similar to the “light world” described by [4], and can be seen in Figure 3c. The agent is placed in a discrete world consisting of a series of rooms, some of which are connected by doors; the agent must first pick up a key to open them. For the experiments, each task corresponds to a goal room that the agent must reach through a sequence of intermediate rooms. The agent senses the distance to keys, closed doors, and open doors in each direction. Sketches specify a particular sequence of directions for the agent to traverse between rooms to reach the goal. The sketch always corresponds to a viable traversal from the start to the goal position, but other (possibly shorter) traversals may also exist.<br />
<br />
The cliff environment (Figure 3b) proves the effectiveness of the approach in a high-dimensional continuous control environment where a quadrupedal robot [5] is placed on a variable-length winding path, and must navigate to the end without falling off. This is a challenging RL problem since the walker must learn the low-level walking skill before it can make any progress. The agent receives a small reward for making progress toward the goal, and a large positive reward for reaching the goal square, with a negative reward for falling off the path.<br />
<br />
==Multitask Learning==<br />
<br />
[[File:MRL9.png|border|center|800px]]<br />
The primary experimental question in this paper is whether the extra structure provided by policy sketches alone is enough to enable fast learning of coupled policies across tasks. The aim is to explore the differences between the approach described and relevant prior work that performs either unsupervised or weakly supervised multitask learning of hierarchical policy structure. Specifically, they compare their '''modular''' approach to:<br />
<br />
# Structured hierarchical reinforcement learners:<br />
#* the fully unsupervised '''option–critic''' algorithm of Bacon & Precup[1]<br />
#* a '''Q automaton''' that attempts to explicitly represent the Q function for each task/subtask combination (essentially a HAM [8] with a deep state abstraction function)<br />
# Alternative ways of incorporating sketch data into standard policy gradient methods:<br />
#* learning an '''independent''' policy for each task<br />
#* learning a '''joint policy''' across all tasks, conditioning directly on both environment features and a representation of the complete sketch<br />
<br />
The joint and independent models performed best when trained with the same curriculum described in Section 3.3, while the option–critic model performed best with a length–weighted curriculum that has access to all tasks from the beginning of training.<br />
<br />
Learning curves for the baselines and the modular model are shown in Figure 4. In all environments, the modular approach substantially outperforms the baselines: it induces policies with substantially higher average reward and converges more quickly than the policy gradient baselines. Figure 4c further shows that after policies have been learned on simple tasks, the model is able to rapidly adapt to more complex ones, even when the longer tasks involve high-level actions not required for any of the short tasks.<br />
<br />
==Ablations==<br />
[[File:MRL10.png|border|right|400px]]<br />
In addition to the overall modular parameter-tying structure induced by sketches, the other critical components of the training procedure are the decoupled critic and the curriculum. Ablation analysis investigates the extent to which these components contribute to the performance.<br />
<br />
To evaluate the critic, consider three ablations: <br />
# Removing the dependence of the model on the environment state, in which case the baseline is a single scalar per task<br />
# Removing the dependence of the model on the task, in which case the baseline is a conventional generalized advantage estimator<br />
# Removing both, in which case the baseline is a single scalar, as in a vanilla policy gradient approach.<br />
<br />
Experiment results are shown in Figure 5a. Introducing both state and task dependence into the baseline leads to faster convergence of the model: the approach with a constant baseline achieves less than half the overall performance of the full critic after 3 million episodes. Introducing task and state dependence individually each improves this performance; combining them gives the best result.<br />
<br />
Two other experiments, shown in Figure 5b, evaluate the curriculum components: starting with short examples and moving to long ones, and sampling tasks in inverse proportion to their accumulated reward. Both components help; prioritization by both length and weight gives the best results.<br />
<br />
==Zero-shot and Adaptation Learning==<br />
[[File:MRL11.png|border|left|320px]]<br />
In the final experiments, the authors test the model’s ability to generalize beyond the standard training condition. Consider two tests of generalization: a zero-shot setting, in which the model is provided with a sketch for the new task and must immediately achieve good performance, and an adaptation setting, where no sketch is provided, leaving the model to learn a suitable sketch via interaction. For zero-shot experiments, the concatenated policy described by the sketch of each held-out task is formed, and this policy is repeatedly executed (without learning) to obtain an estimate of its effectiveness. For adaptation experiments, ordinary RL is performed over high-level actions B rather than low-level actions A, implementing the high-level learner with the same agent architecture as described in the section 'Learning Modular Policies from Sketches'. Results are shown in the table to the left. The held-out tasks are sufficiently challenging that the baselines are unable to obtain more than negligible reward: in particular, the joint model overfits to the training tasks and cannot generalize to new sketches, while the independent model cannot discover enough of a reward signal to learn in the adaptation setting. The modular model does comparatively well: individual subpolicies succeed in novel zero-shot configurations (suggesting that they have in fact discovered the behavior suggested by the semantics of the sketch) and provide a suitable basis for adaptive discovery of new high-level policies.<br />
<br />
='''Conclusion & Critique'''=<br />
The paper's contributions are:<br />
<br />
* A general paradigm for multitask, hierarchical, deep reinforcement learning guided by abstract sketches of task-specific policies.<br />
<br />
* A concrete recipe for learning from these sketches, built on a general family of modular deep policy representations and a multitask actor–critic training objective.<br />
<br />
This paper studies the problem of abstract hierarchical multitask RL with policy sketches, high-level descriptions of abstract actions. The work is related to much previous work in hierarchical RL, and adds some new elements by using neural implementations of prior work on hierarchical learning and skill representations. The authors have described an approach for multitask learning of deep policies guided by symbolic policy sketches. By associating each symbol appearing in a sketch with a modular neural subpolicy, they have shown that it is possible to build agents that share behavior across tasks in order to achieve success in tasks with sparse and delayed rewards. This process induces an inventory of reusable and interpretable subpolicies which can be employed for zero-shot generalization when further sketches are available, and for hierarchical reinforcement learning when they are not.<br />
<br />
Hierarchical reinforcement learning is a popular research topic now, and it is interesting to compare this work with the FeUdal Networks paper (which was presented in this class) [13], whose architecture follows a manager-and-workers style. In that work, the subpolicy is selected by a manager network: to finish a hierarchical task, each worker focuses on and optimizes its own subtask. The difference is that in that work the subpolicies are implicit and are discovered during training. A further question for future work on the present paper: could these subpolicies be learned automatically instead of being pre-defined?<br />
<br />
There are four drawbacks to the presented work. First, the ideas portrayed in this paper (for instance, symbolic specifications, actor-critic methods, shared representations) have all been explored in other works. Second, the approach relies heavily on curriculum learning, which is difficult to design and can be quite complicated depending on the task at hand. Third, there is no discussion of how a curriculum can be designed for larger problems. Finally, building a different neural network for each subtask could lead to overly complicated networks and is not in the spirit of building an efficient structure.<br />
<br />
When the paper is considered in the context of the OpenReview discussion with the authors [https://openreview.net/forum?id=H1kjdOYlx], some interesting criticisms of the paper are brought to light. Many of the criticisms concern the novelty of the approach. Reviewers contest that the paper doesn't seem to offer anything new except a new implementation in the context of deep learning. The problem solved here can be seen as an extension of the option-learning problem with richer supervision, which makes it simple to tackle with existing RL schemes; for example, learning from natural language instructions makes the problem easier. The model proposed in the paper also shares much in common with existing hierarchical RL methods. The authors maintain, however, that the use of diagram instruction is indeed novel. Moreover, they emphasize the ease of generating these policy sketches (noting "The extra annotation we use here literally fits in a 10-line text file.") and the dramatic improvements that this approach provides. They further argue that this approach is fundamentally different from the traditional NLP approaches claimed to be equivalent: those approaches are useless in the absence of natural language instructions, since the model conditions on both the state and the instructions. Experiment 1 in this paper shows the added generality of the approach presented.<br />
<br />
A more salient criticism has to do with the utility of such an approach. Even if it is taken as granted that this approach, as presented, is novel and the results are replicable, what does this mean for future work or for applications of reinforcement learning? While the authors suggest that the sketches are, in fact, straightforward to create, practitioners are less likely to desire the generation of such sketches versus providing natural language instructions, for instance.<br />
<br />
= '''Resources''' =<br />
You can find a talk on this paper [https://www.youtube.com/watch?v=NRIcDEB64x8 here].<br />
<br />
<br />
<br />
='''References'''=<br />
[1] Bacon, Pierre-Luc and Precup, Doina. The option-critic architecture. In NIPS Deep Reinforcement Learning Workshop, 2015.<br />
<br />
[2] Sutton, Richard S, Precup, Doina, and Singh, Satinder. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1):181–211, 1999.<br />
<br />
[3] Stolle, Martin and Precup, Doina. Learning options in reinforcement learning. In International Symposium on Abstraction, Reformulation, and Approximation, pp. 212–223. Springer, 2002.<br />
<br />
[4] Konidaris, George and Barto, Andrew G. Building portable options: Skill transfer in reinforcement learning. In IJCAI, volume 7, pp. 895–900, 2007.<br />
<br />
[5] Schulman, John, Moritz, Philipp, Levine, Sergey, Jordan, Michael, and Abbeel, Pieter. Trust region policy optimization. In International Conference on Machine Learning, 2015b.<br />
<br />
[6] Greensmith, Evan, Bartlett, Peter L, and Baxter, Jonathan. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.<br />
<br />
[7] Andre, David and Russell, Stuart. Programmable reinforcement learning agents. In Advances in Neural Information Processing Systems, 2001.<br />
<br />
[8] Andre, David and Russell, Stuart. State abstraction for programmable reinforcement learning agents. In Proceedings of the Meeting of the Association for the Advancement of Artificial Intelligence, 2002.<br />
<br />
[9] Author Jacob Andreas presenting the paper - https://www.youtube.com/watch?v=NRIcDEB64x8<br />
<br />
[10] Vezhnevets, A., Mnih, V., Osindero, S., Graves, A., Vinyals, O., & Agapiou, J. (2016). Strategic attentive writer for learning macro-actions. In Advances in Neural Information Processing Systems (pp. 3486-3494).<br />
<br />
[11] Parr, Ron and Russell, Stuart. Reinforcement learning with hierarchies of machines. In Advances in Neural Information Processing Systems, 1998.<br />
<br />
[12] Marthi, Bhaskara, Latham, David, Guestrin, Carlos, and Russell, Stuart. Concurrent hierarchical reinforcement learning. In Proceedings of the Meeting of the Association for the Advancement of Artificial Intelligence, 2004.<br />
<br />
[13] A. S. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and K. Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. arXiv preprint arXiv:1703.01161, 2017.<br />
<br />
[14] J. Andreas, M. Rohrbach, T. Darrell, D. Klein. Learning to compose neural networks for question answering. In ''Proceedings of the Annual Meeting of the North American Chapter of the Association for Computational Linguistics'', 2016.<br />
<br />
[15] Bengio, Yoshua, et al. "Curriculum learning." Proceedings of the 26th annual international conference on machine learning. ACM, 2009.<br />
<br />
= Appendix =<br />
The authors provide a brief appendix that gives a complete list of tasks and sketches. Asterisk * indicates that the task was held out for generalization experiments in Section 4.5, but included in the multitask experiments of Sections 4.3 and 4.4.<br />
<br />
[[File: tasks.PNG]]</div>

Modular Multitask Reinforcement Learning with Policy Sketches (2017-11-28T19:18:48Z)<p>Jdeng: /* Curriculum Learning */</p>
<hr />
<div>='''Introduction & Background'''=<br />
[[File:MRL0.png|border|right|400px]]<br />
[[File:MRL_diagram.jpg|thumb|right|400px| Figure 1b: the diagram for policy sketches]]<br />
[[File:MRL_encode.jpg|thumb|right|600px| Figure 1c: All sub tasks are encoded without any semantic meanings]]<br />
This paper describes a framework for learning composable deep subpolicies in a multitask setting. These policies are guided only by abstract sketches which are representative of the high-level behavior in the environment. Sketches annotate tasks with sequences of named subtasks, providing information about high-level structural relationships among tasks but not how to implement them—specifically not providing the detailed guidance used by much previous work on learning policy abstractions for RL (e.g. intermediate rewards, subtask completion signals, or intrinsic motivations). General reinforcement learning algorithms allow agents to solve tasks in complex environments. Vanilla policies find it difficult to deal with tasks featuring extremely delayed rewards. Most approaches often require in-depth supervision in the form of explicitly specified high-level actions, subgoals, or behavioral primitives. The proposed methodology is particularly suitable where rewards are difficult to engineer by hand. It is enough to tell the learner about the abstract policy structure, without indicating how high-level behaviors should try to use primitive percepts or actions.<br />
<br />
This paper explores a multitask reinforcement learning setting where the learner is presented with policy sketches. Policy sketches are defined as short, ungrounded, symbolic representations of a task that describe its components, as shown in Figure 1. Symbols might be shared across different tasks (the predicate "get wood" appears in sketches for both the tasks "make planks" and "make sticks"). The learner is not shown or told anything about what these symbols mean, either in terms of observations or intermediate rewards. These sketches are initially meaningless, but eventually they get mapped into real policies. As shown in Figure 1c, the tasks are divided into human-readable subtasks; however, in the actual setting, the learner can only access encoded results.<br />
<br />
The agent learns from policy sketches by associating each high-level action with a parameterization of a low-level subpolicy. It jointly optimizes over concatenated task-specific policies by tying/sharing parameters across common subpolicies. They find that this architecture uses the high-level guidance provided by sketches to drastically accelerate learning of complex multi-stage behaviors. The experiments show that most benefits of learning from very detailed low-level supervision (e.g. from subgoal rewards) can also be obtained from fairly coarse high-level policy sketches. Most importantly, sketches are much easier to construct. They require no additions or modifications to the environment dynamics or reward function and can be easily provided by non-experts (third party mechanical turk providers). This makes it possible to extend the benefits of hierarchical RL to challenging environments where it may not be possible to specify by hand the details of relevant subtasks. This paper shows that their approach drastically outperforms purely unsupervised methods that do not provide the learner with any task-specific guidance. The specific use of sketches to parameterize modular subpolicies makes better use of sketches than conditioning on them directly.<br />
<br />
The modular structure of this whole approach, which associates every high-level action symbol with a discrete subpolicy, naturally leads to a library of interpretable policy fragments which can be easily recombined. The authors evaluate the approach in a variety of different data conditions: <br />
# Learning the full collection of tasks jointly via reinforcement learning <br />
# In a zero-shot setting where a policy sketch is available for a held-out task<br />
# In an adaptation setting, where sketches are hidden and the agent must learn to use and adapt a pretrained policy to reuse high-level actions in a new task.<br />
<br />
The code has been released at http://github.com/jacobandreas/psketch.<br />
<br />
='''Related Work'''=<br />
The approach in this paper is a specific case of the options framework developed by Sutton et al., 1999. In that work, options are introduced as "closed-loop policies for taking action over a period of time". They show that options enable temporally abstract information to be included in reinforcement learning algorithms, though the work was published before the large-scale popularity of neural networks for reinforcement learning.<br />
<br />
The authors of this paper focus on learning in an interactive environment; however, they have done similar work in other applications [14], where they developed models for question answering tasks. In the present work there is no longer direct supervision of the learning process. The authors claim that the two problems are complementary and propose that in the future natural language hints be used instead of semi-structured sketches.<br />
<br />
Other authors have recently explored techniques for learning policies which require less prior knowledge of the environment than the method presented in this paper. For example, in Vezhnevets et al. (2016), the authors propose an RNN architecture to build "implicit plans" only through interacting with the environment as in the classic reinforcement learning problem formulation.<br />
<br />
One closely related line of work is the Hierarchical Abstract Machines (HAM) framework introduced by Parr & Russell, 1998 [11]. Like the approach which the Modular Multitask Reinforcement Learning with Policy Sketches uses, HAMs begin with a representation of a high-level policy as an automaton (or a more general computer program; Andre & Russell,<br />
2001 [7]; Marthi et al., 2004 [12]) and use reinforcement learning to fill in low-level details.<br />
<br />
='''Learning Modular Policies from Sketches'''=<br />
The paper considers a multitask reinforcement learning problem arising from a family of infinite-horizon discounted Markov decision processes in a shared environment. This environment is specified by a tuple $(S, A, P, \gamma )$, with <br />
* $S$ a set of states<br />
* $A$ a set of low-level actions <br />
* $P : S \times A \times S \to R$ a transition probability distribution<br />
* $\gamma$ a discount factor<br />
<br />
Each task $t \in T$ is then specified by a pair $(R_t, \rho_t)$, with $R_t : S \to R$ a task-specific reward function and $\rho_t: S \to R$, an initial distribution over states. For a fixed sequence ${(s_i, a_i)}$ of states and actions obtained from a rollout of a given policy, we will denote the empirical return starting in state $s_i$ as $q_i = \sum_{j=i+1}^\infty \gamma^{j-i-1}R(s_j)$. In addition to the components of a standard multitask RL problem, we assume that tasks are annotated with sketches $K_t$ , each consisting of a sequence $(b_{t1},b_{t2},...)$ of high-level symbolic labels drawn from a fixed vocabulary $B$.<br />
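As a concrete illustration, the empirical return $q_i = \sum_{j=i+1}^\infty \gamma^{j-i-1}R(s_j)$ can be computed for every step of a finite rollout with a single backward pass. This is a minimal sketch (the function name is hypothetical, not from the paper's code):<br />

```python
def empirical_returns(rewards, gamma=0.9):
    """Compute q_i = sum_{j>i} gamma^(j-i-1) * R(s_j) for each step i
    of a finite rollout, via one backward pass over the reward sequence."""
    q = [0.0] * len(rewards)
    running = 0.0
    for i in reversed(range(len(rewards))):
        q[i] = running                       # return accumulated after s_i
        running = rewards[i] + gamma * running
    return q
```

For example, with rewards [1, 2, 3] and $\gamma = 0.5$, the returns are $q_0 = 2 + 0.5 \cdot 3 = 3.5$, $q_1 = 3$, and $q_2 = 0$ (nothing follows the last state).<br />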
<br />
==Model==<br />
The authors exploit the structural information provided by sketches by constructing, for each symbol ''b'', a corresponding subpolicy $\pi_b$. By sharing each subpolicy across all tasks annotated with the corresponding symbol, their approach naturally learns the tied/shared abstraction for the corresponding subtask. Symbolic sketches can be thought of as high-level symbolic labels drawn from some fixed vocabulary, which are initially devoid of any meaning. Eventually the sketches get mapped into real policies, enabling policy transfer and temporal abstraction. Learning occurs through a variant of the standard actor-critic architecture that will be explained in later sections.<br />
<br />
[[File:Algorithm_MRL2.png|center|frame|Pseudo Algorithms for Modular Multitask Reinforcement Learning with Policy Sketches]]<br />
<br />
At every timestep, a subpolicy selects either a low-level action $a \in A$ or a special STOP action. The augmented action space is denoted as $A^+ := A \cup \{STOP\}$. At a high level, this framework is agnostic to the implementation of subpolicies: any function that takes a representation of the current state to a distribution over $A^+$ will work with the approach.<br />
<br />
In this paper, $\pi_b$ is represented as a neural network. These subpolicies may be viewed as options of the kind described by [2], with the key distinction that they have no initiation semantics, but are instead invokable everywhere, and have no explicit representation as a function from an initial state to a distribution over final states (instead this paper uses the STOP action to terminate).<br />
<br />
Given a fixed sketch $(b_1, b_2, \ldots)$, a task-specific policy $\Pi_r$ is formed by concatenating its associated subpolicies in sequence. In particular, the high-level policy maintains a subpolicy index ''i'' (initially 0), and executes actions from $\pi_{b_i}$ until the STOP symbol is emitted, at which point control is passed to $\pi_{b_{i+1}}$. We may thus think of $\Pi_r$ as inducing a Markov chain over the state space $S \times B$, with transitions:<br />
[[File:MRL1.png|center|border|]]<br />
<br />
Note that $\Pi_r$ is semi-Markov with respect to the projection of the augmented state space $S \times B$ onto the underlying state space ''S''. The complete family of task-specific policies is denoted as $\Pi := \bigcup_r \{ \Pi_r \}$. Let each $\pi_b$ be an arbitrary function of the current environment state parameterized by some weight vector $\theta_b$. The learning problem is to optimize over all $\theta_b$ to maximize the expected discounted reward<br />
[[File:MRL2.png|center|border|]]<br />
across all tasks $t \in T$.<br />
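To make the control flow concrete, the sequencing of subpolicies with STOP handoffs can be sketched as below. The environment interface and all names here are hypothetical stand-ins for illustration, not the paper's code:<br />

```python
STOP = "STOP"  # sentinel action terminating the current subpolicy

def run_task_policy(sketch, subpolicies, env, max_steps=1000):
    """Execute the concatenated policy for a sketch (b1, b2, ...):
    run subpolicy pi_{b_i} until it emits STOP, then hand control
    to the subpolicy for the next symbol b_{i+1}."""
    state, trace = env.reset(), []
    for symbol in sketch:                  # subpolicy index i advances here
        for _ in range(max_steps):
            action = subpolicies[symbol](state)
            if action == STOP:             # pass control to the next symbol
                break
            state = env.step(action)
            trace.append((symbol, action))
    return state, trace
```

Note the STOP action never reaches the environment; it only transfers control, which is what makes the concatenated policy semi-Markov over the underlying state space.<br />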
<br />
==Policy Optimization==<br />
<br />
A control policy is parameterized by a parameter vector $\theta$: $$\displaystyle \max_{\theta}E[\sum_{t=0}^{H}R(s_{t})|\pi_{\theta}]$$ where $\pi_{\theta}(u|s)$ is the probability of taking action $u$ in state $s$. The details of policy optimization can be found here: https://people.eecs.berkeley.edu/~pabbeel/nips-tutorial-policy-optimization-Schulman-Abbeel.pdf<br />
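As a toy illustration of the score-function (REINFORCE) gradient that underlies such policy optimization, consider a two-action sigmoid policy on a bandit; this is a hedged sketch for intuition, not the paper's method:<br />

```python
import numpy as np

def reinforce_grad(theta, actions, returns):
    """Score-function estimate of d/dtheta E[R | pi_theta] for a
    two-action policy with pi_theta(a=1) = sigmoid(theta):
    average over samples of (grad log pi(a_i)) * q_i,
    where grad log pi(a) = a - pi_theta(a=1) for a in {0, 1}."""
    p1 = 1.0 / (1.0 + np.exp(-theta))      # probability of action 1
    grads = [(a - p1) * q for a, q in zip(actions, returns)]
    return float(np.mean(grads))
```

With equal returns for both actions the estimate averages to zero, reflecting that no action direction is preferred.<br />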
<br />
Here that optimization is accomplished through a simple decoupled actor–critic method. In a standard policy gradient approach, with a single policy $\pi$ with parameters $\theta$, the gradient steps are of the form:<br />
[[File:MRL3.png|center|border|]]<br />
<br />
where the baseline or “critic” c can be chosen independently of the future without introducing bias into the gradient. Recalling the previous definition of $q_i$ as the empirical return starting from $s_i$, this form of the gradient corresponds to a generalized advantage estimator with $\lambda = 1$. Here ''c'' achieves close to the optimal variance[6] when it is set exactly equal to the state-value function $V_{\pi} (s_i) = E_{\pi} q_i$ for the target policy $\pi$ starting in state $s_i$.<br />
[[File:MRL4.png|frame|]]<br />
<br />
When generalizing to modular policies built by sequencing subpolicies, the authors suggest having one subpolicy per symbol but one critic per task. This is because subpolicies $\pi_b$ might participate in many compound policies $\Pi_r$, each associated with its own reward function $R_r$. Thus individual subpolicies are not uniquely identified with value functions. The actor–critic method is extended to allow decoupling of policies from value functions by allowing the critic to vary per sample (per task and timestep) based on the reward function with which that particular sample is associated. Noting that <br />
[[File:MRL5.png|center|border|]]<br />
i.e. the sum of gradients of expected rewards across all tasks in which $\pi_b$ participates, we have:<br />
[[File:MRL6.png|center|border|]]<br />
where each state-action pair $(s_{t_i}, a_{t_i})$ was selected by the subpolicy $\pi_b$ in the context of the task ''t''.<br />
<br />
Now minimization of the gradient variance requires that each $c_t$ actually depend on the task identity. (This follows immediately by applying the corresponding argument in [6] individually to each term in the sum over ''t'' in Equation 2.) Because the value function is itself unknown, an approximation must be estimated from data. Here each $c_t$ is allowed to be implemented with an arbitrary function approximator with parameters $\eta_t$, trained to minimize a squared error criterion, with gradients given by<br />
[[File:MRL7.png|center|border|]]<br />
Alternative forms of the advantage estimator (e.g. the TD residual $R_t(s_i) + \gamma V_t(s_{i+1}) - V_t(s_i)$, or any other member of the generalized advantage estimator family) can be substituted by simply maintaining one such estimator per task. Experiments show that conditioning on both the state and the task identity results in dramatic performance improvements, suggesting that the variance reduction given by this objective is important for efficient joint learning of modular policies.<br />
<br />
The complete algorithm for computing a single gradient step is given in Algorithm 1. (The outer training loop over these steps, which is driven by a curriculum learning procedure, is shown in Algorithm 2.) Note that this is an on-policy algorithm. In every step, the agent samples tasks from a task distribution provided by a curriculum (described in the following subsection). The current family of policies '''$\Pi$''' is used to perform rollouts for every sampled task, accumulating the resulting tuples of (states, low-level actions, high-level symbols, rewards, and task identities) into a dataset. Once the dataset reaches a maximum size $D$, it is used to compute gradients with respect to both policy and critic parameters, and the parameter vectors are updated accordingly. The step sizes $\alpha$ and $\beta$ in Algorithm 1 can be chosen adaptively using any first-order method.<br />
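The per-task critic can be illustrated with a small sketch. Assuming, as in the paper's implementation, a linear critic $c_t(s) = \eta_t \cdot s$ for each task, the advantage used to weight the policy gradient and the critic's squared-error gradient look roughly as follows (all names are illustrative):<br />

```python
import numpy as np

def task_advantages(states, returns, eta, task):
    """Advantage q_i - c_t(s_i) with a separate linear critic per task,
    c_t(s) = eta[task] . s. Decoupling critics by task lets a shared
    subpolicy receive gradient signal scaled by the correct task reward."""
    baselines = states @ eta[task]
    return returns - baselines

def critic_grad(states, returns, eta, task):
    """Gradient of the squared error sum_i (q_i - c_t(s_i))^2 w.r.t. eta_t,
    matching the critic update in Algorithm 1 (up to step size)."""
    adv = task_advantages(states, returns, eta, task)
    return -2.0 * states.T @ adv
```

A full training step would apply this per task to the samples collected for that task, while the policy gradients for shared subpolicies sum over every task in which the subpolicy participated.<br />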
<br />
==Curriculum Learning==<br />
<br />
For complex tasks, like the one depicted in Figure 3b, it is difficult for the agent to discover any states with positive reward until many subpolicy behaviors have already been learned. It is thus a better use of the learner’s time (and computational resources) to focus on “easy” tasks, where many rollouts will result in high reward from which relevant subpolicy behavior can be obtained. But there is a fundamental tradeoff involved here: if the learner spends a lot of its time on easy tasks before being told of the existence of harder ones, it may overfit, learning subpolicies that fail to exhibit the desired structural properties or to generalize.<br />
<br />
To resolve these issues, a curriculum learning scheme (Bengio et al., 2009) is used that allows the model to smoothly scale up from easy tasks to more difficult ones without overfitting. Initially, the model is presented with tasks associated with short sketches. Once the average reward on all these tasks reaches a certain threshold, the length limit is incremented. It is assumed that rewards across tasks are normalized with maximum achievable reward $0 < q_i < 1$. Let $Er_t$ denote the empirical estimate of the expected reward for the current policy on task $t$. Then at each timestep, tasks are sampled in proportion to $1-Er_t$, which by assumption must be positive.<br />
<br />
Intuitively, the tasks that provide the strongest learning signal are those in which <br />
# The agent does not on average achieve reward close to the upper bound<br />
# Many episodes result in a high reward.<br />
<br />
The expected reward component of the curriculum solves condition (1) by making sure that time is not spent on nearly solved tasks, while the length bound component of the curriculum addresses condition (2) by ensuring that tasks are not attempted until high-reward episodes are likely to be encountered. The experiments performed show that both components of this curriculum learning scheme improve the rate at which the model converges to a good policy.<br />
<br />
The complete curriculum-based training algorithm is written as Algorithm 2 above. Initially, the maximum sketch length $l_{max}$ is set to 1, and the curriculum is initialized to sample length-1 tasks uniformly. For each setting of $l_{max}$, the algorithm uses the current collection of task policies to compute and apply the gradient step described in Algorithm 1. The rollouts obtained from the call to TRAIN-STEP can also be used to compute reward estimates $Er_t$; these estimates determine a new task distribution for the curriculum. The inner loop is repeated until the reward threshold $r_{good}$ is exceeded, at which point $l_{max}$ is incremented and the process repeated over a (now-expanded) collection of tasks.<br />
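The curriculum's task-sampling step can be sketched as follows, assuming rewards are normalized to $[0, 1)$ so that every weight $1 - Er_t$ is positive (function and variable names are hypothetical):<br />

```python
import random

def sample_task(reward_estimates, tasks, l_max):
    """Curriculum sampling: restrict to tasks whose sketch length is at
    most l_max, then pick one with probability proportional to 1 - Er_t.
    tasks maps a task name to its sketch (a list of symbols);
    reward_estimates maps a task name to its empirical reward Er_t."""
    eligible = [t for t in tasks if len(tasks[t]) <= l_max]
    weights = [1.0 - reward_estimates.get(t, 0.0) for t in eligible]
    return random.choices(eligible, weights=weights, k=1)[0]
```

Nearly solved tasks (reward close to 1) thus receive nearly zero sampling weight, while tasks longer than the current length bound are not attempted at all.<br />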
<br />
='''Experiments'''=<br />
[[File:MRL8.png|border|right|400px]]<br />
This paper considers three families of tasks: a 2-D Minecraft-inspired crafting game (Figure 3a), in which the agent must acquire particular resources by finding raw ingredients, combining them together in the correct order, and in some cases building intermediate tools that enable the agent to alter the environment itself; a 2-D maze navigation task that requires the agent to collect keys and open doors; and a 3-D locomotion task (Figure 3b) in which a quadrupedal robot must actuate its joints to traverse a narrow winding cliff.<br />
<br />
In all tasks, the agent receives a reward only after the final goal is accomplished. For the most challenging tasks, involving sequences of four or five high-level actions, a task-specific agent initially following a random policy essentially never discovers the reward signal, so these tasks cannot be solved without considering their hierarchical structure. These environments involve various kinds of challenging low-level control: agents must learn to avoid obstacles, interact with various kinds of objects, and relate fine-grained joint activation to high-level locomotion goals.<br />
<br />
==Implementation==<br />
In all of the experiments, each subpolicy is implemented as a neural network with ReLU nonlinearities and a hidden layer with 128 hidden units. Each critic is a linear function of the current state. Each subpolicy network receives as input a set of features describing the current state of the environment and outputs a distribution over actions. The agent acts at every timestep by sampling from this distribution. The gradient steps given in lines 8 and 9 of Algorithm 1 are implemented using RMSPROP with a step size of 0.001 and gradient clipping to a unit norm. They take the batch size D in Algorithm 1 to be 2000 and set $\gamma$= 0.9 in both environments. For curriculum learning, the improvement threshold $r_{good}$ is 0.8.<br />
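A subpolicy network matching this description (one 128-unit ReLU hidden layer feeding a softmax over the action distribution) might be sketched in NumPy as follows; the initialization scheme is an assumption for illustration, not taken from the paper:<br />

```python
import numpy as np

def init_subpolicy(n_features, n_actions, hidden=128, seed=0):
    """Parameters for one subpolicy: a single hidden layer of 128 ReLU
    units followed by a linear layer over the augmented action space."""
    rng = np.random.default_rng(seed)
    return {"W1": rng.normal(0.0, 0.1, (n_features, hidden)),
            "b1": np.zeros(hidden),
            "W2": rng.normal(0.0, 0.1, (hidden, n_actions)),
            "b2": np.zeros(n_actions)}

def action_distribution(params, state):
    """Forward pass: features -> ReLU hidden layer -> softmax over actions.
    The agent acts by sampling from the returned distribution."""
    h = np.maximum(0.0, state @ params["W1"] + params["b1"])
    logits = h @ params["W2"] + params["b2"]
    exp = np.exp(logits - logits.max())    # shift for numerical stability
    return exp / exp.sum()
```

The critic, being a linear function of the current state, would be just a weight vector per task, as in the decoupled actor–critic description above.<br />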
<br />
==Environments==<br />
<br />
The environment in Figure 3a is inspired by the popular game Minecraft, but is implemented in a discrete 2-D world. The agent interacts with objects in the environment by executing a special USE action when it faces them. Picking up raw materials initially scattered randomly around the environment adds to an inventory. Interacting with different crafting stations causes objects in the agent’s inventory to be combined or transformed. Each task in this game corresponds to some crafted object the agent must produce; the most complicated goals require the agent to also craft intermediate ingredients, and in some cases build tools (like a pickaxe and a bridge) to reach ingredients located in initially inaccessible regions of the world.<br />
<br />
[[File:MRL_maze.png|border|right|400px]]<br />
<br />
The maze environment is very similar to the “light world” described by [4], which can be seen in Figure 3c. The agent is placed in a discrete world consisting of a series of rooms, some of which are connected by locked doors; the agent must first pick up a key to open them. In the experiments, each task corresponds to a goal room that the agent must reach through a sequence of intermediate rooms. The agent senses the distance to keys, closed doors, and open doors in each direction. Sketches specify a particular sequence of directions for the agent to traverse between rooms to reach the goal. The sketch always corresponds to a viable traversal from the start to the goal position, but other (possibly shorter) traversals may also exist.<br />
<br />
The cliff environment (Figure 3b) proves the effectiveness of the approach in a high-dimensional continuous control environment where a quadrupedal robot [5] is placed on a variable-length winding path, and must navigate to the end without falling off. This is a challenging RL problem since the walker must learn the low-level walking skill before it can make any progress. The agent receives a small reward for making progress toward the goal, and a large positive reward for reaching the goal square, with a negative reward for falling off the path.<br />
<br />
==Multitask Learning==<br />
<br />
[[File:MRL9.png|border|center|800px]]<br />
The primary experimental question in this paper is whether the extra structure provided by policy sketches alone is enough to enable fast learning of coupled policies across tasks. The aim is to explore the differences between the approach described and relevant prior work that performs either unsupervised or weakly supervised multitask learning of hierarchical policy structure. Specifically, they compare their '''modular''' approach to:<br />
<br />
# Structured hierarchical reinforcement learners:<br />
#* the fully unsupervised '''option–critic''' algorithm of Bacon & Precup[1]<br />
#* a '''Q automaton''' that attempts to explicitly represent the Q function for each task/subtask combination (essentially a HAM [8] with a deep state abstraction function)<br />
# Alternative ways of incorporating sketch data into standard policy gradient methods:<br />
#* learning an '''independent''' policy for each task<br />
#* learning a '''joint policy''' across all tasks, conditioning directly on both environment features and a representation of the complete sketch<br />
<br />
The joint and independent models performed best when trained with the same curriculum described in Section 3.3, while the option–critic model performed best with a length–weighted curriculum that has access to all tasks from the beginning of training.<br />
<br />
Learning curves for the baselines and the modular model are shown in Figure 4. In all environments, the modular approach substantially outperforms the baselines: it induces policies with substantially higher average reward and converges more quickly than the policy gradient baselines. Figure 4c further shows that after policies have been learned on simple tasks, the model is able to rapidly adapt to more complex ones, even when the longer tasks involve high-level actions not required for any of the short tasks.<br />
<br />
==Ablations==<br />
[[File:MRL10.png|border|right|400px]]<br />
In addition to the overall modular parameter-tying structure induced by sketches, the other critical components of the training procedure are the decoupled critic and the curriculum. Ablation analysis investigates the extent to which these components contribute to performance.<br />
<br />
To evaluate the critic, consider three ablations: <br />
# Removing the dependence of the model on the environment state, in which case the baseline is a single scalar per task<br />
# Removing the dependence of the model on the task, in which case the baseline is a conventional generalized advantage estimator<br />
# Removing both, in which case the baseline is a single scalar, as in a vanilla policy gradient approach.<br />
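These three ablations can be read as progressively simpler forms of the baseline. A small illustrative sketch (placeholder random weights, not the learned critics):<br />

```python
import numpy as np

rng = np.random.default_rng(1)
n_tasks, d_state = 3, 8
state = rng.normal(size=d_state)
ret = 1.0  # discounted return observed on a rollout for task 0

# Full decoupled critic: a separate linear function of the state per task.
W = rng.normal(0, 0.1, size=(n_tasks, d_state))
full_baseline = float(W[0] @ state)

# Ablation 1: drop state dependence -> one scalar per task.
per_task = np.zeros(n_tasks)
ablation1 = per_task[0]
# Ablation 2: drop task dependence -> one shared linear critic of the state.
w_shared = rng.normal(0, 0.1, size=d_state)
ablation2 = float(w_shared @ state)
# Ablation 3: drop both -> a single scalar, as in vanilla policy gradient.
ablation3 = 0.0

advantage = ret - full_baseline  # advantage fed into the gradient step
```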
<br />
Experiment results are shown in Figure 5a. Introducing both state and task dependence into the baseline leads to faster convergence of the model: the approach with a constant baseline achieves less than half the overall performance of the full critic after 3 million episodes. Introducing task dependence and state dependence each improves performance on its own; combining them gives the best result.<br />
<br />
Two other experiments, shown in Figure 5b, evaluate the curriculum: starting with short examples and moving to long ones, and sampling tasks in inverse proportion to their accumulated reward. Both components help; prioritization by both length and weight gives the best results.<br />
<br />
==Zero-shot and Adaptation Learning==<br />
[[File:MRL11.png|border|left|320px]]<br />
In the final experiments, the authors test the model’s ability to generalize beyond the standard training condition. They consider two tests of generalization: a zero-shot setting, in which the model is provided with a sketch for the new task and must immediately achieve good performance, and an adaptation setting, where no sketch is provided, leaving the model to learn a suitable sketch via interaction. For the zero-shot experiments, a concatenated policy is formed from the sketches of the held-out tasks and repeatedly executed (without learning) to obtain an estimate of its effectiveness. For the adaptation experiments, ordinary RL is performed over high-level actions $B$ rather than low-level actions $A$, implementing the high-level learner with the same agent architecture as described in the section 'Learning Modular Policies from Sketches'. Results are shown in the table to the left. The held-out tasks are sufficiently challenging that the baselines are unable to obtain more than negligible reward: in particular, the joint model overfits to the training tasks and cannot generalize to new sketches, while the independent model cannot discover enough of a reward signal to learn in the adaptation setting. The modular model does comparatively well: individual subpolicies succeed in novel zero-shot configurations (suggesting that they have in fact discovered the behavior suggested by the semantics of the sketch) and provide a suitable basis for adaptive discovery of new high-level policies.<br />
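The zero-shot procedure — executing subpolicies in the order given by a held-out sketch, without any learning — can be sketched as below; the environment interface and the STOP mechanism are illustrative stand-ins, not the authors' code:<br />

```python
def execute_sketch(sketch, subpolicies, env, max_steps=100):
    """Zero-shot evaluation sketch: run the learned subpolicies in the
    order given by a held-out task's sketch, with no further learning."""
    state = env.reset()
    total_reward = 0.0
    for symbol in sketch:                  # e.g. ["get wood", "use workbench"]
        policy = subpolicies[symbol]
        for _ in range(max_steps):
            action, stop = policy(state)   # a subpolicy may signal STOP
            if stop:
                break                      # hand control to the next subpolicy
            state, reward, done = env.step(action)
            total_reward += reward
            if done:
                return total_reward
    return total_reward

# Toy environment and subpolicies, purely for illustration.
class FakeEnv:
    def reset(self):
        self.s = 0
        return self.s
    def step(self, action):
        self.s += 1
        done = self.s >= 4
        return self.s, (1.0 if done else 0.0), done

subs = {"a": lambda state: (0, state >= 2),   # acts, then stops at state 2
        "b": lambda state: (0, False)}        # acts until the episode ends
r = execute_sketch(["a", "b"], subs, FakeEnv())
```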
<br />
='''Conclusion & Critique'''=<br />
The paper's contributions are:<br />
<br />
* A general paradigm for multitask, hierarchical, deep reinforcement learning guided by abstract sketches of task-specific policies.<br />
<br />
* A concrete recipe for learning from these sketches, built on a general family of modular deep policy representations and a multitask actor–critic training objective.<br />
<br />
This paper studies the problem of abstract hierarchical multitask RL with policy sketches, high-level descriptions of abstract actions. The work is related to much previous work in hierarchical RL, and adds new elements by giving neural implementations of prior work on hierarchical learning and skill representations. The authors describe an approach for multitask learning of deep policies guided by symbolic policy sketches. By associating each symbol appearing in a sketch with a modular neural subpolicy, they show that it is possible to build agents that share behavior across tasks in order to achieve success in tasks with sparse and delayed rewards. This process induces an inventory of reusable and interpretable subpolicies which can be employed for zero-shot generalization when further sketches are available, and for hierarchical reinforcement learning when they are not.<br />
<br />
Hierarchical reinforcement learning is a popular research topic, so it is interesting to compare this paper with FeUdal Networks [13] (also presented in this class), whose architecture follows a manager-and-workers style. In that work, the subpolicy is decided by a manager network: to finish a hierarchical task, each worker focuses on, and optimizes, its own subtask. The difference is that in that work the subpolicy is implicit and is learned during training. A further question for future work on this paper: could these subpolicies be learned automatically instead of being pre-defined?<br />
<br />
There are four drawbacks to the presented work. First, the ideas portrayed in this paper (for instance, symbolic specifications, actor-critic methods, shared representations) have all been explored in other works. Second, the approach relies heavily on curriculum learning, which is difficult to design and can be quite complicated depending on the task at hand. Third, there is no discussion of how a curriculum can be designed for larger problems. Finally, building a separate neural network for each subtask could lead to overly complicated networks and is not in the spirit of building an efficient structure.<br />
<br />
When the paper is considered in the context of the OpenReview discussion with the authors [https://openreview.net/forum?id=H1kjdOYlx], some interesting criticisms come to light. Many are based on the novelty of the approach: reviewers contend that the paper doesn't seem to offer anything new except a new implementation in the context of deep learning. The problem solved here can be seen as an extension of the option-learning problem with richer supervision, which makes it simple to tackle with existing RL schemes (for example, learning from natural language instructions makes the problem easier to learn), and the proposed model shares much in common with existing hierarchical RL methods. The authors maintain, however, that this form of instruction is indeed novel. Moreover, they emphasize how easy the policy sketches are to generate (noting "The extra annotation we use here literally fits in a 10-line text file.") and point to the dramatic improvements that the approach provides. They further argue that the approach is fundamentally different from the traditional NLP approaches claimed to be equivalent: those approaches are useless in the absence of natural language instructions, since the model conditions on both the state and the instructions. Experiment 1 in the paper shows the added generality of the presented approach.<br />
<br />
A more salient criticism has to do with the utility of such an approach. Even if it is taken as granted that this approach, as presented, is novel and the results are replicable, what does this mean for future work or for applications of reinforcement learning? While the authors suggest that the sketches are, in fact, straightforward to create, practitioners are less likely to desire the generation of such sketches versus providing natural language instructions, for instance.<br />
<br />
= '''Resources''' =<br />
You can find a talk on this paper [https://www.youtube.com/watch?v=NRIcDEB64x8 here].<br />
<br />
<br />
<br />
='''References'''=<br />
[1] Bacon, Pierre-Luc and Precup, Doina. The option-critic architecture. In NIPS Deep Reinforcement Learning Workshop, 2015.<br />
<br />
[2] Sutton, Richard S, Precup, Doina, and Singh, Satinder. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1):181–211, 1999.<br />
<br />
[3] Stolle, Martin and Precup, Doina. Learning options in reinforcement learning. In International Symposium on Abstraction, Reformulation, and Approximation, pp. 212– 223. Springer, 2002.<br />
<br />
[4] Konidaris, George and Barto, Andrew G. Building portable options: Skill transfer in reinforcement learning. In IJCAI, volume 7, pp. 895–900, 2007.<br />
<br />
[5] Schulman, John, Moritz, Philipp, Levine, Sergey, Jordan, Michael, and Abbeel, Pieter. Trust region policy optimization. In International Conference on Machine Learning, 2015b.<br />
<br />
[6] Greensmith, Evan, Bartlett, Peter L, and Baxter, Jonathan. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.<br />
<br />
[7] Andre, David and Russell, Stuart. Programmable reinforcement learning agents. In Advances in Neural Information Processing Systems, 2001.<br />
<br />
[8] Andre, David and Russell, Stuart. State abstraction for programmable reinforcement learning agents. In Proceedings of the Meeting of the Association for the Advancement of Artificial Intelligence, 2002.<br />
<br />
[9] Author Jacob Andreas presenting the paper - https://www.youtube.com/watch?v=NRIcDEB64x8<br />
<br />
[10] Vezhnevets, A., Mnih, V., Osindero, S., Graves, A., Vinyals, O., & Agapiou, J. (2016). Strategic attentive writer for learning macro-actions. In Advances in Neural Information Processing Systems (pp. 3486-3494).<br />
<br />
[11] Parr, Ron and Russell, Stuart. Reinforcement learning with hierarchies of machines. In Advances in Neural Information Processing Systems, 1998.<br />
<br />
[12] Marthi, Bhaskara, Lantham, David, Guestrin, Carlos, and Russell, Stuart. Concurrent hierarchical reinforcement learning. In Proceedings of the Meeting of the Association for the Advancement of Artificial Intelligence, 2004.<br />
<br />
[13] A. S. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and K. Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. arXiv preprint arXiv:1703.01161, 2017.<br />
<br />
[14] J. Andreas, M. Rohrbach, T. Darrell, D. Klein. Learning to compose neural networks for question answering. In ''Proceedings of the Annual Meeting of the North American Chapter of the Association for Computational Linguistics'', 2016.<br />
<br />
= Appendix =<br />
The authors provide a brief appendix that gives a complete list of tasks and sketches. Asterisk * indicates that the task was held out for generalization experiments in Section 4.5, but included in the multitask experiments of Sections 4.3 and 4.4.<br />
<br />
[[File: tasks.PNG]]</div>
Hierarchical Question-Image Co-Attention for Visual Question Answering (Jdeng, 2017-11-28)
<hr />
<div>__TOC__<br />
== Paper Summary ==<br />
{| class="wikitable"<br />
|-<br />
|'''Conference'''<br />
| <br />
* NIPS 2016<br />
* Presented as spotlight oral: [https://www.youtube.com/watch?v=m6t9IFdk0ms Youtube link]<br />
* 85 citations so far<br />
|-<br />
| '''Authors'''<br />
|Jiasen Lu, Jianwei Yang, Dhruv Batra, '''Devi Parikh'''<br />
|-<br />
|'''Abstract'''<br />
|''A number of recent works have proposed attention models for Visual Question Answering (VQA) that generate spatial maps highlighting image regions relevant to answering the question. In this paper, we argue that in addition to modeling "where to look" or visual attention, it is equally important to model "what words to listen to" or question attention. We present a novel co-attention model for VQA that jointly reasons about image and question attention. In addition, our model reasons about the question (and consequently the image via the co-attention mechanism) in a hierarchical fashion via a novel 1-dimensional convolution neural networks (CNN). Our model improves the state-of-the-art on the VQA dataset from 60.3% to 60.5%, and from 61.6% to 63.3% on the COCO-QA dataset. By using ResNet, the performance is further improved to 62.1% for VQA and 65.4% for COCO-QA.''<br />
|}<br />
= Introduction =<br />
'''Visual Question Answering (VQA)''' is a recent problem in computer vision and<br />
natural language processing that has garnered a large amount of interest from<br />
the deep learning, computer vision, and natural language processing communities.<br />
In VQA, an algorithm needs to answer text-based questions about images in<br />
natural language as illustrated in Figure 1.<br />
<br />
[[File:vqa-overview.png|thumb|600px|center|Figure 1: Illustration of VQA system whereby machine learning algorithm answers a visual question asked by an user for a given image (ref: http://www.visualqa.org/static/img/challenge.png)]]<br />
<br />
Recently, ''visual-attention'' based models have gained traction for VQA tasks, where the<br />
attention mechanism typically produces a spatial map highlighting image regions<br />
relevant for answering the visual question about the image. However, to correctly answer the<br />
question, the machine not only needs to understand or "attend" to<br />
regions in the image but also to parts of the question. In this paper, the authors propose a novel ''co-attention''<br />
technique that combines "where to look" or visual attention with "what words<br />
to listen to" or question attention, allowing their model to jointly reason about image and question, thus improving<br />
upon existing state-of-the-art results.<br />
<br />
== "Attention" Models ==<br />
You may skip this section if you already know about "attention" in<br />
the context of deep learning. Since this paper talks about "attention" almost<br />
everywhere, I decided to include this section to give a very informal and brief<br />
introduction to the concept of the "attention" mechanism, especially visual "attention";<br />
however, the idea extends to any other type of "attention".<br />
<br />
Visual attention in CNN is inspired by the biological visual system. As humans,<br />
we have the ability to focus our cognitive processing onto a subset of the<br />
environment that is more relevant to the given situation. Imagine, you witness<br />
a bank robbery where robbers are trying to escape on a car, as a good citizen,<br />
you will immediately focus your attention on number plate and other physical<br />
features of the car and robbers in order to give your testimony later, however, you may not remember things which otherwise interests you more. <br />
Such selective visual attention for a given context (robbery in above example) can also be implemented in<br />
traditional CNNs as well. This allows CNNs to be more robust and superior for certain tasks and it even helps algorithm designers to visualize what spatial features (regions within image) were more important than others. Attention guided<br />
deep learning is particularly very helpful for image caption and VQA tasks.<br />
<br />
== Role of Visual Attention in VQA ==<br />
This section is not a part of the actual paper being summarized; however, it gives an overview<br />
of how visual attention can be incorporated in the training of a network for VQA tasks, helping readers absorb and understand the ideas proposed in the paper more easily. Das et al. [5] provided a study of 'human attention' in Visual Question Answering (VQA) to understand where humans choose to look to answer questions about images, compared with deep models. The concept of "visual attention" has also been implemented in VQA tasks, as explored in [6].<br />
<br />
Generally, for implementing attention, the network tries to learn the conditional <br />
distribution $P_{i \in [1,n]}(L_i|c)$ representing the individual importance of the features <br />
extracted from each of the $n$ discrete locations within the image, <br />
conditioned on some context vector $c$. In other words, given $n$ features <br />
$L_i = [L_1, ..., L_n]$ from $n$ different spatial regions within the image (top-left, top-middle, top-right, and so on), <br />
the "attention" module learns a parametric function $F(c;\theta)$ that outputs an importance mapping <br />
of each of these individual features for the given context vector $c$, i.e. a discrete probability distribution <br />
of size $n$, which can be achieved by a softmax over the $n$ locations. <br />
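A minimal sketch of this soft-attention computation, assuming a simple bilinear score as the parametric function $F(c;\theta)$ (one choice among many; all weights are random placeholders):<br />

```python
import numpy as np

def soft_attention(L, c, W):
    """Soft attention over n location features L (n, d) given context c (d,).
    W is an illustrative bilinear parameter: one simple choice of F(c; theta)."""
    scores = L @ (W @ c)              # one relevance score per location
    e = np.exp(scores - scores.max())
    p = e / e.sum()                   # softmax: importance P(L_i | c)
    return p, p @ L                   # distribution and attended summary

rng = np.random.default_rng(0)
n, d = 6, 4                           # 6 spatial locations, d-dim features
L = rng.normal(size=(n, d))
c = rng.normal(size=d)                # context (e.g. question) vector
W = rng.normal(size=(d, d))
p, attended = soft_attention(L, c, W)
```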
<br />
In order to incorporate visual attention in a VQA task, one can define the context vector $c$ <br />
as a representation of the visual question asked by the user (using an RNN, perhaps an LSTM). The context $c$ can then be used to generate an attention map over the corresponding image locations (as shown in Figure 2), further improving accuracy with end-to-end training. <br />
Most of the work in the literature on visual attention in VQA tasks consists of further <br />
specializations of similar ideas.<br />
<br />
[[File:attention-vqa-general.png|thumb|700px|center|Figure 2: Different attention maps generated based on the given visual question. Regions with most "attention" or importance is whitened, machine learning model has learned to steer its attention based on the given question.]]<br />
<br />
== Motivation and Main Contributions ==<br />
So far, all attention models for VQA in literature have focused on the problem of identifying "where<br />
to look" or visual attention. In this paper, authors argue that the problem of identifying "which words to<br />
listen to" or '''question attention''' is equally important. Consider the questions "how many horses are<br />
in this image?" and "how many horses can you see in this image?". They have the same meaning,<br />
essentially captured by the first three words. A machine that attends to the first three words would<br />
arguably be more robust to linguistic variations irrelevant to the meaning and answer of the question.<br />
Motivated by this observation, in addition to reasoning about visual attention, the paper has addressed the<br />
problem of question attention. Basically, main contributions of the paper are as follows.<br />
<br />
* A novel co-attention mechanism for VQA that jointly performs question-guided visual attention and image-guided question attention.<br />
* A hierarchical architecture to represent the question, and consequently construct image-question co-attention maps at 3 different levels: word level, phrase level and question level. These co-attended features are then recursively combined from word level to question<br />
level for the final answer prediction<br />
* A novel convolution-pooling strategy at phrase level to adaptively select the phrase sizes whose representations are passed to the question-level representation.<br />
* Results on VQA and COCO-QA and ablation studies to quantify the roles of different components in the model<br />
<br />
= Related Work =<br />
<br />
The authors claim that no previous work has explored combined language-visual attention in VQA. There are now several other papers that use such a combined approach; however, chronologically the present paper appears to be the first to do so and is referenced in the following works.<br />
<br />
# Dual Attention Networks (DANs) [7]: These authors use a soft attention mechanism for the image. It computes weights for each input vector using a two layer feed forward neural network and a softmax function. For textual attention, the authors use a very similar mechanism as for visual attention. Image features are extracted using a 152-layer ResNet and bidirectional LSTMs are used to generate text features.<br />
<br />
# Multi-level Attention Networks [8]: The authors use a context-aware visual attention algorithm. A bidirectional Gate Recurrent Unit (GRU) layer is used with the feature vectors from the last layer of a CNN as input. The GRU is run in both forward and backward directions for each region of the image generating a context-aware visual representation of the image. Textual attention is achieved using deep neural network concept detector trained on COCO. A second network is trained to measure the relevance between the question and the learned concepts. Lastly, textual and visual attention is combined by computing a joint feature vector with a softmax layer to select the correct answer.<br />
<br />
Both papers present state-of-the-art results.<br />
<br />
= Method =<br />
This section is broken down into four parts: '''(i)''' notations used within the paper and also throughout this summary, '''(ii)''' hierarchical representation for a visual question, '''(iii)''' the proposed co-attention mechanism and<br />
'''(iv)''' predicting answers.<br />
<br />
== Notations ==<br />
{| class="wikitable"<br />
|-<br />
|'''Notation'''<br />
|'''Explanation'''<br />
|-<br />
|$Q = \{q_1,...q_T\}$<br />
|One-hot encoding of a visual question with $T$ words. Paper uses three different representation of visual question, one for each level of hierarchy, they are as follows: <br />
# $Q^w = \{q^w_1,...q^w_T\}$: Word level representation of visual question<br />
# $Q^p = \{q^p_1,...q^p_T\}$: Phrase level representation of visual question<br />
# $Q^s = \{q^s_1,...q^s_T\}$: Question level representation of visual question<br />
$Q^{w,p,s}$ each contain exactly $T$ embeddings (sequential data with a temporal dimension), regardless of the level in the hierarchy, i.e. word, phrase or question. <br />
|-<br />
|$V = \{v_1,..,v_N\}$<br />
|$V$ represents feature vectors from $N$ different locations within the given image. Therefore, $v_n$ is the feature vector from the image at location $n$. $V$ collectively covers the entire spatial extent of the image. One can extract these location-sensitive features from the convolutional layers of a CNN.<br />
|-<br />
|$\hat{v}^r$ and $\hat{q}^r$<br />
|The co-attention features of image and question at each level in the hierarchy, where $r \in \{w,p,s\}$. Basically, each is a sum of the columns of $Q$ or $V$ weighted by the attention $a^q$ or $a^v$ at each level of the hierarchy. <br />
For example, at the "word" level, $a^q_w$ and $a^v_w$ are probability distributions representing the importance of each word in the visual question and of each location within the image respectively, whereas $\hat{q}^w$ and $\hat{v}^w$ are the final feature vectors for the given question and image with the attention maps ($a^q_w$ and $a^v_w$) applied at the "word" level; similarly for the "phrase" and "question" levels.<br />
|}<br />
'''Note:''' Throughout the paper, $W$ represents learnable weights; biases are omitted from the equations for simplicity (the reader should assume they exist).<br />
<br />
== Question Hierarchy ==<br />
There are three levels of granularities for their hierarchical representation of a visual question: '''(i)''' word, '''(ii)''' phrase and '''(iii)''' question level. It is important to note, each level depends on the previous one, so, phrase level representations are extracted from word level and question level representations come from phrase level as depicted in Figure 4.<br />
<br />
[[File:hierarchy2.png|thumb|Figure 3: Hierarchical question encoding (source: Figure 3 (a) of original paper on page #5)]]<br />
[[File:hierarchy.PNG|thumb|Figure 4: Another figure illustrating hierarchical question encoding in details]]<br />
<br />
=== Word Level ===<br />
One-hot encodings of the question's words $Q = \{q_1,..q_T\}$ are transformed into a vector space (learned end-to-end), giving the word-level embeddings of the visual question, i.e. $Q^w = \{q^w_1,...q^w_T\}$. The paper learns this transformation end-to-end instead of using pretrained embeddings such as word2vec.<br />
<br />
=== Phrase Level ===<br />
Phrase level embedding vectors are calculated by using 1-D convolutions on the word level embedding vectors. <br />
Concretely, at each word location, the inner product of the word vectors with filters of three window sizes: unigram, bigram and trigram are computed as illustrated in Figure 4. For the ''t-th'' word, <br />
the output from convolution for window size ''s'' is given by<br />
<br />
$$<br />
\hat{q}^p_{s,t} = tanh(W_c^sq^w_{t:t+s-1}), \quad s \in \{1,2,3\}<br />
$$<br />
<br />
where $W_c^s$ are the weight parameters. The features from the three n-grams are combined using a ''maxpool'' operator to obtain the phrase-level embedding vectors.<br />
<br />
$$<br />
q_t^p = max(\hat{q}^p_{1,t}, \hat{q}^p_{2,t}, \hat{q}^p_{3,t}), \quad t \in \{1,2,...,T\}<br />
$$<br />
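The two equations above can be sketched as follows, with random placeholder filters, zero-padding so that every position $t$ has a full window (the paper does not specify its padding scheme), and an elementwise max over the three n-gram features:<br />

```python
import numpy as np

def phrase_level(Qw, Wc):
    """Qw: (T, d) word-level embeddings; Wc[s]: (d, s*d) filters for window
    size s. Returns (T, d) phrase-level embeddings q_t^p."""
    T, d = Qw.shape
    outs = []
    for s in (1, 2, 3):                                   # uni/bi/trigram
        padded = np.vstack([Qw, np.zeros((s - 1, d))])    # pad end of sequence
        conv = np.stack([np.tanh(Wc[s] @ padded[t:t + s].ravel())
                         for t in range(T)])              # (T, d) for this s
        outs.append(conv)
    return np.maximum.reduce(outs)     # elementwise max over the 3 n-grams

rng = np.random.default_rng(0)
T, d = 5, 8
Qw = rng.normal(size=(T, d))
Wc = {s: rng.normal(0, 0.1, size=(d, s * d)) for s in (1, 2, 3)}
Qp = phrase_level(Qw, Wc)              # phrase-level embeddings, shape (T, d)
```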
<br />
=== Question Level ===<br />
For question level representation, LSTM is used to encode the sequence $q_t^p$ after max-pooling. The corresponding question-level feature at time ''t'' $q_t^s$ is the <br />
LSTM hidden vector at time ''t'' $h_t$.<br />
<br />
$$<br />
\begin{align*}<br />
h_t &= LSTM(q_t^p, h_{t-1})\\<br />
q_t^s &= h_t, \quad t \in \{1,2,...,T\}<br />
\end{align*}<br />
$$<br />
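A minimal NumPy sketch of this recurrence, using a standard LSTM cell with random placeholder parameters (the gate layout is one common convention, not taken from the paper):<br />

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    """One LSTM step: h_t becomes the question-level feature q_t^s."""
    z = W @ x + U @ h + b                  # all four gate pre-activations
    d = h.shape[0]
    i = 1.0 / (1.0 + np.exp(-z[:d]))       # input gate
    f = 1.0 / (1.0 + np.exp(-z[d:2 * d]))  # forget gate
    o = 1.0 / (1.0 + np.exp(-z[2 * d:3 * d]))  # output gate
    g = np.tanh(z[3 * d:])                 # candidate cell state
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)             # q_t^s = h_t
    return h_new, c_new

rng = np.random.default_rng(0)
d_in, d_h, T = 8, 6, 5
W = rng.normal(0, 0.1, size=(4 * d_h, d_in))
U = rng.normal(0, 0.1, size=(4 * d_h, d_h))
b = np.zeros(4 * d_h)
h, c = np.zeros(d_h), np.zeros(d_h)
Qs = []
for t in range(T):                          # feed the phrase-level vectors q_t^p
    h, c = lstm_step(rng.normal(size=d_in), h, c, W, U, b)
    Qs.append(h)                            # question-level features q_t^s
```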
<br />
== Co-Attention Mechanism ==<br />
Paper has proposed two co-attention mechanisms.<br />
{| class="wikitable"<br />
|-<br />
|'''Parallel co-attention'''<br />
|Generates image and question attention simultaneously.<br />
|-<br />
|'''Alternating co-attention'''<br />
|Sequentially alternates between generating image and question attentions.<br />
|}<br />
These co-attention mechanisms are executed at all three levels of the question hierarchy, yielding $\hat{v}^r$ and $\hat{q}^r$ <br />
where $r$ is the level in the hierarchy, i.e. $r \in \{w,p,s\}$ (refer to the [[#Notations|Notations]] section).<br />
<br />
<br />
=== Parallel Co-Attention ===<br />
[[File:parallewl-coattention.png|thumb|Figure 5: Parallel co-attention mechanism (ref: Figure 2 (a) from original paper)]]<br />
Parallel co-attention attends to the image and question simultaneously as shown in Figure 5. In the paper, "affinity matrix" has been mentioned as the way to calculate the<br />
"attention" or affinity for every pair of image location and question part for each level in the hierarchy (word, phrase, and question). Remember, there are $N$ image locations and $T$ <br />
question parts, thus affinity matrix is $\mathbb{R}^{T \times N}$. Specifically, for a given image with<br />
feature map $V \in \mathbb{R}^{d \times N}$, and the question representation $Q \in \mathbb{R}^{d \times T}$, the affinity matrix $C \in \mathbb{R}^{T \times N}$<br />
is calculated by<br />
<br />
$$<br />
C = tanh(Q^TW_bV)<br />
$$<br />
<br />
where,<br />
* $W_b \in \mathbb{R}^{d \times d}$ contains the weights. <br />
<br />
After computing this affinity matrix, one possible way of<br />
computing the image (or question) attention is to simply maximize out the affinity over the locations<br />
of the other modality, i.e. $a_v[n] = \underset{i}{max}(C_{i,n})$ and $a_q[t] = \underset{j}{max}(C_{t,j})$. (The notation here is not rigorous: $a_v[n]$ is the maximum of column $n$ of $C$ taken over its rows, and $a_q[t]$ is the maximum of row $t$ taken over its columns.) Instead of choosing the max activation, the paper treats the affinity matrix as a feature and learns to predict the image and question attention <br />
maps via the following<br />
<br />
$$<br />
H_v = tanh(W_vV + (W_qQ)C), \quad H_q = tanh(W_qQ + (W_vV)C^T)\\<br />
a_v = softmax(w_{hv}^T H_v), \quad a_q = softmax(w_{hq}^T H_q)<br />
$$<br />
<br />
where,<br />
* $W_v, W_q \in \mathbb{R}^{k \times d}$, $w_{hv}, w_{hq} \in \mathbb{R}^k$ are the weight parameters. <br />
* $a_v \in \mathbb{R}^N$ and $a_q \in \mathbb{R}^T$ are the attention probabilities of each image region $v_n$ and word $q_t$ respectively. <br />
<br />
The intuition behind the above equations is that the image and question attention maps should be functions of the question and image features jointly; therefore, the authors<br />
introduce two intermediate parametric functions, $H_v$ and $H_q$, that take the affinity matrix $C$, the image features $V$, and the question features $Q$ as input. The affinity matrix $C$ <br />
transforms the question attention space into the image attention space (and vice versa for $C^T$). Based on the above attention weights, the image and question attention vectors are calculated<br />
as the weighted sum of the image features and question features, i.e.,<br />
<br />
$$\hat{v} = \sum_{n=1}^{N}{a_n^v v_n}, \quad \hat{q} = \sum_{t=1}^{T}{a_t^q q_t}$$<br />
<br />
The parallel co-attention is done at each level in the hierarchy, leading to $\hat{v}^r$ and $\hat{q}^r$ where $r \in \{w,p,s\}$. The paper does not explain why $tanh$ is used<br />
for $H_q$ and $H_v$; a plausible reason is that it allows negative contributions for unfavourable pairs of image location and question fragment. Unlike $ReLU$ or $sigmoid$, $tanh$ ranges over $[-1, 1]$, making it an appropriate choice here.<br />
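The parallel co-attention equations above can be sketched in a few lines of NumPy. This is only a toy illustration with randomly initialised stand-ins for the learned weights $W_b$, $W_v$, $W_q$, $w_{hv}$, $w_{hq}$, not the authors' implementation.<br />

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def parallel_coattention(V, Q, Wb, Wv, Wq, whv, whq):
    """Parallel co-attention for one hierarchy level.

    V: (d, N) image feature map; Q: (d, T) question representation.
    Wb: (d, d); Wv, Wq: (k, d); whv, whq: (k,) -- learned in the paper,
    random stand-ins here.
    """
    C = np.tanh(Q.T @ Wb @ V)              # (T, N) affinity matrix
    Hv = np.tanh(Wv @ V + (Wq @ Q) @ C)    # (k, N)
    Hq = np.tanh(Wq @ Q + (Wv @ V) @ C.T)  # (k, T)
    a_v = softmax(whv @ Hv)                # (N,) image attention
    a_q = softmax(whq @ Hq)                # (T,) question attention
    v_hat = V @ a_v                        # (d,) attended image vector
    q_hat = Q @ a_q                        # (d,) attended question vector
    return a_v, a_q, v_hat, q_hat

d, N, T, k = 8, 6, 5, 4
rng = np.random.default_rng(0)
a_v, a_q, v_hat, q_hat = parallel_coattention(
    rng.normal(size=(d, N)), rng.normal(size=(d, T)),
    rng.normal(size=(d, d)), rng.normal(size=(k, d)),
    rng.normal(size=(k, d)), rng.normal(size=k), rng.normal(size=k))
assert np.isclose(a_v.sum(), 1.0) and np.isclose(a_q.sum(), 1.0)
```

Here $\hat{v}$ and $\hat{q}$ are the attended vectors for a single level; the same computation is repeated at the word, phrase, and question levels.<br />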
<br />
=== Alternating Co-Attention ===<br />
[[File:alternating-coattention.png|thumb|Figure 6: Alternating co-attention mechanism (ref: Figure 2 (b) from original paper)]]<br />
In this attention mechanism, the authors sequentially alternate between generating image and question attention, as shown in Figure 6. <br />
Briefly, this consists of three steps<br />
<br />
# Summarize the question into a single vector $q$<br />
# Attend to the image based on the question summary $q$<br />
# Attend to the question based on the attended image feature.<br />
<br />
Concretely, the paper defines an attention operation $\hat{x} = \mathcal{A}(X, g)$, which takes the image (or question)<br />
features $X$ and attention guidance $g$ derived from the question (or image) as inputs, and outputs the<br />
attended image (or question) vector. The operation can be expressed in the following steps<br />
<br />
$$<br />
\begin{align*}<br />
H &= tanh(W_xX + (W_gg)𝟙^T)\\<br />
a_x &= softmax(w_{hx}^T H)\\<br />
\hat{x} &= \sum{a_i^x x_i}<br />
\end{align*}<br />
$$<br />
<br />
where,<br />
* $𝟙$ is a vector with all elements equal to 1. <br />
* $W_x, W_g \in \mathbb{R}^{k\times d}$ and $w_{hx} \in \mathbb{R}^k$ are parameters. <br />
* $a_x$ is the attention weight of feature $X$.<br />
<br />
Briefly,<br />
* At the first step of alternating co-attention, $X = Q$ and $g$ is $0$. <br />
* At the second step, $X = V$, where $V$ is the image features, and the guidance $g$ is the intermediate attended question feature $\hat{s}$ from the first step. <br />
* Finally, the attended image feature $\hat{v}$ is used as the guidance to attend to the question again, i.e., $X = Q$ and $g = \hat{v}$. <br />
<br />
Similar to the parallel co-attention, the alternating co-attention is also done at each level of the hierarchy, leading to $\hat{v}^r$ <br />
and $\hat{q}^r$ where $r \in \{w,p,s\}$.<br />
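The three alternating steps can be sketched as follows. The weights here are random stand-ins, so this is only a toy illustration of the attention operation $\hat{x} = \mathcal{A}(X, g)$ defined above.<br />

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(X, g, Wx, Wg, whx):
    """Attention op x_hat = A(X, g): X is a (d, M) feature matrix and
    g is a (d,) guidance vector (all zeros on the first step)."""
    H = np.tanh(Wx @ X + np.outer(Wg @ g, np.ones(X.shape[1])))  # (k, M)
    a = softmax(whx @ H)                                         # (M,)
    return X @ a                                                 # (d,) attended vector

# Toy example of the three alternating steps (all weights random stand-ins).
d, k, N, T = 8, 4, 6, 5
rng = np.random.default_rng(1)
V, Q = rng.normal(size=(d, N)), rng.normal(size=(d, T))
Wq, Wv = rng.normal(size=(k, d)), rng.normal(size=(k, d))
Wg, wh = rng.normal(size=(k, d)), rng.normal(size=k)

s_hat = attend(Q, np.zeros(d), Wq, Wg, wh)  # step 1: summarize question (g = 0)
v_hat = attend(V, s_hat, Wv, Wg, wh)        # step 2: attend image given s_hat
q_hat = attend(Q, v_hat, Wq, Wg, wh)        # step 3: attend question given v_hat
assert v_hat.shape == (d,) and q_hat.shape == (d,)
```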
<br />
== Encoding for Predicting Answers ==<br />
[[File:answer-encoding-for-prediction.png|thumb|Figure 7: Encoding for predicting answers (source: Figure 3 (b) of original paper on page #5)]]<br />
The paper treats predicting the final answer as a classification task. This may be surprising, since one might expect the answer to be a sequence; by using an MLP classifier, however, the answer is restricted to a single word. Co-attended image and question features from all three levels are combined for the final prediction, see Figure 7. Basically, a multi-layer perceptron (MLP) is deployed to recursively encode the attention features as follows.<br />
$$<br />
\begin{align*}<br />
h_w &= tanh(W_w(\hat{q}^w + \hat{v}^w))\\<br />
h_p &= tanh(W_p[(\hat{q}^p + \hat{v}^p), h_w])\\<br />
h_s &= tanh(W_s[(\hat{q}^s + \hat{v}^s), h_p])\\<br />
p &= softmax(W_hh_s)<br />
\end{align*}<br />
$$<br />
<br />
where <br />
* $W_w, W_p, W_s$ and $W_h$ are the weight parameters. <br />
* $[·]$ is the concatenation operation on two vectors. <br />
* $p$ is the probability of the final answer.<br />
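A minimal sketch of this recursive encoding, with random stand-in weights (the real model learns $W_w, W_p, W_s, W_h$ end-to-end):<br />

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict_answer(feats, Ww, Wp, Ws, Wh):
    """Recursively encode the co-attended features from the three levels.
    feats maps level ('w', 'p', 's') -> (v_hat, q_hat) vector pairs."""
    h_w = np.tanh(Ww @ (feats['w'][0] + feats['w'][1]))
    h_p = np.tanh(Wp @ np.concatenate([feats['p'][0] + feats['p'][1], h_w]))
    h_s = np.tanh(Ws @ np.concatenate([feats['s'][0] + feats['s'][1], h_p]))
    return softmax(Wh @ h_s)   # probability over the answer vocabulary

d, h, n_ans = 6, 4, 10
rng = np.random.default_rng(2)
feats = {r: (rng.normal(size=d), rng.normal(size=d)) for r in 'wps'}
p = predict_answer(feats,
                   rng.normal(size=(h, d)),
                   rng.normal(size=(h, d + h)),
                   rng.normal(size=(h, d + h)),
                   rng.normal(size=(n_ans, h)))
assert np.isclose(p.sum(), 1.0) and p.shape == (n_ans,)
```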
<br />
= Experiments =<br />
Evaluation of the proposed model is performed using two datasets, the VQA dataset [1] and the COCO-QA dataset [2].<br />
<br />
* '''VQA dataset''' is the largest dataset for this problem, containing human annotated questions and answers on Microsoft COCO dataset.<br />
* '''COCO-QA dataset''' is automatically generated from captions in the Microsoft COCO dataset.<br />
<br />
The proposed approach outperforms most state-of-the-art techniques, as shown in Tables 1 and 2.<br />
<br />
[[File:result-vqa.png|thumb|700px|center|Table 1: Results on the VQA dataset. “-” indicates the results is not available. (ref: Table 1 of original paper page #6)]]<br />
<br />
[[File:result-coco-qa.png|thumb|700px|center|Table 2: Results on the COCO-QA dataset. “-” indicates the results is not available (ref: Table 2 of original paper page #7)]]<br />
<br />
==Ablation Study==<br />
In this part, the authors quantify the importance of individual components in the architecture by re-training the model with components ablated. The detailed settings are listed as follows.<br />
* Image Attention alone (to verify that improvements are not the result of better optimization or better CNN features)<br />
* Question Attention alone<br />
* W/O Conv (replace convolution and pooling by stacking another word-embedding layer on top of the word-level outputs)<br />
* W/O W-Atten (replace the word-level attention with a uniform distribution)<br />
* W/O P-Atten (no phrase-level co-attention is performed, and the phrase-level attention is set to be uniform; word- and question-level co-attentions are still modeled)<br />
* W/O Q-Atten (no question-level co-attention is performed, while word- and phrase-level co-attentions are still modeled)<br />
<br />
The results of these ablation experiments are shown in Table 3. It should be noted that attention at the top of the hierarchy, i.e. the question and phrase levels, matters the most, as seen in Table 3.<br />
[[FILE: ablation.png|center|thumb|400px|Table 3: Results of ablation experiments on the VQA dataset]]<br />
<br />
Compared to the full model, the ablated models generally under-perform. Interestingly, however, in some settings the full model does not outperform the ablated model.<br />
<br />
= Qualitative Results =<br />
We now visualize some co-attention maps generated by the method in Figure 8. <br />
<br />
{|class="wikitable"<br />
|'''Word level'''<br />
|<br />
* Model attends mostly to the object regions in an image, and to objects in questions as well, e.g., heads, bird. <br />
|-<br />
|'''Phrase level'''<br />
|<br />
*Image attention has different patterns across images. <br />
** For the first two images, the attention transfers from objects to background regions. <br />
** For the third image, the attention becomes more focused on the objects. <br />
** The different attention patterns are perhaps caused by the different question types. <br />
* On the question side, their model is capable of localizing the key phrases in the question, thus essentially discovering the question types in the dataset. <br />
* For example, their model pays attention to the phrases “what color” and “how many snowboarders”. <br />
|-<br />
|'''Question level'''<br />
|<br />
* Image attention concentrates mostly on objects. <br />
* Their model successfully attends to the regions in images and phrases in the questions appropriate for answering the question, e.g., “color of the bird” and bird region.<br />
|}<br />
<br />
Because the model performs co-attention at three levels, it often captures complementary information from<br />
each level, which it then combines to predict the answer. However, it is somewhat unintuitive to visualize the phrase- and question-level attention maps applied directly to the words of the question: since phrase- and question-level features are compound features built from multiple words, their attention contribution to the individual words of the question cannot be clearly interpreted. <br />
<br />
[[File:visualization-co-attention.png|thumb|800px|center|Figure 8: Visualization of image and question co-attention maps on the COCO-QA dataset. From left to right:<br />
original image and question pairs, word level co-attention maps, phrase level co-attention maps and question<br />
level co-attention maps. For visualization, both image and question attentions are scaled (from red:high to<br />
blue:low). (ref: Figure 4 of original paper page #8)]]<br />
<br />
= Conclusion =<br />
* A hierarchical co-attention model for visual question answering is proposed. <br />
* Co-attention allows the model to attend to different regions of the image as well as different fragments of the question. <br />
* Question is hierarchically represented at three levels to capture information from different granularities. <br />
* Visualization shows model co-attends to interpretable regions of images and questions for predicting the answer. <br />
* Though their model was evaluated on visual question answering, it can be potentially applied to other tasks involving vision and language.<br />
== Critique ==<br />
* This is an intuitively appealing idea that closely resembles the way human brains tackle VQA tasks, and it could be developed further toward sequence-based answers and sentence generation. To that end, the authors could have used a more powerful, more scalable word-encoding technique such as GloVe or bag-of-words, which result in smaller-dimensional vectors, thereby opening the door to learning techniques like sentence-answer generation. Since word encoding is treated as a separate task here, bag-of-words could work; if a more temporal technique is needed, the position-encoding mechanism [3], which accounts for the position of a word in the sequence itself, could be used. This abstraction could help the model generalize better to a multitude of tasks.<br />
<br />
* The idea that image attentions and question attentions can jointly guide each other makes sense. However, if the image is complex or the question itself is too long, will such side attention be misleading? A further study could be: compared to a simple question, whether a long and complex question will influence the performance of the model.<br />
<br />
* The idea of the paper seems strong, but a 0.2% improvement over the state-of-the-art performance on the VQA dataset is not significant. It would have been good to show some incorrectly answered samples to indicate why the error was still so high. In fact, a newer paper [4] won the 2017 VQA challenge and significantly outperforms all previous methods on the VQA dataset, achieving an accuracy of 69%. That model uses external training questions/answers from the Visual Genome (VG) dataset, so it is not fair to compare the results directly, but it is interesting to see that such models can benefit from larger datasets.<br />
<br />
= References =<br />
# K. Kafle and C. Kanan, “Visual Question Answering: Datasets, Algorithms, and Future Challenges,” Computer Vision and Image Understanding, Jun. 2017.<br />
# Mengye Ren, Ryan Kiros, and Richard Zemel. Exploring models and data for image question answering. NIPS, 2015.<br />
# Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston and Rob Fergus. 2015. End-To-End Memory Networks. Advances in Neural Information Processing Systems (NIPS) 28<br />
# Damien Teney, Peter Anderson, Xiaodong He, Anton van den Hengel, Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge, CVPR 2017<br />
# A. Das, H. Agrawal, L. Zitnick, D. Parikh and D. Batra, "Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions?", Computer Vision and Image Understanding, vol. 163, pp. 90-100, 2017.<br />
# Kan Chen, Jiang Wang, Liang-Chieh Chen, Haoyuan Gao, Wei Xu, Ram Nevatia, "ABC-CNN: An Attention Based Convolutional Neural Network for Visual Question Answering", Computer Vision and Pattern Recognition, 2015.<br />
# Ha, J., Kim, J., & Nam, H. (2017). Dual Attention Networks for Multimodal Reasoning and Matching. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2156-2164.<br />
# Fu, J., Mei, T., Rui, Y., & Yu, D. (2017). Multi-level Attention Networks for Visual Question Answering. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 4187-4195.<br />
<br />
<br />
'''Implementation''': [https://github.com/jiasenlu/HieCoAttenVQA github.com/jiasenlu/HieCoAttenVQA]</div>Jdenghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F17/Conditional_Image_Generation_with_PixelCNN_Decoders&diff=31475STAT946F17/Conditional Image Generation with PixelCNN Decoders2017-11-25T03:58:26Z<p>Jdeng: /* Conditioning on Portrait Embeddings */</p>
<hr />
<div>=Introduction=<br />
This work builds on the widely used PixelCNN and PixelRNN models introduced by Oord et al. in [[#Reference|[1]]]. In that previous work, the authors observed that PixelRNN performed better than PixelCNN, but that PixelCNN was faster to train because the training process can be parallelized. In this work, Oord et al. [[#Reference|[2]]] introduce the Gated PixelCNN, a convolutional variant of the PixelRNN model based on PixelCNN. In particular, the Gated PixelCNN uses explicit probability densities to generate new images autoregressively, modelling images pixel by pixel by decomposing the joint image distribution as a product of conditionals [[#Reference|[6]]]. The Gated PixelCNN improves over the PixelCNN by removing the "blind spot" problem and, to yield better performance, replaces the ReLU units with a combination of sigmoid and tanh activations. The proposed Gated PixelCNN combines the strengths of both PixelRNN and PixelCNN: it matches the log-likelihood of PixelRNN on both CIFAR and ImageNet while retaining the faster training of PixelCNN [[#Reference|[9]]]. Moreover, the authors also introduce a conditional Gated PixelCNN variant (called Conditional PixelCNN) which can generate images conditioned on class labels, tags, or latent embeddings, yielding new image density models. These embeddings capture high-level information about an image and can be used to generate a large variety of images with similar features; for instance, by conditioning on the embedding of a single portrait, the model can generate different poses of the same person. This approach provides insight into the invariances encoded in the embeddings. Finally, the authors also present a PixelCNN auto-encoder variant which essentially replaces the deconvolutional decoder with a conditional PixelCNN.<br />
<br />
=Gated PixelCNN=<br />
Pixel-by-pixel generation is a simple generative method wherein, given an image with $n^2$ pixels, we iterate over the pixels, using the densities of the previously generated pixels to predict the "unknown" pixel $x_i$. To do this, PixelCNNs and PixelRNNs model the joint distribution $p(x)$ over an image as a product of conditional distributions. In other words, the models are autoregressive: they apply the plain chain rule to the joint distribution, as shown in Equation 1. The very first pixel is unconditioned, the second depends on the first, the third depends on the first and second, and so on; the image is modelled as a sequence of points where each pixel depends on all previous ones. Equation 1 depicts the joint distribution, where $x_i$ is a single pixel:<br />
<br />
$$p(x) = \prod\limits_{i=1}^{n^2} p(x_i | x_1, ..., x_{i-1})$$<br />
<br />
where $p(x)$ is the probability of the generated image, $n^2$ is the number of pixels, and $p(x_i | x_1, ..., x_{i-1})$ is the probability of the $i$-th pixel given the values of all previous pixels. Note that $p(x_1, ..., x_{n^2})$ is the joint probability, which by the chain rule is the product of the conditional distributions $p(x_1) \times p(x_2|x_1) \times p(x_3|x_1, x_2)$ and so on. Figure 1 provides a pictorial understanding of the joint distribution, showing that the pixels are computed pixel by pixel for every row, and each new pixel depends on the pixel values above it and to its left. <br />
<br />
[[File:xi_img.png|500px|center|thumb|Figure 1: Computing pixel-by-pixel based on joint distribution.]]<br />
<br />
Hence, for every pixel, the softmax layer at the end of the PixelCNN predicts the pixel intensity value (i.e. the most probable intensity from 0 to 255). Figure 2 [[#Reference|[7]]] illustrates how a single pixel value is predicted (generated).<br />
<br />
[[File:single_pixel.png|500px|center|thumb|Figure 2: Predicting a single pixel value based on softmax layer.]]<br />
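The sequential sampling loop implied by the chain rule can be sketched as follows. The conditional-distribution network is replaced here by a dummy uniform distribution, so this only illustrates the pixel-by-pixel generation order, not a trained model.<br />

```python
import numpy as np

def sample_image(predict_probs, n=4, rng=None):
    """Generate an n x n grayscale image pixel by pixel: each pixel is drawn
    from a 256-way softmax conditioned on all previously generated pixels.
    `predict_probs` stands in for the network's conditional p(x_i | x_<i)."""
    if rng is None:
        rng = np.random.default_rng(0)
    img = np.zeros((n, n), dtype=np.int64)
    for i in range(n):              # raster-scan order: row by row,
        for j in range(n):          # left to right within each row
            probs = predict_probs(img, i, j)      # length-256 distribution
            img[i, j] = rng.choice(256, p=probs)  # sample intensity 0..255
    return img

# A dummy conditional: uniform over all 256 intensities.
uniform = lambda img, i, j: np.full(256, 1.0 / 256)
img = sample_image(uniform)
assert img.shape == (4, 4) and img.min() >= 0 and img.max() <= 255
```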
<br />
The PixelCNN thus maps a neighborhood of pixels to a prediction for the next pixel. That is, to generate pixel $x_i$ the model can only condition on the previously generated pixels $x_1, ..., x_{i-1}$, and every conditional distribution is modelled by a convolutional neural network. For instance, consider a $5\times5$ image (with each pixel labelled by a letter, and zero-padded) and a $3\times3$ filter that slides over the image, multiplying each element and summing the results to produce a single response. We cannot use this filter directly, because when predicting pixel $a$ the network should not have access to the intensities of the future pixels $b, f, g$. To counter this issue, the authors apply a mask on top of the filter that keeps only prior pixels and zeroes out future pixels, excluding them from the computation, as depicted in Figure 3 [[#Reference|[7]]]. Hence, to ensure the CNN can only use information about pixels above and to the left of the current pixel, the filters of the convolution are masked: the model cannot read pixels below (or strictly to the right of) the current pixel to make its predictions [[#Reference|[3]]]. <br />
<br />
[[File:masking1.png|200px|center|thumb|Figure 3: Masked convolution for a $3\times3$ filter.]]<br />
[[File:masking2.png|500px|center|thumb|Figure 4: Masked convolution for each convolution layer.]]<br />
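Such a mask can be constructed as follows. This is a sketch: by convention, "A"-type masks (used in the first layer) also hide the centre pixel so the current pixel cannot see itself, while "B"-type masks (used in later layers) keep it.<br />

```python
import numpy as np

def causal_mask(n, mask_type='A'):
    """Binary mask for an n x n PixelCNN filter: zero out the pixels below
    and strictly to the right of the centre. A type-'A' mask (first layer)
    also zeroes the centre so the current pixel cannot see itself."""
    mask = np.ones((n, n))
    c = n // 2
    mask[c, c + 1:] = 0          # pixels right of the centre, in the centre row
    mask[c + 1:, :] = 0          # all rows below the centre
    if mask_type == 'A':
        mask[c, c] = 0           # the centre pixel itself
    return mask

assert (causal_mask(3, 'A') == np.array([[1, 1, 1],
                                         [1, 0, 0],
                                         [0, 0, 0]])).all()
assert causal_mask(3, 'B')[1, 1] == 1   # type-B masks keep the centre
```

The mask is multiplied element-wise with the filter weights before each convolution, so the zeroed positions contribute nothing to the response.<br />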
<br />
Hence, for each pixel there are three colour channels (R, G, B) which are modeled successively, with B conditioned on (R, G), and G conditioned on R [[#Reference|[8]]]. This is achieved by splitting the feature maps at every layer of the network into three and adjusting the centre values of the mask tensors, as depicted in Figure 5 [[#Reference|[8]]]. The 256 possible values for each colour channel are then modelled using a softmax.<br />
<br />
[[File:rgb_filter.png|300px|right|thumb|Figure 5: RGB Masking.]]<br />
<br />
Now, from Figure 6, notice that as the filter with the mask slides across the image, pixel $f$ does not take pixels $c, d, e$ into consideration (breaking the conditional dependency) - this is where we encounter the "blind spot" problem. <br />
<br />
[[File:blindspot.gif|500px|center|thumb|Figure 6: The blindspot problem.]]<br />
<br />
It is evident that the progressive growth of the receptive field of the masked kernel over the image disregards a significant portion of the image. For instance, when using a 3x3 filter, roughly a quarter of the receptive field is covered by the "blind spot", meaning that pixel contents in that region are ignored. In order to address the blind spot, the authors use two filters (a horizontal stack and a vertical stack) in conjunction, allowing the whole receptive field to be captured, as depicted in Figure 7. In particular, the horizontal stack conditions on the current row, and the vertical stack conditions on all the rows above the current pixel. The vertical stack, which does not require any masking, allows the receptive field to grow in a rectangular fashion without any blind spot. The outputs of the two stacks are then combined at each layer. Hence, every layer in the horizontal stack takes as input the output of the previous layer as well as that of the vertical stack. Splitting the convolution into two operations enables the model to access all pixels prior to the pixel of interest. <br />
<br />
[[File:vh_stack.png|500px|center|thumb|Figure 7: Vertical and Horizontal stacks.]]<br />
<br />
=== Horizontal Stack ===<br />
For the horizontal stack (shown in purple in Figure 7), the convolution conditions only on the current row, so it has access to the pixels to the left. In essence, we use a $1 \times (n//2+1)$ convolution with a shift (pad and crop) rather than a $1\times n$ masked convolution. That is, we convolve the row with a kernel of width 2 (instead of 3), and the output is padded and cropped so that the image shape stays the same. Hence, the image is convolved with a kernel of width 2, without masks.<br />
<br />
[[File:h_mask.png|500px|center|thumb|Figure 8: Horizontal stack.]]<br />
<br />
Figure 8 [[#Reference|[3]]] shows that the last pixel from output (just before ‘Crop here’ line) does not hold information from last input sample (which is the dashed line).<br />
<br />
=== Vertical Stack ===<br />
The vertical stack (shown in blue) has access to all pixels above the current one. The vertical-stack kernel has size $(n//2+1) \times n$, and the input image is padded with an extra row at the top and bottom. After the convolution, the output is cropped so that each predicted pixel depends only on the pixels above it (i.e., the spatial dimensions are preserved) [[#Reference|[3]]]. Since the vertical filter covers only upper pixel values and no "future" pixels, no masking is needed, as no target pixel is touched. The pixels computed by the vertical stack carry information from the rows above and pass it to the horizontal stack, which eliminates the blind-spot problem.<br />
<br />
[[File:vertical_mask.gif|500px|center|thumb|Figure 9: Vertical stack.]]<br />
<br />
From Figure 9 it is evident that the image is padded (left) with kernel height zeros, then convolution operation is performed from which we crop the output so that rows are shifted by one with respect to input image. Hence, it is noticeable that the first row of output does not depend on first (real, non-padded) input row. Also, the second row of output only depends on the first input row - which is the desired behaviour.<br />
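The pad-and-crop shift can be sketched as follows. This is a simplified single-channel version using an explicit loop in place of a convolution layer; the window for output row $i$ covers only the rows strictly above $i$, matching the desired behaviour described above.<br />

```python
import numpy as np

def vertical_stack(img, kernel):
    """Vertical-stack convolution implemented as pad-and-crop: output row i
    is a weighted sum of the kh rows strictly above row i, so the receptive
    field grows downward with no blind spot and no mask tensor is needed."""
    kh, kw = kernel.shape
    H, W = img.shape
    padded = np.pad(img, ((kh, 0), (kw // 2, kw // 2)))  # zeros on top and sides
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            # padded rows [i, i+kh) map to original rows [i-kh, i): all above row i
            out[i, j] = (padded[i:i + kh, j:j + kw] * kernel).sum()
    return out

img = np.arange(16, dtype=float).reshape(4, 4)
out = vertical_stack(img, np.ones((2, 3)))
assert np.allclose(out[0], 0)                   # row 0 sees only zero padding
assert np.isclose(out[1, 1], img[0, :3].sum())  # row 1 sees only input row 0
```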
<br />
=== Gated block ===<br />
PixelRNNs are observed to perform better than the traditional PixelCNN at generating new images. This is because the spatial LSTM layers in the PixelRNN allow every layer in the network to access the entire neighbourhood of previous pixels, whereas the PixelCNN's effective neighbourhood is limited by the filter size and the depth of the convolution layers [[#Reference|[4]]]. Another advantage of the PixelRNN is that it contains multiplicative units (in the form of the LSTM gates), which may help it model more complex interactions [[#Reference|[3]]]. To bring these benefits into the newly proposed Gated PixelCNN, the authors replace the rectified linear units between the masked convolutions with the following gated activation function, depicted in Equation 2:<br />
<br />
$$y = tanh(W_{k,f} \ast x) \odot \sigma(W_{k,g} \ast x)$$<br />
<br />
where $\sigma$ is the sigmoid non-linearity, $k$ is the number of the layer, $f, g$ are the different feature maps, $\odot$ is the element-wise product and $\ast$ is the convolution with the input. This function is the key ingredient that cultivates the Gated PixelCNN model. <br />
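A sketch of this gated activation in NumPy: a single masked convolution is assumed to have produced $2p$ feature maps, which are split into the $f$ and $g$ halves before the element-wise gating.<br />

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_activation(features):
    """Gated activation unit: the 2p feature maps produced by one masked
    convolution are split in half; one half is squashed by tanh and the
    other acts as a learned sigmoid gate (an LSTM-style multiplicative unit)."""
    f, g = np.split(features, 2, axis=0)   # each half has shape (p, H, W)
    return np.tanh(f) * sigmoid(g)

x = np.random.default_rng(3).normal(size=(8, 5, 5))  # 2p = 8 feature maps
y = gated_activation(x)
assert y.shape == (4, 5, 5)
assert (np.abs(y) <= 1).all()   # tanh * sigmoid is bounded in (-1, 1)
```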
<br />
Figure 10 provides a pictorial illustration of a single layer in the Gated PixelCNN architecture, wherein the vertical stack feeds into the horizontal stack via a $1\times1$ convolution; going the other way would break the conditional distribution. In other words, the stacks are not symmetric: the vertical stack must not access any information the horizontal stack has, otherwise it would see pixels it should not, whereas the vertical stack can feed the horizontal stack, since the horizontal stack predicts the pixel following those covered by the vertical stack. In particular, the convolution operations are shown in green (and are masked), while element-wise multiplications and additions are shown in red. The convolutions with $W_f$ and $W_g$ are combined into a single (masked convolution) operation to increase parallelization, shown in blue; the resulting $2p$ feature maps are then split into two groups of $p$. Finally, the authors also use a residual connection in the horizontal stack. Moreover, the $(n \times 1)$ and $(n \times n)$ masked convolutions can also be implemented as $([n/2] \times 1)$ and $([n/2] \times n)$ convolutions followed by a shift in pixels through padding and cropping, preserving the original dimensions of the image.<br />
<br />
[[File:gated_block.png|500px|center|thumb|Figure 10: Gated block.]]<br />
<br />
In essence, PixelCNN typically consists of a stack of masked convolutional layers that takes an $N \times N \times 3$ image as input and produces $N \times N \times 3 \times 256$ (probability of pixel intensity) predictions as output. During sampling the predictions are sequential: every time a pixel is predicted, it is fed back into the network to predict the next pixel. This sequentiality is essential to generating high quality images, as it allows every pixel to depend in a highly non-linear and multimodal way on the previous pixels. <br />
<br />
Note also that the residual connections are used only in the horizontal stack. Skip connections, on the other hand, allow features from all layers to be incorporated at the very end of the network. Importantly, the skip and residual connections use different weights after the gated block.<br />
<br />
=Conditional PixelCNN=<br />
Conditioning means feeding the network some high-level information, for instance providing the network with the class labels associated with images in the MNIST/CIFAR datasets. During training, both the image and its class are fed to the network so that it learns to incorporate that information; during inference, one can specify which class the output image should belong to. Any information can be passed via conditioning; we start with class labels.<br />
<br />
For a conditional PixelCNN, a given high-level image description is represented as a latent vector $h$, which is used to model the conditional distribution $p(x|h)$, i.e., the probability that an image suits this description. The conditional PixelCNN models the following distribution:<br />
<br />
$$p(x|h) = \prod\limits_{i=1}^{n^2} p(x_i | x_1, ..., x_{i-1}, h)$$<br />
<br />
Hence, the conditional distribution now depends on the latent vector $h$, which is added to the activations before the non-linearities; the activation function after adding the latent vector becomes:<br />
<br />
$$y = tanh(W_{k,f} \ast x + V_{k,f}^T h) \odot \sigma(W_{k,g} \ast x + V_{k,g}^T h)$$<br />
<br />
Note that $h$ is multiplied by a matrix inside both the tanh and sigmoid functions; each $V$ matrix has shape [number of classes, number of filters], $k$ is the layer number, and the class is passed as a one-hot vector $h$ during both training and inference.<br />
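A sketch of this conditional variant: the one-hot vector $h$ is projected by stand-in matrices ($V_f$, $V_g$) and added as a bias that is broadcast over all spatial locations, so the conditioning is class-dependent but location-independent.<br />

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conditional_gated_activation(conv_f, conv_g, h, Vf, Vg):
    """Conditional gated activation: the one-hot class vector h adds a
    class-dependent (but location-independent) bias inside both halves.
    conv_f, conv_g: (p, H, W) masked-convolution outputs;
    Vf, Vg: (p, num_classes) projection matrices (random stand-ins here)."""
    bf = (Vf @ h)[:, None, None]   # broadcast the class bias over all locations
    bg = (Vg @ h)[:, None, None]
    return np.tanh(conv_f + bf) * sigmoid(conv_g + bg)

p, H, W, n_cls = 4, 5, 5, 10
rng = np.random.default_rng(4)
h = np.eye(n_cls)[3]               # one-hot vector for class 3
y = conditional_gated_activation(rng.normal(size=(p, H, W)),
                                 rng.normal(size=(p, H, W)),
                                 h, rng.normal(size=(p, n_cls)),
                                 rng.normal(size=(p, n_cls)))
assert y.shape == (p, H, W)
```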
<br />
If the latent vector $h$ is a one-hot encoding of the class label, this is equivalent to adding a class-dependent bias at every layer. The conditioning is thus independent of pixel location: the latent vector only encodes what the image should contain, not where the contents appear. For instance, we could specify that a certain animal or object should appear, in any of a variety of positions, poses and backgrounds.<br />
<br />
In addition, the authors also developed a variant that makes the conditional distribution dependent on the location (an application when the location of an object is important). This is achieved by mapping the latent vector $h$ to a spatial representation $s=m(h)$ (which contains the same dimension of the image but may have an arbitrary number of feature maps) with a deconvolutional neural network $m()$; this provides a location dependent bias as follows:<br />
<br />
$$y = tanh(W_{k,f} \ast x + V_{k,f} \ast s) \odot \sigma(W_{k,g} \ast x + V_{k,g} \ast s)$$<br />
<br />
where $V_{k,g}\ast s$ is an unmasked $1\times1$ convolution.<br />
<br />
=== PixelCNN Auto-Encoders ===<br />
Since conditional PixelCNNs can model images based on the distribution $p(x|h)$, it is possible to apply this idea to the image decoders used in autoencoders. Introduced by Hinton et al. in [[#Reference|[5]]], an autoencoder is a dimensionality-reduction neural network composed of two parts: an encoder, which maps the input image to a low-dimensional representation (i.e. the latent vector $h$), and a decoder, which decompresses the latent vector to reconstruct the original image. <br />
<br />
To apply the conditional PixelCNN to the autoencoder, the deconvolutional decoder is replaced with a conditional PixelCNN, and the re-architected network is trained end-to-end. The authors observe that the encoder then extracts better representations of the input data: because much of the low-level pixel statistics is handled by the PixelCNN decoder, the encoder can omit them and focus on higher-level abstract information.<br />
<br />
=Experiments=<br />
<br />
===Unconditional Modelling with Gated PixelCNN===<br />
For the first set of experiments, the authors evaluate the unconditional Gated PixelCNN model on the CIFAR-10 dataset. A comparison of the validation score between the Gated PixelCNN, PixelCNN, and PixelRNN is computed; a lower score means the optimized model generalizes better. Using the negative log-likelihood (NLL) criterion, the Gated PixelCNN obtains an NLL Test (Train) score of 3.03 (2.90), outperforming the PixelCNN, which obtains 3.14 (3.08), by 0.11 bits/dim. Beyond the numerical improvement, the samples produced by the Gated PixelCNN are of visibly better quality than those of the PixelCNN. It is important to note that the Gated PixelCNN comes close to the performance of PixelRNN, which achieves a score of 3.00 (2.93). Table 1 provides the test performance of benchmark models on CIFAR-10 in bits/dim (lower is better), with the corresponding training performance in brackets.<br />
<br />
[[File:ucond_cifar.png|500px|center|thumb|Table 1: Evaluation on the CIFAR-10 dataset for an unconditional Gated PixelCNN model.]]<br />
<br />
Another experiment on the ImageNet data is performed for image sizes $32 \times 32$ and $64 \times 64$. In particular, for a $32 \times 32$ image, the Gated PixelCNN obtains a NLL Test (Train) of 3.83 (3.77) which outperforms PixelRNN which achieves 3.86 (3.83); from which the authors observe that larger models do have better performance, however, the simpler PixelCNN does have the ability to scale better. For a $64 \times 64$ image, the Gated PixelCNN obtains 3.57 (3.48) which, yet again, outperforms PixelRNN which achieves 3.63 (3.57). The authors do mention that the Gated PixelCNN performs similarly to the PixelRNN (with row LSTM); however, Gated PixelCNN is observed to train twice as quickly at 60 hours when using 32 GPUs. The Gated PixelCNN has 20 layers (Figure 2), each of which has 384 hidden units and a filter size of 5x5. For training, a total of 200K synchronous updates were made over 32 GPUs which were computed in TensorFlow using a total batch size of 128. Table 2 illustrates the performance of benchmark models on ImageNet dataset in bits/dim (where lower is better), and the training performance in brackets.<br />
<br />
[[File:ucond_imagenet.png|500px|center|thumb|Table 2: Evaluation on ImageNet dataset for an unconditioned GatedPixelCNN model.]]<br />
<br />
<br />
===Conditioning on ImageNet Classes===<br />
For the second set of experiments, the authors evaluate the Gated PixelCNN conditioned on the classes of the ImageNet images. Using a one-hot encoding $h_i$ for the $i^{th}$ class, the distribution becomes $p(x|h_i)$; the model receives only about $\log_2(1000) \approx 10$ additional bits of information, roughly 0.003 bits per dimension for a $32 \times 32$ image. Although the log-likelihood did not show a significant improvement, the generated images are visually of much higher quality than those of the original PixelCNN. <br />
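The size of this extra information can be checked directly; a small sketch, interpreting the figure as bits per colour-channel dimension of a $32 \times 32$ RGB image (an assumption, since the text only says "per pixel"):

```python
import math

# Extra information carried by a one-hot label over 1000 ImageNet classes,
# spread across the 32 * 32 * 3 = 3072 dimensions of a small RGB image.
extra_bits_per_dim = math.log2(1000) / (32 * 32 * 3)
print(round(extra_bits_per_dim, 4))  # 0.0032
```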
<br />
Figure 11 shows samples from 8 different classes of ImageNet images, all generated by a single class-conditional model. It is evident that the Gated PixelCNN can distinguish between objects, animals, and backgrounds. The authors observe that the model can generalize and generate new renderings of the animal and object classes when the trained model is provided with approximately 1000 images.<br />
<br />
[[File:cond_imagenet.png|500px|center|thumb|Figure 11: Class-Conditional samples from the Conditional PixelCNN on the ImageNet dataset.]]<br />
<br />
<br />
===Conditioning on Portrait Embeddings===<br />
For the third set of experiments, the authors used the top layer of a CNN trained on a large database of portraits automatically cropped from Flickr images using a face detector. This network was pre-trained using a triplet loss function, which ensures similar latent embeddings for a particular face across the entire dataset. The motivation of the triplet loss function (Schroff et al.) is to ensure that an image $x^a_i$ (anchor) of a specific person is closer to all other images $x^p_i$ (positive) of the same person than it is to any image $x^n_i$ (negative) of any other person. The triplet loss function is given by<br />
\[<br />
L = \sum_{i} \left[ \| h(x^a_i) - h(x^p_i) \|^2_2 - \| h(x^a_i) - h(x^n_i) \|^2_2 + \alpha \right]_+<br />
\]<br />
where $h$ is the embedding of the image $x$, and $\alpha$ is a margin that is enforced between positive and negative pairs.<br />
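A minimal numpy sketch of this triplet loss; the function name and the toy embeddings are illustrative, not from the paper:

```python
import numpy as np

def triplet_loss(h_anchor, h_positive, h_negative, alpha=0.2):
    """Triplet loss over batches of embeddings (Schroff et al. style).

    Each argument is an (n, d) array of embeddings; alpha is the margin.
    """
    pos_dist = np.sum((h_anchor - h_positive) ** 2, axis=1)
    neg_dist = np.sum((h_anchor - h_negative) ** 2, axis=1)
    # The hinge [.]_+ keeps only violations of the margin.
    return np.sum(np.maximum(pos_dist - neg_dist + alpha, 0.0))

# Toy embeddings: the positive is closer to the anchor than the negative
# by more than the margin, so the loss is zero.
a = np.array([[0.0, 0.0]])
p = np.array([[0.1, 0.0]])
n = np.array([[1.0, 1.0]])
print(triplet_loss(a, p, n))  # 0.0
```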
<br />
In essence, the authors took the latent vectors from this supervised pre-trained network, forming (image=$x$, embedding=$h$) tuples, and trained the<br />
Conditional PixelCNN on the latent embeddings to model the distribution $p(x|h)$. Hence, if the network is provided with a face that is not in the training set, the model can compute the latent embedding $h=f(x)$ and generate new portraits of the same person. Figure 12 provides a pictorial example of this pipeline, where it is evident that the generative model can produce a variety of images, independent of pose and lighting conditions, by extracting the latent embeddings from the pre-trained network. <br />
<br />
[[File:cond_portrait.png|500px|center|thumb|Figure 12: Input image is to the left, whereas the portraits to the right are generated from the high-level latent representation.]]<br />
<br />
===PixelCNN Auto Encoder===<br />
For the final set of experiments, the authors explore the possibility of training a Gated PixelCNN within an autoencoder architecture. They start by training a PixelCNN auto-encoder on $32 \times 32$ ImageNet patches and compare its results to a convolutional autoencoder optimized using mean squared error. It is important to note that both models use a 10- or 100-dimensional bottleneck. <br />
<br />
Figure 13 provides reconstructions from both models. It is evident that the latent embedding produced by the PixelCNN autoencoder is quite different from that of the convolutional autoencoder. For instance, in the last row, the PixelCNN autoencoder is able to generate similar-looking indoor scenes with people without directly trying to "reconstruct" the input, as the convolutional autoencoder does.<br />
<br />
[[File:pixelauto.png|500px|center|thumb|Figure 13: From left to right: original input image, reconstruction by an autoencoder trained with MSE, conditional samples from a PixelCNN as the deconvolution to the autoencoder. It is important to note that both these autoencoders were trained end-to-end with 10 and 100-dimensional bottleneck values.]]<br />
<br />
<br />
=Conclusion=<br />
This work introduced the Gated PixelCNN, an improvement over the original PixelCNN. In addition to being more computationally efficient, the Gated PixelCNN is able to match, and in some cases outperform, the PixelRNN. To deal with the "blind spots" in the receptive fields of the PixelCNN, the newly proposed Gated PixelCNN uses two CNN stacks (horizontal and vertical filters). Moreover, the authors replace the ReLU activations with gated units combining tanh and sigmoid functions, because these multiplicative units help model more complex interactions. The proposed network obtains performance similar to the PixelRNN on CIFAR-10; however, it is now state-of-the-art on the ImageNet $32 \times 32$ and $64 \times 64$ datasets. <br />
<br />
In addition, the conditional PixelCNN is explored on natural images in three different settings. With class-conditional generation, a single model is shown to generate diverse and realistic-looking images corresponding to different classes. For human portraits, the model is able to generate new images of the same person in different poses and lighting conditions given a single source image. Finally, the authors show that the PixelCNN can be used as the image decoder in an autoencoder. Although the log-likelihood is quite similar to that reported in the literature, the samples generated by the PixelCNN autoencoder are of high visual quality, showing natural variations of objects and lighting conditions.<br />
<br />
<br />
=Summary=<br />
$\bullet$ Improved PixelCNN called Gated PixelCNN<br />
# Similar performance as PixelRNN, and quick to compute like PixelCNN (since it is easier to parallelize)<br />
# Fixed the "blind spot" problem by introducing 2 stacks (horizontal and vertical)<br />
# Gated activation units which now use sigmoid and tanh instead of ReLU units<br />
<br />
$\bullet$ Conditioned Image Generation<br />
# One-shot conditioned on class-label<br />
# Conditioned on portrait embedding<br />
# PixelCNN AutoEncoders<br />
<br />
=Critique=<br />
# The paper is not very descriptive, and does not explain well how the horizontal and vertical stacks solve the "blind spot" problem. In addition, the authors mention the "gated block" and how they designed it, but do not explain the intuition behind it or how this approach improves over the PixelCNN <br />
# The authors do not provide a good pictorial representation of any of the aforementioned novelties<br />
# The description of the PixelCNN AutoEncoder is not detailed enough <br />
<br />
=Reference=<br />
# Aaron van den Oord et al., "Pixel Recurrent Neural Networks", ICML 2016<br />
# Aaron van den Oord et al., "Conditional Image Generation with PixelCNN Decoders", NIPS 2016<br />
# S. Turukin, "Gated PixelCNN", Sergeiturukin.com, 2017. [Online]. Available: http://sergeiturukin.com/2017/02/24/gated-pixelcnn.html. [Accessed: 15- Nov- 2017].<br />
# S. Reed, A. van den Oord, N. Kalchbrenner, V. Bapst, M. Botvinick and N. de Freitas, "Generating interpretable images with controllable structure", 2016.<br />
# G. Hinton, "Reducing the Dimensionality of Data with Neural Networks", Science, vol. 313, no. 5786, pp. 504-507, 2006.<br />
# "Conditional Image Generation with PixelCNN Decoders", Slideshare.net, 2017. [Online]. Available: https://www.slideshare.net/suga93/conditional-image-generation-with-pixelcnn-decoders. [Accessed: 18- Nov- 2017].<br />
# "Gated PixelCNN", Kawahara.ca, 2017. [Online]. Available: http://kawahara.ca/conditional-image-generation-with-pixelcnn-decoders-slides/gated-pixelcnn/. [Accessed: 17- Nov- 2017].<br />
# K. Dhandhania, "PixelCNN + PixelRNN + PixelCNN 2.0 — Commonlounge", Commonlounge.com, 2017. [Online]. Available: https://www.commonlounge.com/discussion/99e291af08e2427b9d961d41bb12c83b. [Accessed: 15- Nov- 2017].<br />
# S. Turukin, "PixelCNN", Sergeiturukin.com, 2017. [Online]. Available: http://sergeiturukin.com/2017/02/22/pixelcnn.html. [Accessed: 17- Nov- 2017].</div>Jdenghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Unsupervised_Domain_Adaptation_with_Residual_Transfer_Networks&diff=31332Unsupervised Domain Adaptation with Residual Transfer Networks2017-11-23T21:26:27Z<p>Jdeng: /* Critique */</p>
<hr />
<div>== Introduction ==<br />
'''Domain Adaptation''' [https://en.wikipedia.org/wiki/Domain_adaptation] is a problem in machine learning which involves taking a model trained on a source domain and applying it to a different (but related) target domain. '''Unsupervised domain adaptation''' refers to the situation in which the source data is labeled, while the target data is (predominantly) unlabeled. This scenario arises when we aim to learn, from a source data distribution, a well-performing model on a different (but related) target data distribution. For instance, one task in the common spam filtering problem consists of adapting a model from one user (the source distribution) to a new one who receives significantly different emails (the target distribution). Note that when more than one source distribution is available, the problem is referred to as multi-source domain adaptation. The problem at hand is then finding ways to generalize the learning on the source domain to the target domain. In the age of deep networks, this problem has become particularly salient due to the vast amounts of labeled training data needed to reap the benefits of deep learning. Manual generation of labeled data is often prohibitive, and in the absence of such data, networks are rarely performant. Circumventing this drought of data typically necessitates gathering "off-the-shelf" data sets, which are tangentially related and contain labels, and then building models in these domains. The fundamental issue that unsupervised domain adaptation attempts to address is overcoming the inherent shift in distribution across the domains, without the ability to observe this shift directly. The goal of this paper is to simultaneously learn adaptive classifiers and transferable features from labeled data in the source domain and unlabeled data in the target domain by embedding the adaptations of both classifiers and features in a unified deep architecture.<br />
<br />
This paper proposes a method for unsupervised domain adaptation which relies on three key components: <br />
# A kernel-based penalty to ensure that the abstract representations generated by the networks hidden layers are similar between the source and the target data; <br />
# An entropy based penalty on the target classifier, which exploits the entropy minimization principle; and <br />
# A residual network structure is appended, which allows the source and target classifiers to differ by a (learned) residual function, thus relaxing the shared classifier assumption which is traditionally made.<br />
<br />
This method outperforms state-of-the-art techniques on common benchmark datasets and is flexible enough to be applied in most feed-forward neural networks.<br />
<br />
[[File:Source-and-Target-Domain-Office-31-Backpack.png|thumb|right|The Office-31 Dataset Images for Backpack. Shows the variation in the source and target domains to motivate why these methods are important.]] <br />
=== Working Example (Office-31) === <br />
In order to assist in the understanding of the methods, it is helpful to have a tangible sense of the problem front of mind. The Domain Adaptation Project [https://people.eecs.berkeley.edu/~jhoffman/domainadapt/] provides data sets which are tailored to the problem of unsupervised domain adaptation. One of these datasets (which is later used in the experiments of this paper) has images which are labeled based on the Amazon product page for the various items. There are then corresponding pictures taken either by webcams or digital SLR cameras. The goal of unsupervised domain adaptation on this data set would be to take any of the three image sources as the source domain, and transfer a classifier to the other domain; see the example images to understand the differences.<br />
<br />
One can imagine that, while it is likely easy to scrape labeled images from Amazon, it is likely far more difficult to collect labeled images from webcam or DSLR pictures directly. The ultimate goal of this method would be to train a model to recognize a picture of a backpack taken with a webcam, based on images of backpacks scraped from Amazon (or similar tasks).<br />
<br />
== Related Work ==<br />
Broadly speaking, the problem of domain adaptation mitigates manual labeling of data in areas such as machine learning, computer vision, and natural language processing. The general goal of domain adaptation is to reduce the discrepancy in probability distributions between the source and target domains.<br />
<br />
Research into the use of Deep Neural Networks for the purpose of domain adaptation has suggested that, while networks learn abstract feature representations which can reduce the discrepancy across domains, it is not possible to wholly remove it [http://www.icml-2011.org/papers/342_icmlpaper.pdf], [https://arxiv.org/pdf/1412.3474.pdf]. Further work has been done to design networks which adapt traditional deep nets (typically CNNs) to specifically address the problems posed by domain adaptation, these methods all only address the issue of feature adaptation [https://arxiv.org/pdf/1502.02791.pdf], [https://arxiv.org/pdf/1409.7495.pdf], [https://people.eecs.berkeley.edu/~jhoffman/papers/Tzeng_ICCV2015.pdf]. That is, they all assume that the target and source classifiers are shared between domains. <br />
<br />
The authors drew particular motivation from He et al. [https://arxiv.org/abs/1512.03385] with the proposed structure of residual networks. Combining the insights from the ResNet architecture, in addition to previous work that had leveraged classifier adaptation (in the context where some target data is labeled) [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.130.8224&rep=rep1&type=pdf], [http://www.machinelearning.org/archive/icml2009/papers/445.pdf], [http://ieeexplore.ieee.org/document/5539870/] the authors develop their proposed network.<br />
<br />
== Residual Transfer Networks ==<br />
The challenge of unsupervised domain adaptation arises in that the target domain has no labeled data, while the source classifier $f_s$ trained on source domain cannot be directly applied to the target domain due to the distribution discrepancy. Thus, a joint adaptation of features and classifiers can be used to enable effective domain adaptation. This paper presents an end-to-end deep learning framework for classifier adaptation which is harder in the sense that the target domain is fully unlabeled. The authors also propose a method for feature adaptation using Maximum Mean Discrepancy (MMD).<br />
<br />
Generally, in an unsupervised domain adaptation problem, we are dealing with a set $\mathcal{D}_s$ (called the source domain) which is defined by $\{(x_i^s, y_i^s)\}_{i=1}^{n_s}$. That is the set of all labeled input-output pairs in our source data set. We denote the number of source elements by $n_s$. There is a corresponding set $\mathcal{D}_t = \{(x_i^t)\}_{i=1}^{n_t}$ (the target domain), consisting of unlabeled input values. There are $n_t$ such values. <br />
[[File:RTN-Structure.png|thumb|left|upright|The overarching structure of the RTN. Consists of an existing network, to which a bottleneck, MMD block, and residual block is appended.]]<br />
We can think of $\mathcal{D}_s$ as being sampled from some underlying distribution $p$, and $\mathcal{D}_t$ as being sampled from $q$. Generally, we have that $p \neq q$, partially motivating the need for domain adaptation methods. <br />
<br />
We can consider the classifiers $f_s(\underline{x})$ and $f_t(\underline{x})$, for the source domain and target domain respectively. It is possible to learn $f_s$ based on the sample $\mathcal{D}_s$. Under the '''shared classifier assumption''' it would be the case that $f_s(\underline{x}) = f_t(\underline{x})$, and thus learning the source classifier is enough. This method relaxes this assumption, assuming that in general $f_s \neq f_t$, and attempting to learn both.<br />
<br />
The example network extends deep convolutional networks (in this case AlexNet [http://vision.stanford.edu/teaching/cs231b_spring1415/slides/alexnet_tugce_kyunghee.pdf]) to '''Residual Transfer Networks''', the mechanics of which are outlined below. Recall that, if $L(\cdot, \cdot)$ is taken to be the cross-entropy loss function, then the empirical error of a CNN on the source domain $\mathcal{D}_s$ is given by:<br />
<br />
<center><br />
<math display="block"><br />
\min_{f_s} \frac{1}{n_s} \sum_{i=1}^{n_s} L(f_s(x_i^s), y_i^s)<br />
</math> <br />
</center><br />
<br />
In a standard implementation, the CNN optimizes over the above loss. This will be the starting point for the RTN.<br />
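A minimal numpy sketch of this empirical source risk, i.e. the mean cross-entropy of $f_s$ on the labeled source sample (function and variable names are illustrative):

```python
import numpy as np

def source_risk(logits, labels):
    """Empirical source error: mean cross-entropy of f_s on labeled data.

    logits: (n_s, c) pre-softmax outputs; labels: (n_s,) integer classes.
    """
    # Numerically stable log-softmax.
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    # Cross-entropy picks out the log-probability of the true class.
    return -log_probs[np.arange(len(labels)), labels].mean()

logits = np.array([[2.0, 0.0], [0.0, 3.0]])  # two confident, correct outputs
labels = np.array([0, 1])
print(source_risk(logits, labels))  # small positive loss
```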
<br />
=== Structural Overview ===<br />
The model proposed in this paper extends existing CNN's and alters the loss function that is optimized over. While each of these components is discussed in depth below, the overarching architecture involves four components:<br />
<br />
# An existing deep model. While this can be any model, in theory, the authors leverage AlexNet in practice.<br />
# A bottleneck layer used to reduce the dimensionality of the learned abstract feature space, directly after the existing network.<br />
# An MMD block, with the expressed intention of feature adaptation.<br />
# A residual block, with the expressed intention of classifier adaptation. <br />
<br />
This structure is then optimized for a loss function which combines the standard cross-entropy penalty with MMD and target entropy penalties, yielding the proposed Residual Transfer Network (RTN) structure.<br />
<br />
=== Feature Adaptation ===<br />
Feature adaptation refers to the process by which the features learned to represent the source domain are made applicable to the target domain. Broadly speaking, a CNN works to generate abstract feature representations of the distribution the inputs are sampled from. It has been found that using these deep features can reduce, but not remove, cross-domain distribution discrepancy, hence the need for feature adaptation. It is important to note that CNNs transition from general to specific features as the network gets deeper. In this light, the discrepancy between the feature representations of the source and the target will grow through a deeper convolutional net, so a technique for forcing these distributions to be similar is needed.<br />
<br />
In particular, the authors of this paper perform feature adaptation by matching the feature distributions of multiple layers $l \in \mathcal{L}$ across domains. They impose a bottleneck layer (call it $fc_b$) which is included after the final convolutional layer of AlexNet. This dense layer is connected to an additional dense layer $fc_c$ (which will serve as the target classification layer). They then compute the tensor product between the activations of the layers, performing "lossless multi-layer feature fusion". That is, for the source domain they define $z_i^s \overset{\underset{\mathrm{def}}{}}{=} x_i^{s,fc_b}\otimes x_i^{s,fc_c}$ and for the target domain, $z_i^t \overset{\underset{\mathrm{def}}{}}{=} x_i^{t,fc_b}\otimes x_i^{t,fc_c}$. Feature fusion is the process of combining two feature vectors to obtain a single feature vector which is more discriminative than any of the input feature vectors. The authors then employ feature adaptation by means of Maximum Mean Discrepancy, between the source and target domains, on these fusion features.<br />
<br />
[[File:RTN-MMD-Block.png|right|thumb|The Maximum Mean Discrepancy Block (MMD) included in the RTN. The outputs of $fc_b$ and $fc_c$ are fused through a tensor product, and then passed through the MMD penalty, ensuring distributional similarity.]]<br />
<br />
==== Maximum Mean Discrepancy ==== <br />
In unsupervised learning, we are given independent samples $x_i$ from some underlying data distribution $P$, and the goal is to come up with an approximate distribution $Q$ that is as close to $P$ as possible, using only the samples $x_i$. Often, $Q$ is chosen from a parametric family of distributions $\{Q(\cdot; \theta) : \theta \in \Theta\}$, and the goal is to find the optimal parameters $\theta^*$ so that the distribution $P$ is best approximated. The central issue of unsupervised learning is choosing an appropriate objective function $l(\theta, P)$ that appropriately measures the quality of our approximation, and which is tractable to compute and optimise when we are working with complicated, deep models. MMD is one such loss function, allowing $P$ and $Q$ to be matched in unsupervised settings; it is also applicable in the supervised domain adaptation scenario. <br />
<br />
<br />
Maximum mean discrepancy (MMD) was originally proposed by the [http://dl.acm.org/citation.cfm?id=1859890.1859901 kernel machines community] as a nonparametric way to measure dissimilarity between two probability distributions. MMD is a kernel method that involves mapping to a Reproducing Kernel Hilbert Space (RKHS) [https://en.wikipedia.org/wiki/Reproducing_kernel_Hilbert_space]. Denote the RKHS $\mathcal{H}_K$ with a characteristic kernel $K$. We then define the '''mean embedding''' of a distribution $p$ in $\mathcal{H}_K$ to be the unique element $\mu_K(p)$ such that $\mathbf{E}_{x\sim p}f(x) = \langle f(x), \mu_K(p)\rangle_{\mathcal{H}_K}$ for all $f \in \mathcal{H}_K$. Now, if we take $\phi: \mathcal{X} \to \mathcal{H}_K$, then we can define the MMD between two distributions $p$ and $q$ as follows:<br />
<br />
<center><br />
<math display="block"><br />
d_k(p, q) \overset{\underset{\mathrm{def}}{}}{=} ||\mathbf{E}_{x\sim p}(\phi(x^s)) - \mathbf{E}_{x\sim q}(\phi(x^t))||_{\mathcal{H}_K}<br />
</math><br />
</center><br />
<br />
Effectively, the MMD will compute the self-similarity of $p$ and $q$, and subtract twice the cross-similarity between the distributions: $\widehat{\text{MMD}}^2 = \text{mean}(K_{pp}) + \text{mean}(K_{qq}) - 2\times\text{mean}(K_{pq})$. From here we can infer that $p$ and $q$ are equivalent distributions if and only if the $\text{MMD} = 0$. If we then wish to force two distributions to be similar, this becomes a minimization problem over the MMD.<br />
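A minimal numpy sketch of this biased squared-MMD estimate with a Gaussian kernel; the bandwidth, sample sizes, and function names are arbitrary illustrative choices:

```python
import numpy as np

def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of x and the rows of y."""
    sq_dists = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / bandwidth)

def mmd2(xs, xt, bandwidth=1.0):
    """Biased squared MMD: mean(Kpp) + mean(Kqq) - 2 * mean(Kpq)."""
    return (gaussian_kernel(xs, xs, bandwidth).mean()
            + gaussian_kernel(xt, xt, bandwidth).mean()
            - 2.0 * gaussian_kernel(xs, xt, bandwidth).mean())

rng = np.random.default_rng(0)
same = mmd2(rng.normal(0, 1, (200, 2)), rng.normal(0, 1, (200, 2)))
shifted = mmd2(rng.normal(0, 1, (200, 2)), rng.normal(3, 1, (200, 2)))
print(same < shifted)  # True: matching distributions give a much smaller MMD
```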
<br />
MMD is very similar to the adversarial loss in many ways, discussed in more detail in [http://www.inference.vc/another-favourite-machine-learning-paper-adversarial-networks-vs-kernel-scoring-rules/ this blog post]; however, one striking difference is that the maximisation in MMD can be carried out analytically by applying the kernel trick, and we obtain the following expression:<br />
<br />
<center><br />
$$<br />
MMD_k(Q,P) = \mathbf{E}_{x,x' \sim P}\, k(x,x') + \mathbf{E}_{x,x' \sim Q}\, k(x,x') - 2\, \mathbf{E}_{x \sim P,\, x' \sim Q}\, k(x,x')<br />
$$<br />
</center><br />
<br />
In the expression above<br />
<br />
* $\mathbf{E}_{x,x' \sim P}\, k(x,x')$ is constant with respect to $Q$, so we can just drop it from the objective.<br />
* The second term $\mathbf{E}_{x,x' \sim Q}\, k(x,x')$ can be interpreted as an entropy term: minimising this will force $Q$ to be spread out, rather than concentrate on a small set of points<br />
* The third term $\mathbf{E}_{x \sim P,\, x' \sim Q}\, k(x,x')$ ensures that samples from $Q$ are on average close to samples from $P$<br />
<br />
Two important notes:<br />
# The RKHS, and as such MMD, depend on the choice of the kernel;<br />
# Computing the MMD efficiently requires an unbiased estimate of the MMD (as outlined [https://arxiv.org/pdf/1502.02791.pdf]).<br />
<br />
==== MMD for Feature Adaptation in the RTN ====<br />
The authors wish to minimize the MMD between the fusion features outlined above derived from the source and target domains. Concretely this amounts to forcing the distribution of the abstract representation of the source domain $\mathcal{D}_s$ to be similar to the distribution of the abstract representation of the target domain $\mathcal{D}_t$. Performing this optimization over the fused features between the $fb_b$ and $fb_c$ forces each of those layers towards similar distributions.<br />
<br />
Practically this involves an additional penalty function given by the following:<br />
<br />
<center><br />
<math display="block"><br />
D_{\mathcal{L}}(\mathcal{D}_s, \mathcal{D}_t) = \sum_{i,j=1}^{n_s} \frac{k(z_i^s, z_j^s)}{n_s^2} + \sum_{i,j=1}^{n_t} \frac{k(z_i^t, z_j^t)}{n_t^2} -2 \sum_{i=1}^{n_s}\sum_{j=1}^{n_t} \frac{k(z_i^s, z_j^t)}{n_sn_t} <br />
</math><br />
</center><br />
<br />
Where the characteristic kernel $k(z, z')$ is the Gaussian kernel, defined on the vectorization of tensors, with bandwidth parameter $b$. That is: $k(z, z') = \exp(-||vec(z) - vec(z')||^2/b)$.<br />
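A rough numpy sketch of the fusion features and the Gaussian-kernel penalty $D_{\mathcal{L}}$ above; the layer widths, sample sizes, and helper names are hypothetical, and for clarity this uses the straightforward biased estimator rather than an efficient unbiased one:

```python
import numpy as np

def fusion_features(fb, fc):
    """Tensor product of bottleneck (fc_b) and classifier (fc_c)
    activations, vectorised: one row z_i = vec(fb_i (x) fc_i) per example."""
    return np.einsum('ni,nj->nij', fb, fc).reshape(len(fb), -1)

def mmd_penalty(zs, zt, b=1.0):
    """D_L(Ds, Dt) with Gaussian kernel k(z, z') = exp(-||z - z'||^2 / b)."""
    def k(a, c):
        return np.exp(-((a[:, None, :] - c[None, :, :]) ** 2).sum(-1) / b)
    return k(zs, zs).mean() + k(zt, zt).mean() - 2.0 * k(zs, zt).mean()

# Hypothetical activations from the two layers for source and target data.
rng = np.random.default_rng(1)
zs = fusion_features(rng.normal(size=(50, 4)), rng.normal(size=(50, 3)))
zt = fusion_features(rng.normal(size=(50, 4)), rng.normal(size=(50, 3)))
print(mmd_penalty(zs, zt) >= 0.0)  # True: the biased estimate is nonnegative
```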
<br />
=== Classifier Adaptation ===<br />
In traditional unsupervised domain adaptation, a '''shared-classifier assumption''' is made. In essence, if $f_s(x)$ represents the classifier on the source domain, and $f_t(x)$ represents the classifier on the target domain, then this assumption simply states that $f_s = f_t$. While this may seem reasonable at first glance, it is an assumption that is incredibly difficult to check. If it could be readily confirmed that the source and target classifiers could be shared, then the problem of domain adaptation would be largely trivialized. Instead, the authors relax this assumption slightly: they postulate that the source and target classifiers differ by some perturbation function $\Delta f$. The general idea is to assume $f_S(x) = f_T(x) + \Delta f(x)$, where $f_S$ and $f_T$ correspond to the source and target classifiers, pre-activation, and $\Delta f(x)$ is a residual function that depends on both the target classifier $f_T(x)$ (due to the functional dependency) and the source classifier $f_S(x)$ (due to the back-propagation pipeline). The authors argue that the perturbation function $\Delta f(x)$ can be learned jointly from the labeled source data and the unlabeled target data.<br />
<br />
The authors then suggest using residual blocks, as popularized by the ResNet framework [https://arxiv.org/pdf/1512.03385.pdf], to learn this residual function.<br />
<br />
[[File:Residual-Block-vs-DNN.png|thumb|left|A comparison of a standard Deep Neural Network block which is designed to fit a function H(x) compared to a residual block which fits H(x) as the sum of the input, x, and a learned residual function, F(X).]]<br />
==== Residual Networks Framework ==== <br />
A (Deep) Residual Network, as initially proposed in ResNet, employs residual blocks to assist the learning process; these blocks were a key component in making it possible to train extraordinarily deep networks. A Residual Network is constructed largely in the same manner as a standard neural network, with one key difference, namely the inclusion of residual blocks - sets of layers which aim to estimate a residual function in place of estimating the function itself. <br />
<br />
That is, if we wish to use a DNN to estimate some function $h(x)$, a residual block will decompose this to $h(x) = F(x) + x$. The layers are then used to learn $F(x)$, and after the layers which aim to learn this residual function, the input $x$ is recombined through element-wise addition, to form $h(x) = F(x) + x$. This was initially proposed as a manner to allow for deeper networks to be effectively trained but has since used in novel contexts.<br />
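A minimal numpy forward pass illustrating $h(x) = F(x) + x$; the weights, sizes, and names are arbitrary illustrative choices:

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def residual_block(x, w1, w2):
    """Forward pass of h(x) = F(x) + x with a two-layer residual F."""
    f = relu(x @ w1) @ w2   # the learned residual function F(x)
    return f + x            # the skip connection recombines the input

# With zero weights the residual F vanishes and the block is the identity,
# which is part of what makes very deep residual networks easy to train.
x = np.array([[1.0, -2.0, 3.0]])
w1 = np.zeros((3, 4))
w2 = np.zeros((4, 3))
print(np.allclose(residual_block(x, w1, w2), x))  # True
```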
<br />
==== Residual Blocks in the RTN ====<br />
[[File:RTN-Residual-Block.png|thumb|right|The Structure of the Residual Block in the RTN framework. The block relies on two additional dense layers following the target classifier in an attempt to learn the residual difference between the source and target classifiers.]] The authors leverage residual blocks for the purpose of classifier adaptation. Operating under the assumption that the source and target classifiers differ by a perturbation function $\Delta f(x)$, the authors add an additional set of densely connected layers through which the source data flows. In particular, the authors take the $fc_c$ layer above as the desired target classifier. For the source data, an additional set of layers ($fc-1$ and $fc-2$) is added following $fc_c$, connected as a residual block. The output of the classifier layer is then added back to the output of the residual block in order to form the source classifier.<br />
<br />
It is necessary to note that in this case, the output from $fc_c$ passes the non-activated (i.e. pre-softmax activation) to the element-wise addition, the result of which is passed through the activation layer, yielding the source prediction. In the provided diagram, we have that $f_S(x)$ represents the non-activated output from the additive layer in the residual block; $f_T(x)$ represents the non-activated output from the target classifier; and $fc-1$/$fc-2$ are used to learn the perturbation function $\Delta f(x)$.<br />
<br />
==== Entropy Minimization ====<br />
In addition to the residual blocks, the authors make use of the '''entropy minimization principle''' [http://www.iro.umontreal.ca/~lisa/pointeurs/semi-supervised-entropy-nips2004.pdf] to further refine the classifier adaptation. In particular, by minimizing the entropy of the target classifier (or more correctly, the entropy of the class conditional distribution $f_j^t(x_i^t) = p(y_i^t = j \mid x_i^t; f_t)$), low-density separation between the classes is encouraged. '''Low-Density Separation''' is a concept used predominantly in semi-supervised learning, which in essence tries to draw class decision boundaries in regions where there are few data points (labeled or unlabeled). The above paper leverages an entropy regularization scheme to achieve the goal low-density separation goal; this is adopted here to the case of unsupervised domain adaptation.<br />
<br />
In practice this amounts to adding a further penalty based on the entropy of the class conditional distribution. In particular, if $H(\cdot)$ is defined to be the entropy function, such that $H(f_t(x_i^t)) = - \sum_{j=1}^c f_j^t(x_i^t)\log f_j^t(x_i^t)$, where $c$ is the number of classes and $f_j^t(x_i^t)$ represents the probability of predicting class $j$ for point $x_i^t$, then over the target domain $\mathcal{D}_t$ we define the entropy penalty to be:<br />
<br />
<center><br />
<math display="block"><br />
\frac{1}{n_t} \sum_{i=1}^{n_t} H(f_t(x_i^t))<br />
</math><br />
</center><br />
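A minimal numpy sketch of this entropy penalty; the names and toy distributions are illustrative, not from the paper:

```python
import numpy as np

def entropy_penalty(probs, eps=1e-12):
    """Mean entropy of the class-conditional distributions f_t(x) over D_t.

    probs: (n_t, c) rows of predicted class probabilities on target data.
    """
    return -(probs * np.log(probs + eps)).sum(axis=1).mean()

confident = np.array([[0.98, 0.01, 0.01], [0.01, 0.98, 0.01]])
uncertain = np.array([[0.34, 0.33, 0.33], [0.33, 0.34, 0.33]])
# Low-density separation: confident predictions have far lower entropy, so
# minimising this term pushes decision boundaries away from target points.
print(entropy_penalty(confident) < entropy_penalty(uncertain))  # True
```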
<br />
The combination of the residual learning and the entropy penalty, the authors hypothesize will enable effective classifier adaptation.<br />
<br />
=== Residual Transfer Network ===<br />
The combination of the MMD loss introduced in feature adaptation, the residual block introduced in classifier adaptation, and the application of the entropy minimization principle culminates in the Residual Transfer Network proposed by the authors. The model is optimized according to the following loss function, which combines the standard cross-entropy, MMD penalty, and entropy penalty:<br />
<br />
<center><br />
<math display="block"><br />
\min_{f_s = f_t + \Delta f} \underbrace{\left(\frac{1}{n_s} \sum_{i=1}^{n_s} L(f_s(x_i^s), y_i^s)\right)}_{\text{Typical Cross-Entropy}} + \underbrace{\frac{\gamma}{n_t}\left(\sum_{i=1}^{n_t} H(f_t(x_i^t)) \right)}_{\text{Target Entropy Minimization}} + \underbrace{\lambda\left(D_{\mathcal{L}}(\mathcal{D}_s, \mathcal{D}_t)\right)}_{\text{MMD Penalty}}<br />
</math><br />
</center><br />
<br />
where $\gamma$ and $\lambda$ are tradeoff parameters for the entropy penalty and the MMD penalty, respectively. As the classifier adaptation proposed in this paper and the feature adaptation studied in [5, 6] are tailored to adapt different layers of deep networks, they are expected to complement each other and yield better performance.<br />
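The full objective can be sketched as follows. The `mmd_linear` term below is a simplified, linear (mean-embedding) stand-in for the paper's multi-kernel MMD, used only to show where each term enters the loss; all function names are illustrative:<br />

```python
import numpy as np

def cross_entropy(probs, labels, eps=1e-12):
    # Mean negative log-likelihood of the true labels on the source domain.
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + eps))

def mean_entropy(probs, eps=1e-12):
    # Target-domain entropy penalty, as defined earlier.
    return -np.mean(np.sum(probs * np.log(probs + eps), axis=1))

def mmd_linear(feat_s, feat_t):
    # Simplified MMD: squared distance between mean feature embeddings.
    # (The paper uses a multi-kernel MMD over fused layer features.)
    d = feat_s.mean(axis=0) - feat_t.mean(axis=0)
    return float(d @ d)

def rtn_loss(src_probs, src_labels, tgt_probs, feat_s, feat_t,
             gamma=0.3, lam=0.3):
    # Cross-entropy + gamma * target entropy + lambda * MMD, as in the text.
    return (cross_entropy(src_probs, src_labels)
            + gamma * mean_entropy(tgt_probs)
            + lam * mmd_linear(feat_s, feat_t))
```

In practice each term is computed on a mini-batch and the sum is minimized by SGD over the shared features and the target classifier.<br />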
<br />
The full network, which is trained subject to the above optimization problem, thus takes on the following structure.<br />
<br />
[[File:rtn-full-paper-structure.png||center|alt=The Structure of the RTN]]<br />
<br />
== Experiments == <br />
<br />
=== Set-up ===<br />
The performance of RTN was jointly compared across two key data sets in the area of Unsupervised Domain Adaptation. Specifically, Office-31 (discussed in the introduction) and Office-Caltech (maintained by the same project group). Office-31 is comprised of images from 3 sources, Amazon ('''A'''), Webcam ('''W'''), and DSLR ('''D'''), of 31 different objects. Office-Caltech is derived by considering 10 classes common to both the Office-31 and the Caltech data sets, thus providing further adaptation possibilities. This provides 6 Transfer Tasks on the 31 classes of Office-31 ($\{(A,W), (A,D), (W,A), (W,D), (D,A), (D,W)\}$) and 12 Transfer Tasks on the 10 classes of Office-Caltech ($\{(A,W), (A,D), (A,C), (W,A), (W,D), (W,C), (D,A), (D,W), (D,C), (C,A), (C,W), (C,D)\}$).<br />
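The transfer tasks listed above are simply the ordered (source, target) pairs of domains; a small illustrative sketch:<br />

```python
from itertools import permutations

office31_domains = ["A", "W", "D"]             # Amazon, Webcam, DSLR
office_caltech_domains = ["A", "W", "D", "C"]  # the above plus Caltech

# Each ordered (source, target) pair of distinct domains is one transfer task.
tasks_31 = list(permutations(office31_domains, 2))
tasks_oc = list(permutations(office_caltech_domains, 2))

assert len(tasks_31) == 6    # 3 * 2 ordered pairs on Office-31
assert len(tasks_oc) == 12   # 4 * 3 ordered pairs on Office-Caltech
```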
<br />
The authors then compare the results on the 18 different adaptation tasks against 6 other models. In order to determine the efficacy of the various contributions outlined in the paper, they perform an ablation study, evaluating variants of the RTN. Specifically, they consider the RTN with only the MMD module ('''RTN (mmd)'''), the RTN with the MMD module and the entropy minimization ('''RTN (mmd+ent)'''), and the complete RTN ('''RTN (mmd+ent+res)'''). The experiments leverage all the labeled training data and compute accuracy across all unlabeled domain data. The parameters of the model (i.e. $\gamma$, and $\lambda$) are fixed based on a single validation point on the transfer task $\mathbf{A}\to\mathbf{W}$. These parameters are then maintained across all transfer tasks. <br />
<br />
As for specification details, the authors use mini-batch SGD with momentum $0.9$, and with the learning rate annealed according to $\eta_p = \frac{\eta_0}{(1 + \alpha p)^\beta}$, where $p$ indicates the fraction of training completed (increasing linearly from $0$ to $1$), $\eta_0 = 0.01$, $\alpha = 10$ and $\beta = 0.75$, which was optimized for low error on the source domain. The MMD and entropy parameters, set as above, were maintained at $\lambda = 0.3$ and $\gamma = 0.3$.<br />
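The annealing schedule is straightforward to reproduce; a small sketch with the parameter values stated in the text:<br />

```python
def lr_schedule(p, eta0=0.01, alpha=10.0, beta=0.75):
    """Annealed learning rate eta_p = eta0 / (1 + alpha * p)**beta,
    where p in [0, 1] is the fraction of training completed."""
    return eta0 / (1.0 + alpha * p) ** beta

assert lr_schedule(0.0) == 0.01             # starts at eta0
assert lr_schedule(1.0) < lr_schedule(0.5)  # decays monotonically
```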
<br />
=== Results ===<br />
[[File:table-1-results.PNG|thumb|right|Results from the Office-31 Experiment]][[File:table-2-results.PNG|thumb|right|Results from the Office-Caltech Experiment]]<br />
In aggregate, the network outperformed all comparison methods across all transfer tasks. Broadly speaking, the network saw the largest increases in accuracy on the hard transfer tasks (for instance $\mathbf{A} \to \mathbf{C}$), where the discrepancy between the source and target domains is large. The authors take this to mean that the proposed model learns "more adaptive classifiers and transferable features for safer domain adaptation." They further indicate that standard deep learning techniques (i.e. just AlexNet) perform similarly to standard shallow techniques (TCA and GFK). Deep-transfer methods which focus on feature adaptation perform significantly better than the standard methods. The proposed RTN, which adds in additional considerations for classifier adaptation, performs even better.<br />
<br />
In addition, the ablation study found a number of interesting results:<br />
# The RTN (mmd) outperforms DAN, which is founded on a similar method, but contains multiple MMD penalties (one for each layer instead of on a bottleneck), and is as such less computationally efficient;<br />
# The addition of the entropy penalty [RTN (mmd+ent)] provides significant marginal benefit over the previous RTN (mmd);<br />
# The full RTN [RTN (mmd+ent+res)] performs the best of all variants, but diminishing returns are seen over the addition of the entropy penalty.<br />
<br />
Overall, the authors claim that RTN (mmd+ent+res) now represents the state of the art for unsupervised domain adaptation.<br />
<br />
=== Discussion ===<br />
[[File:t-sne-embeddings.png|thumb|left|t-SNE Embeddings Comparing the Performance of DAN and RTN]] <br />
[[File:mean-sd-layer-outputs.png|thumb|right|The Mean and Standard Deviations of the outputs from the Source Classifier, Target Classifier, and Residual Functions. As expected, the residual function provides a small, but non-zero, contribution.]] <br />
[[File:gamma-tradeoff.png|thumb|left|The accuracy of tests by varying the parameter $\gamma$. We first see an increase in accuracy up to an ideal point, before having the accuracy fall again.]]<br />
[[File:classifier-shift.png|thumb|right|The corresponding weights of the classifier layers, if trained on the labeled source and target data, exhibiting the differences which exist between the two classifiers in an ideal state. ]]<br />
<br />
==== Visualizing Predictions (Versus DAN) ====<br />
DAN uses a similar method for feature adaptation but neglects any attempt at classifier adaptation (i.e. it makes the shared-classifier assumption). In order to demonstrate that this leads to worse performance, the authors provide images showing the t-SNE embeddings by DAN and RTN on the transfer task $\mathbf{A} \to \mathbf{W}$. t-SNE is a nonlinear dimensionality reduction technique that is particularly well-suited for embedding high-dimensional data into a space of two or three dimensions, which can then be visualized in a scatter plot [18]. The images show that the target categories are not well discriminated by the source classifier, suggesting a violation of the shared-classifier assumption. Conversely, the target classifier for the RTN exhibits better discrimination.<br />
<br />
==== Layer Responses and Classifier Shift ==== <br />
The authors further consider the mean and standard deviation of the outputs of $f_S(x)$, $f_T(x)$ and $\Delta f(x)$ to consider the relative contributions of the different components. As expected, $\Delta f(x)$ provides a small (though non-zero) contribution to the learned source classifier. This provides some merit to the idea of residual learning on the classifiers. <br />
<br />
In addition, the authors train classifiers on the source and target data, with labels present, and compare the realized weights. This is used to test how different the ideal weights are on separate classifiers. The results suggest that there is, in fact, a discrepancy between the classifiers, further motivating the use of tactics to avoid the shared-classifier assumption. <br />
<br />
==== Parameter Sensitivity ==== <br />
Lastly, the authors test the sensitivity of these results against the parameter $\gamma$. They run this test on $\mathbf{A}\to\mathbf{W}$ in addition to $\mathbf{C}\to\mathbf{W}$, varying the parameter from $0.01$ to $1.0$. They find that, on both tasks, the increase of the parameter initially improves accuracy, before seeing a drop-off.<br />
<br />
== Conclusion ==<br />
This paper presented a novel approach to unsupervised domain adaptation which relaxed the shared-classifier assumption made by previous models. The emphasis of the paper is on unsupervised domain adaptation and on the mismatch between the source and target classifiers (i.e. the marginal distribution difference between source and target). The proposed deep residual network learns the target classifier through a perturbation function, defined as the difference between the source and target classifiers. The network also couples feature learning and feature adaptation to reduce the marginal distribution shift.<br />
<br />
Like previous models, this proposed network leverages feature adaptation by matching the distributions of features across the domains. In addition, using a residual network and entropy minimization tactic, the target classifier is allowed to differ from the source classifier by implementing a new residual transfer module as the bridge. In particular, this approach allows for easy integration into existing networks and can be implemented with any standard deep learning software.<br />
<br />
For follow-up considerations, the authors propose looking for adaptations which may be useful in the semi-supervised domain adaptation problem.<br />
<br />
== Critique ==<br />
While the paper presents a clear approach, which empirically attains great results on the desired tasks, I question the benefit of the residual block that is employed. The results of the ablation study seem to suggest that the majority of the benefits can be derived from using the MMD and entropy penalties. The residual block appears to add marginal, perhaps insignificant, contributions to the outcome. (In practical applications, there is no guarantee that the source classifier and target classifier can be safely shared, so the residual transfer of classifier layers is arguably critical.) Despite this, the use of the MMD loss is not novel, and the entropy loss is less well documented and less thoroughly explored. Perhaps a different set of ablations would have indicated that the three parts are indeed equally effective (and that the diminishing returns stem from stacking the three methods), but as presented, I question the utility of the final structure versus a less complicated, less novel approach. The authors also do not evaluate their results in terms of the <math> \mathcal{H}\Delta\mathcal{H} </math> divergence, which defines a discrepancy distance [6] between two distributions <math> \mathcal{S} </math> (source) and <math> \mathcal{T} </math> (target) with respect to a hypothesis set <math> \mathcal{H} </math>. Using it, one can obtain a probabilistic bound [19] on the performance <math> \epsilon_T(h) </math> of a classifier <math> h \in \mathcal{H} </math> evaluated on the target domain, given its performance <math> \epsilon_S(h) </math> on the source domain.<br />
<br />
The same authors have further improved on their methods since the release of the present paper [20]. Their latest approach uses joint adaptation networks. The network processes source and target domain data using CNNs. The joint distributions of these activations are then aligned. The authors claim that this method yields state of the art results with a simpler training procedure.<br />
<br />
One assumption the authors make is that the feature maps for the source and target distributions are the same, and that the two domains differ only in the classifier part; this should be made more explicit.<br />
<br />
==References==<br />
# https://en.wikipedia.org/wiki/Domain_adaptation<br />
# https://people.eecs.berkeley.edu/~jhoffman/domainadapt/<br />
# Glorot, Xavier, Antoine Bordes, and Yoshua Bengio. "Domain adaptation for large-scale sentiment classification: A deep learning approach." Proceedings of the 28th international conference on machine learning (ICML-11). 2011.<br />
# Tzeng, Eric, et al. "Deep domain confusion: Maximizing for domain invariance." arXiv preprint arXiv:1412.3474 (2014).<br />
# Long, Mingsheng, et al. "Learning transferable features with deep adaptation networks." International Conference on Machine Learning. 2015.<br />
# Ganin, Yaroslav, and Victor Lempitsky. "Unsupervised domain adaptation by backpropagation." International Conference on Machine Learning. 2015.<br />
# Tzeng, Eric, et al. "Simultaneous deep transfer across domains and tasks." Proceedings of the IEEE International Conference on Computer Vision. 2015.<br />
# He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.<br />
# Yang, Jun, Rong Yan, and Alexander G. Hauptmann. "Cross-domain video concept detection using adaptive svms." Proceedings of the 15th ACM international conference on Multimedia. ACM, 2007.<br />
# Duan, Lixin, et al. "Domain adaptation from multiple sources via auxiliary classifiers." Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009.<br />
# Duan, Lixin, et al. "Visual event recognition in videos by learning from web data." IEEE Transactions on Pattern Analysis and Machine Intelligence 34.9 (2012): 1667-1680.<br />
# http://vision.stanford.edu/teaching/cs231b_spring1415/slides/alexnet_tugce_kyunghee.pdf<br />
# https://en.wikipedia.org/wiki/Reproducing_kernel_Hilbert_space<br />
# Long, Mingsheng, et al. "Learning transferable features with deep adaptation networks." International Conference on Machine Learning. 2015.<br />
# He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.<br />
# Grandvalet, Yves, and Yoshua Bengio. "Semi-supervised learning by entropy minimization." Advances in neural information processing systems. 2005.<br />
# More information on residual functions https://www.youtube.com/watch?v=urAp0DibYlY <br />
# Maaten, Laurens van der, and Geoffrey Hinton. "Visualizing data using t-SNE." Journal of Machine Learning Research 9.Nov (2008): 2579-2605.<br />
# Ben-David, Shai, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. "A theory of learning from different domains." Machine Learning 79 (2010).<br />
# M. Long, H. Zhu, J. Wang, M. I. Jordan. Deep Transfer Learning with Joint Adaptation Networks. Proceedings of the 34th International Conference on Machine Learning. 2017.<br />
<br />
Expert review from the NIPS community can be found in https://media.nips.cc/nipsbooks/nipspapers/paper_files/nips29/reviews/99.html.<br />
<br />
Implementation Example: https://github.com/thuml/Xlearn</div>Jdenghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Conditional_Image_Synthesis_with_Auxiliary_Classifier_GANs&diff=31320Conditional Image Synthesis with Auxiliary Classifier GANs2017-11-23T19:13:14Z<p>Jdeng: /* Model */</p>
<hr />
<div>'''Abstract:''' "In this paper we introduce new methods for the improved training of generative adversarial networks (GANs) for image synthesis. We construct a variant of GANs employing label conditioning that results in 128×128 resolution image samples exhibiting global coherence. We expand on previous work for image quality assessment to provide two new analyses for assessing the discriminability and diversity of samples from class-conditional image synthesis models. These analyses demonstrate that high resolution samples provide class information not present in low resolution samples. Across 1000 ImageNet classes, 128×128 samples are more than twice as discriminable as artificially resized 32×32 samples. In addition, 84.7% of the classes have samples exhibiting diversity comparable to real ImageNet data." [[#References | (Odena et al., 2016)]]<br />
<br />
= Introduction =<br />
<br />
=== Motivation ===<br />
The authors introduce a GAN architecture for generating high resolution images from the ImageNet dataset. They show that this architecture makes it possible to split the generation process into many sub-models. They further suggest that GANs have trouble generating globally coherent images, and that this architecture is responsible for the coherence of their samples. They experimentally demonstrate that generating higher resolution images allows the model to encode more class-specific information, making the images more visually discriminable than lower resolution images even after they have been resized to the same resolution.<br />
<br />
The second half of the paper introduces metrics for assessing visual discriminability and diversity of synthesized images. The discussion of image diversity in particular is important due to the tendency for GANs to 'collapse' to only produce one image that best fools the discriminator [[#References|(Goodfellow et al., 2014)]].<br />
<br />
=== Previous Work ===<br />
<br />
Of all image synthesis methods (e.g. variational autoencoders, autoregressive models, invertible density estimators), GANs have become one of the most popular and successful due to their flexibility and the ease with which they can be sampled from. A standard GAN framework pits a generative model $G$ against a discriminative adversary $D$. The goal of $G$ is to learn a mapping from a latent space $Z$ to a real space $X$ to produce examples (generally images) indistinguishable from training data. The goal of $D$ is to iteratively learn to predict whether a given input image is from the training set or is a synthesized image from $G$. Jointly, the models are trained to solve the game-theoretical minimax problem, as defined by [[#References|Goodfellow et al. (2014)]]: $$\underset{G}{\text{min }}\underset{D}{\text{max }}V(G,D)=\mathbb{E}_{X\sim p_{data}(x)}[log(D(X))]+\mathbb{E}_{Z\sim p_{Z}(z)}[log(1-D(G(Z)))]$$<br />
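The two expectations in $V(G,D)$ can be estimated by Monte Carlo from discriminator outputs on batches of real and generated samples; a minimal sketch (the function name and inputs are illustrative, not from the paper):<br />

```python
import numpy as np

def gan_value(d_real, d_fake, eps=1e-12):
    """Monte Carlo estimate of V(G, D):
    E[log D(X)] over real samples + E[log(1 - D(G(Z)))] over generated ones.
    d_real, d_fake: arrays of discriminator probabilities in (0, 1)."""
    return (np.mean(np.log(d_real + eps))
            + np.mean(np.log(1.0 - d_fake + eps)))

# At the theoretical optimum against a perfect generator, the discriminator
# outputs 0.5 everywhere, giving V = log(1/4).
d = np.full(1000, 0.5)
assert abs(gan_value(d, d) - np.log(0.25)) < 1e-6
```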
<br />
While this initial framework has clearly demonstrated great potential, other authors have proposed changes to the method to improve it. Many such papers propose changes to the training process [[#References|(Salimans et al., 2016)]][[#References|(Karras et al., 2017)]], which is notoriously difficult for some problems. Others propose changes to the model itself. [[#References|Mirza & Osindero (2014)]] augment the model by supplying the class of observations to both the generator and discriminator to produce class-conditional samples. According to [[STAT946F17/Conditional Image Generation with PixelCNN Decoders|van den Oord et al. (2016)]], conditioning image generation on classes can greatly improve their quality. Other authors have explored using even richer side information in the generation process with good results [[Learning What and Where to Draw|(Reed et al., 2016)]].<br />
<br />
Another model modification relevant to this paper is to force the discriminator network to reconstruct side information by adding an auxiliary network to classify generated (and real) images. The authors make the claim that forcing a model to perform additional tasks is known to improve performance on the original task [[#References|(Szegedy et al., 2014)]][[#References|(Sutskever et al., 2014)]][[#References|(Ramsundar et al., 2016)]]. They further suggest that using pre-trained image classifiers (rather than classifiers trained on both real and generated images) could improve results over and above what is shown in this paper.<br />
<br />
= Contributions =<br />
<br />
The contributions of this paper are in three main areas. First, the authors propose slight changes to previously existing GAN architectures, resulting in a model capable of generating samples of impressive quality. Second, the authors propose two metrics to assess the quality of samples generated from a GAN. Lastly, they present empirical results on GANs which are of some interest.<br />
<br />
== Model ==<br />
<br />
The authors propose an auxiliary classifier GAN (AC-GAN) which is a slight variation on previous architectures. Like [[#References|Mirza & Osindero (2014)]], the generator takes the image class to be generated as input in addition to the latent encoding $Z$. Like [[#References|Odena (2016)]] and [[#References|Salimans et al. (2016)]], the discriminator is trained to predict not only whether an observation is real or fake, but to classify each observation as well. The marginal contribution of this paper is to combine these in one model.<br />
<br />
Formally, let $C\sim p_c$ represent the target class label of each generated observation and $Z$ represent the usual noise vector from the latent space. Then the generator function takes both as arguments to produce image samples: $X_{fake}=G(c,z)$. The discriminator gives a probability distribution over the source $S$ (real or fake) of the image, as well as a probability distribution over the class labels $C$. $$D(X)=<P(S|X),P(C|X)>$$<br />
<br />
The objective function for the model thus has two parts, one corresponding to the source $L_S$ and the other to the class $L_C$. $D$ is trained to maximize $L_S + L_C$, while $G$ is trained to maximize $L_C-L_S$. Using the notation of Goodfellow et al. (2014), the loss terms are defined as:<br />
$$L_S=\mathbb{E}_{X\sim p_{data}(x)}[log(P(S=real|X))]+\mathbb{E}_{C,Z\sim p_{C,Z}(c,z)}[log(P(S=fake|G(C,Z)))]$$<br />
$$L_C=\mathbb{E}_{X\sim p_{data}(x)}[log(P(C|X))]+\mathbb{E}_{C,Z\sim p_{C,Z}(c,z)}[log(P(C|G(C,Z)))]$$<br />
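These two objectives can be estimated from batches of discriminator outputs; a hedged sketch, where all function and variable names are illustrative rather than the authors':<br />

```python
import numpy as np

def log_mean(p, eps=1e-12):
    # Monte Carlo estimate of an expectation of log-probabilities.
    return float(np.mean(np.log(p + eps)))

def acgan_losses(s_real, s_fake, c_real, c_fake):
    """AC-GAN objectives, following the expectations above.

    s_real: P(S=real | X) on real images; s_fake: P(S=real | G(C,Z)) on fakes.
    c_real / c_fake: probability the discriminator assigns to the *correct*
    class label for real / generated images.
    D is trained to maximize L_S + L_C; G is trained to maximize L_C - L_S."""
    L_S = log_mean(s_real) + log_mean(1.0 - s_fake)
    L_C = log_mean(c_real) + log_mean(c_fake)
    return L_S, L_C
```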
<br />
Because G accepts both $C$ and $Z$ as arguments, it is able to learn a mapping $Z\rightarrow X$ that is independent of $C$. The authors argue that all class-specific information should be represented by $C$, allowing $Z$ to represent other factors such as pose, background, etc.<br />
<br />
Lastly the authors split the generation process into many class-specific submodels. They point out that the structure of their model permits this split, though it should technically be possible for even the standard GAN framework by dividing the training data into groups according to their known class labels.<br />
<br />
The changes above result in a model capable of generating (some) image samples with both high resolution and global coherence.<br />
<br />
== GAN Quality Metrics ==<br />
<br />
A much larger part of the paper is spent on measuring the quality of a GAN's output. As the authors say, evaluating a generative model's quality is difficult due to a large number of probabilistic measures (such as average log-likelihood, Parzen window estimates, and visual fidelity [[#References| (Theis et al., 2015)]]) and "a lack of a perceptually meaningful image similarity metric".<br />
<br />
=== Image Discriminability Metric ===<br />
The authors develop two metrics in this paper to address these shortcomings. The first of these is a discriminability metric, the goal of which is to assess the degree to which generated images are identifiable as the class they are meant to represent. Ideally a team of non-expert humans could handle this, but the difficulty of such an approach makes the need for an automated metric apparent. The metric proposed by the authors is to measure the accuracy of a pre-trained image classifier trained on the pristine training data. For this they select a [https://github.com/openai/improved-gan/ modified version of Inception-v3][[#References|(Szegedy et al., 2015)]].<br />
<br />
Other metrics already exist for assessing image quality, the most popular of which is probably the Inception Score [[#References| (Salimans et al., 2016)]]. The authors list two main advantages of their approach over the Inception Score. The first is that accuracy figures are easier to interpret than the Inception Score, which is fairly self-evident. The second advantage of using Inception accuracy instead of Inception Score is that Inception accuracy may be calculated for individual classes, giving a better picture of where the model is strong and where it is weak.<br />
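Overall and per-class accuracy of the pre-trained classifier are easy to express; a minimal sketch (the function name and interface are illustrative, and the Inception-v3 classifier producing the predictions is omitted):<br />

```python
import numpy as np

def inception_style_accuracy(pred, true, n_classes):
    """Overall and per-class accuracy of a pre-trained classifier's
    predictions on generated images. `pred` and `true` are integer
    label arrays of equal length."""
    overall = float(np.mean(pred == true))
    per_class = [float(np.mean(pred[true == k] == k))
                 for k in range(n_classes) if np.any(true == k)]
    return overall, per_class
```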
<br />
=== Image Diversity Metric ===<br />
<br />
The second metric proposed by the authors measures the diversity of the generated images. As mentioned above, image diversity is an important quality in a GAN; a common failure mode GANs suffer from is 'collapsing', where the generator learns to output only one image that is good at fooling the discriminator [[#References|(Goodfellow et al., 2014)]][[#References|(Salimans et al., 2016)]]. The metric proposed in this section is intended to be complementary to the Inception accuracy, as Inception accuracy would not detect generator collapse.<br />
<br />
For their diversity metric, the authors co-opt an existing metric used to measure the similarity between two images: multi-scale structural similarity (MS-SSIM) [[#References|(Wang et al., 2003)]]. The authors do not go into detail about how the MS-SSIM measure is calculated, except to say that it is one of the more successful ways to predict human perceptual similarity judgements. It takes values on the interval $[0,1]$, and higher values indicate that the two images being compared are perceptually more similar. For images generated from a GAN, then, the metric should ideally be low, as diversity is desired.<br />
<br />
The authors' contribution is to use this metric to assess the diversity of a GAN's output. It is a pairwise comparison between two images, so their solution is to compare 100 images from each class (that is, $100\cdot 99$ paired comparisons) and take the mean MS-SSIM score.<br />
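Their procedure can be sketched as a mean over pairwise scores; `similarity` below is a placeholder for an MS-SSIM implementation (not reproduced here), and since MS-SSIM is symmetric the sketch iterates over unordered pairs:<br />

```python
import numpy as np
from itertools import combinations

def mean_pairwise_score(images, similarity):
    """Mean similarity over all unordered pairs of images from one class.

    `similarity(a, b)` stands in for MS-SSIM. A low mean score indicates
    a diverse set of samples; a score near 1 suggests generator collapse."""
    scores = [similarity(a, b) for a, b in combinations(images, 2)]
    return float(np.mean(scores))
```

For example, a batch of identical images scores the maximum under any sensible similarity measure, which is exactly the collapse signature the metric is meant to detect.<br />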
<br />
The authors make two points about their use of this metric. First, the way they apply the metric is different from how it was originally intended to be used. It is possible that it will not behave as desired because of this. As evidence to the contrary, they state that:<br />
# Visually the metric seems to work. Pairs with high MS-SSIM seem more similar.<br />
# Comparisons are only made between images in the same class, keeping their application of the metric closer to its original use case of measuring the quality of compression algorithms.<br />
# The metric is not saturated. Scores on their generated data vary across the unit interval. If scores were all very close to zero the metric would not be much use.<br />
<br />
The second point they raise is that the mean MS-SSIM metric is not intended as a proxy for entropy of the generator distribution in pixel space. That measure is hard to compute, and in any case is sensitive to trivial changes in the pixels, whereas the true intention of this metric is to measure perceptual similarity.<br />
<br />
== Experimental Results on GAN Properties ==<br />
<br />
[[File:Figure_2_(Bottom).JPG|thumb|500px|right|alt=(Odena et al., 2016) Figure 2: (Left) Inception accuracy (y-axis) of two generators with resolution 128 x 128 (red) and 64 x 64 (blue). Images are resized to the same spatial resolution (x-axis). (Right) Class accuracies from the 128 x 128 AC-GAN at full resolution (x-axis) and downsized to 32 x 32 (y-axis).|(Odena et al., 2016) Figure 2: (Left) Inception accuracy (y-axis) of two generators with resolution 128 x 128 (red) and 64 x 64 (blue). Images are resized to the same spatial resolution (x-axis). (Right) Class accuracies from the 128 x 128 AC-GAN at full resolution (x-axis) and downsized to 32 x 32 (y-axis).]]<br />
<br />
The authors conduct several experiments on their model and proposed metrics. These are summarized in this section.<br />
<br />
=== Higher Resolution Images are more Discriminable ===<br />
<br />
As it is one of the main attractions of this paper, the authors investigate how generating samples at different resolutions affects their discriminability. To achieve this, two models are trained: one that generates 64 x 64 resolution images and one that generates 128 x 128 resolution images. These images can be rescaled using bilinear interpolation to make them directly comparable. The authors find that the 128 x 128 AC-GAN (on average) achieves higher discriminability (per the Inception accuracy metric [[#Image Discriminability Metric|introduced above]]) at all resized resolutions. The authors claim that theirs is the first work to investigate how much an image generator is 'making use of its given output resolution'.<br />
<br />
=== Generated Images are both Diverse and Discriminable ===<br />
<br />
A second experiment conducted in the paper aims to investigate how the two metrics they propose interact with each other. They simply calculate the Inception accuracy and mean MS-SSIM score for a batch of generated images from every class in their data and report the correlation between the scores. They find the scores are anti-correlated with $\rho=-0.16$. Because the mean MS-SSIM metric is low for diverse samples, they conclude that accuracy and diversity are actually positively correlated, which is contradictory to the hypothesis that GANs that collapse achieve better sample quality.<br />
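The correlation computation itself is simple; the sketch below uses synthetic per-class scores (not the paper's data) purely to illustrate the calculation and the sign convention:<br />

```python
import numpy as np

# Hypothetical per-class scores for 100 classes: Inception accuracy and
# mean MS-SSIM, constructed so that more accurate classes are also more
# diverse (lower MS-SSIM), mimicking the paper's qualitative finding.
rng = np.random.default_rng(0)
accuracy = rng.uniform(0.2, 0.9, size=100)
ms_ssim = 0.8 - 0.3 * accuracy + rng.normal(0.0, 0.02, size=100)

# Pearson correlation between the two per-class scores. A negative rho
# (the paper reports -0.16) means accuracy and MS-SSIM are anti-correlated,
# i.e. accuracy and diversity are positively correlated.
rho = np.corrcoef(accuracy, ms_ssim)[0, 1]
assert rho < 0
```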
<br />
[[File:Figure_10.JPG|thumb|300px|right|alt=(Odena et al., 2016) Figure 10: Mean MS-SSIM scores for 10 ImageNet classes (y-axis) plotted against the number of classes handled by each sub-model (x-axis).|(Odena et al., 2016) Figure 10: Mean MS-SSIM scores for 10 ImageNet classes (y-axis) plotted against the number of classes handled by each sub-model (x-axis).]]<br />
<br />
=== Effect of Class Splits on Image Sample Quality ===<br />
<br />
The final experiment conducted in the paper investigates how the number of classes a fixed GAN model has to generate impacts the diversity of the generated images. In their main model the authors split the data such that each AC-GAN only had to generate images from 10 classes. Here they experimented with changing the number of classes each sub-model has to generate while holding the architecture of the sub-models fixed. They report the mean MS-SSIM score of the generated images from the original classes (when the model had only 10 to generate) at each split level. Perhaps unsurprisingly, the diversity of the outputs drops as the number of classes the model has to handle increases. Giving the model more parameters to handle the larger (and more diverse) set of classes might possibly eliminate this effect.<br />
<br />
Finally, they state that they were unable to get a regular GAN to converge on the task of generating images from one class per model. This could be due to the difficulty of training GANs and the limited amount of training data available for each class rather than any theoretical property of class splits.<br />
<br />
= Results =<br />
<br />
=== Model ===<br />
The authors apply their AC-GAN model to the ImageNet [[#References|(Russakovsky et al., 2015)]] dataset. The data have 1000 classes which the authors split into 100 groups of 10. An AC-GAN model is trained on each group of 10 to give results reported for the paper. The authors give some examples of the images generated from this setup. They note that these are selected to show the success of the model, not give a balanced representation of how good it is:<br />
<br />
[[File:Figure_1_AC-GAN.JPG|thumb|1000px|center|alt=(Odena et al., 2016) Figure 1: Selected images generated by the AC-GAN model for the ImageNet dataset.|(Odena et al., 2016) Figure 1: Selected images generated by the AC-GAN model for the ImageNet dataset.]]<br />
<br />
The apparent value of the model is its ability to generate high-resolution samples with global coherence. From the images given above one must acknowledge they are impressively realistic. The authors link to a [https://photos.google.com/share/AF1QipPzTToH3wxrKoF8l5nWvSNz7D_oS-KB3YMQ6Ji-4XK3AtgJmlb5QCdqRQLqLSjfkw?key=YnhJb2UyMnZwb01oZ2xhUjBraE9fdWU1VVpNZTVB full set] of images generated from every ImageNet class as well. As one would expect from their acknowledgement that images above are selected to be most impressive, not all samples at the linked page are quite so coherent.<br />
<br />
[[File:Figure_9_(Bottom).JPG|thumb|400px|right|alt=(Odena et al., 2016) Figure 9 (Bottom): Each column is a different class. Each row is generated by a different latent encoding $z$.|(Odena et al., 2016) Figure 9 (Bottom): Each column is a different class. Each row is generated by a different latent encoding $z$.]]<br />
<br />
The authors compare their model with state-of-the-art results from [[#References| Salimans et al. (2016)]] on the CIFAR-10 dataset at a 32 x 32 resolution. To score the two models they use Inception Score instead of log-likelihood, which they claim is too inaccurate to be reported. Their model achieves a score of $8.25 \pm 0.07$ versus the previous state-of-the-art of $8.09 \pm 0.07$.<br />
<br />
[[#References|Odena et al. (2016)]] argue that the class conditional generator allows $G$ to learn a representation of $Z$ independent of $C$ in section 3, and give evidence of the claim later in section 4.5 by showing that images generated with a fixed latent vector $z$ but different class labels $c$ have similar global structure (e.g. orientation of the subject) but the subjects (bird species) vary according to the label. Interestingly, the background (especially in the top row) also varies with the class label. This can possibly be attributed to the bird species coming from different areas, hence a seagull might be expected to have an ocean background. Clearly here the model benefits from the fact that the authors grouped similar classes together. A more interesting analysis might show the same comparison between different classes, such as birds and forklifts, to see how global structure is encoded across them.<br />
<br />
The authors also include a discussion of whether their model is overfitting the training data. Their first test is to find the nearest neighbour in the training set of a generated image by the L1 measure in pixel space and visually compare the two images. This is a fairly naive approach, since the L1 loss in pixel space is extremely unlikely to identify whether two images are perceptually similar. Here would have been a good place to use the MS-SSIM metric to identify the nearest neighbours, since it is intended to measure perceptual similarity. The images they provide from this analysis are below.<br />
<br />
[[File:Figure_8.JPG|thumb|500px|center|alt=(Odena et al., 2016) Figure 8: Images generated by the AC-GAN model (top) and their nearest neighbour (L1 measure in pixel space) in the ImageNet dataset (bottom).|(Odena et al., 2016) Figure 8: Images generated by the AC-GAN model (top) and their nearest neighbour (L1 measure in pixel space) in the ImageNet dataset (bottom).]]<br />
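For readers who want to try the same check, the naive lookup the summary describes is only a few lines of illustrative, pure-Python code; the toy "images" below are flat lists of pixel intensities, not real ImageNet data, and all names are made up for this sketch:

```python
def l1_distance(a, b):
    """Sum of absolute pixel differences between two flattened images."""
    return sum(abs(x - y) for x, y in zip(a, b))

def nearest_neighbour_l1(generated, training_set):
    """Index of the training image closest to `generated` in raw pixel
    space, the comparison the authors use for their Figure 8 check."""
    return min(range(len(training_set)),
               key=lambda i: l1_distance(generated, training_set[i]))

train = [[0, 0, 0, 0], [10, 20, 30, 40], [90, 80, 70, 60]]
query = [12, 19, 33, 41]                    # closest to train[1]
print(nearest_neighbour_l1(query, train))   # -> 1
```

As the summary notes, a small pixel-wise distance says little about perceptual similarity, which is exactly the weakness of this check.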
<br />
A second check they make is that interpolating between two generated images in the latent space does not result in any discrete transitions or holes in the image interpolation. Such results would be indicative of overfitting. The images they give as evidence that this is not the case are below.<br />
<br />
[[File:Figure_9_(Top).JPG|thumb|1000px|center|alt=(Odena et al., 2016) Figure 9 (Top): Latent space interpolations of the AC-GAN model.|(Odena et al., 2016) Figure 9 (Top): Latent space interpolations of the AC-GAN model.]]<br />
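The interpolation check itself is straightforward to sketch; the following pure-Python snippet (names and dimensions are illustrative) builds the evenly spaced latent vectors that would be fed to the generator:

```python
def interpolate_latents(z_a, z_b, steps=8):
    """Evenly spaced points on the line segment between two latent vectors.
    Feeding each intermediate z to the generator should give a smooth
    sequence of images if the model has not merely memorised training data."""
    return [[(1 - t) * a + t * b for a, b in zip(z_a, z_b)]
            for t in (i / (steps - 1) for i in range(steps))]

path = interpolate_latents([0.0, 0.0], [1.0, 2.0], steps=5)
print(path[0], path[2], path[-1])  # -> [0.0, 0.0] [0.5, 1.0] [1.0, 2.0]
```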
<br />
Moreover, another way to study the overfitting problem is to explore the effect of the latent space on the AC-GAN by exploiting the structure of the model. The information representation in AC-GAN includes class information and a class-independent latent representation $z$. Sampling the network with $z$ fixed but altering the class label corresponds to generating samples with the same ‘style’ across multiple classes. Figure 9 (Bottom) shows that as the class changes across columns, elements of the global structure (e.g. position, layout, background) are preserved.<br />
<br />
[[File:Fig 9 bottom.png|thumb|1000px|center|alt=(Odena et al., 2016) Figure 9 (Bottom): Class-independent information contains global structure about the synthesized image.|(Odena et al., 2016) Figure 9 (Bottom): Class-independent information contains global structure about the synthesized image.]]<br />
<br />
=== Image Diversity Metric ===<br />
<br />
Another result the authors report is the performance of their image diversity metric. It is difficult to evaluate quantitatively, but visually we see that the scores do appear to capture the perceptual diversity of the generated class. For example the 'artichoke' generator appears to have collapsed, and has a high score, while the 'promontory' generator seems fairly diverse and has a low score:<br />
<br />
[[File:Figure_3.JPG|thumb|700px|center|alt=(Odena et al., 2016) Figure 3: Mean MS-SSIM scores and sample images from selected generated (top) and real (bottom) ImageNet classes.|(Odena et al., 2016) Figure 3: Mean MS-SSIM scores and sample images from selected generated (top) and real (bottom) ImageNet classes.]]<br />
<br />
= Critique =<br />
=== Model ===<br />
<br />
A major attraction of this paper is the impressive quality of samples generated by the model. GANs often generate samples that are locally plausible but globally not realistic (e.g. a generated image of a dog has fur but the overall shape is not distinguishable). As we have seen in this critique, and as acknowledged by the authors, the most impressive samples are not representative of the model's overall performance.<br />
<br />
The model itself is not a very big advancement of the field. It combines two ideas that are both already prevalent in the research without any other justification than that it seems like a natural thing to do. As [https://openreview.net/forum?id=BkDDM04Ke other reviewers] have noted, investigating how much value the proposed model adds by comparing it with other models that only implement one (or neither) of the changes would have made this paper a slightly more interesting read.<br />
<br />
Another criticism I have about the paper is about how they report their results. To compare with [[#References|Salimans et al. (2016)]] they use Inception Score rather than log-likelihood, which they claim is the standard. Even if their model performed worse by that measure it ought to be included with the caveat they mentioned. The models are evaluated on a different dataset and at a lower spatial resolution than was used for the rest of the paper. By the Inception Score their results are better on average but might not be significantly different given how close they are. Finally, they did not apply the mean MS-SSIM score they developed in this paper to evaluate their model against [[#References|Salimans et al. (2016)]]. This would have been a natural point to make, but instead they generate four samples from each model as their evidence.<br />
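On the significance question, a rough two-sample z test makes the point concrete, under the assumption (not stated in either paper) that the reported ± values are independent standard errors:

```python
import math

# Reported CIFAR-10 Inception Scores (mean ± s.e.): AC-GAN 8.25 ± 0.07,
# previous state of the art (Salimans et al., 2016) 8.09 ± 0.07.
diff = 8.25 - 8.09
se_diff = math.hypot(0.07, 0.07)   # s.e. of the difference, assuming independence
z = diff / se_diff
p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
print(round(z, 2), p_two_sided > 0.05)  # -> 1.62 True
```

With a two-sided p-value above 0.05, the gap between the two scores is not clearly significant, which supports the criticism above.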
<br />
An analysis the authors could have included that was touched upon but not explored in section 4.6 of the paper, and in the [[#Results|Results]] section of this summary, is how the similarity of the classes grouped in each sub-model impact the quality of generated samples. The example I gave above was to compare generated images with the same latent code but very different classes, such as birds and forklifts, to see how the global structure transferred across dissimilar classes.<br />
<br />
A last point to make about the model section is that the authors make some unsupported claims in their discussions of the model's properties. Specifically, they state that their modification to the standard GAN formulation appears to stabilize training but offer no evidence. Another example is their claim that "AC-GANs learn a representation for $z$ that is independent of class label". They cite [[#References|Kingma et al. (2014)]] as evidence of this. From my review of that paper it does not appear that the authors gave evidence for such a claim.<br />
<br />
=== GAN Quality Metrics ===<br />
<br />
The Inception accuracy metric proposed in this paper has the drawback that it is only applicable in a conditional GAN setting, since the standard GAN framework has no ground-truth labels. It is also true that using a pre-trained classifier is only a proxy for determining how much generated images look like the class they are meant to represent, since classifiers are not perfect. Consider the phenomenon of adversarial attacks on classifiers to see this point. However, the advantages the authors list, that the Inception accuracy can be computed on a per-class basis and is easier to interpret than the Inception Score, do have merit. The metric does make sense for the task the authors use it for.<br />
<br />
The same can be said for the mean MS-SSIM metric developed in this paper. Visually it appears to be a good indicator of diversity in the GAN's outputs. The authors claim that the mean MS-SSIM is a fast and easy-to-compute metric for perceptual variability and collapsing behaviour in a GAN. It is unclear how fast the metric really is, since for each class the MS-SSIM has to be computed $100\cdot 99$ times, once for each ordered pair of images. The authors do not discuss how quickly this can be done.<br />
<br />
=== Experimental Results on GAN Properties ===<br />
<br />
The authors included three analyses which I have termed experiments. Of these, the first concluded that images generated at a higher resolution are more discriminable than images generated at lower resolutions, even when they have been resized to be comparable. This does not seem like a very revolutionary conclusion. For one thing, the space of lower resolution images is contained in the higher resolution space. In essence the high resolution model could generate lower resolution images by setting blocks of 4 pixels to the same intensity. It seems unsurprising then that the lower resolution is less discriminable on average. Another reason for this could be that the high resolution model has more parameters, and is trained on higher resolution data, so it has more information with which to reconstruct class information. Finally the authors give a graph of accuracies to show this property, and the average line appears compelling; however, the standard errors about the lines suggest they may not be significantly different.<br />
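The containment argument can be made concrete: any lower-resolution image embeds into the higher-resolution space by duplicating each pixel into a 2 x 2 block. A toy sketch with a 2 x 2 "image" (all names are illustrative):

```python
def upsample_nearest(img):
    """Embed a low-resolution image in the space of images twice the size
    by duplicating each pixel into a 2 x 2 block of equal intensity."""
    out = []
    for row in img:
        wide = [p for p in row for _ in (0, 1)]  # duplicate each pixel horizontally
        out.append(wide)
        out.append(list(wide))                   # duplicate each row vertically
    return out

low = [[1, 2], [3, 4]]
print(upsample_nearest(low))
# -> [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
```

Applied at 64 x 64, this yields exactly the 128 x 128 images the critique describes, so the higher-resolution generator can always do at least as well as the lower-resolution one.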
<br />
The second experiment is on the interaction between the Inception accuracy and mean MS-SSIM metric. The authors found that they are negatively correlated, and thus that classes that are high quality also tend to be diverse. This is contrary to prevailing wisdom, and since the correlation between them is weak, it appears that it may be only a fluke of the metrics.<br />
<br />
The final experiment is on the effect of class splits on image diversity. The authors found that increasing the number of classes handled by each model reduced the diversity of generated images. They make the claim at the beginning of the paper that they show the number of classes is what makes ImageNet synthesis difficult for GANs. This analysis does point in that direction but is not quite conclusive about the issue. Another analysis they could have included towards showing this is how their Inception accuracy metric and the Inception Score are affected by the number of class splits in their model.<br />
<br />
= Conclusion =<br />
<br />
This paper's main contributions were to introduce a slight variation on previous GAN models, as well as two metrics that can be used to assess the quality of generated images. The modified GAN, dubbed the Auxiliary Classifier GAN, was shown to produce high quality, high resolution samples from ImageNet, but not consistently. The authors could have done more to show why their proposed architecture was an improvement over previous methods.<br />
<br />
The metrics introduced are both fairly straightforward and appear to function as they are intended. This being said, the authors could have used them more consistently throughout the paper (such as using the MS-SSIM to find nearest neighbours instead of the L1 pixel space loss). This paper was accepted to ICML 2017 but rejected by ICLR 2018 due to the incremental nature of the model development and the ad hoc nature of the other analyses presented above.<br />
<br />
= References =<br />
# Odena, A., Olah, C., & Shlens, J. (2016). Conditional image synthesis with auxiliary classifier gans. arXiv preprint [http://proceedings.mlr.press/v70/odena17a.html arXiv:1610.09585].<br />
# Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., ... & Bengio, Y. (2014). Generative adversarial nets. In Advances in neural information processing systems (pp. 2672-2680).<br />
# Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., & Chen, X. (2016). Improved techniques for training gans. In Advances in Neural Information Processing Systems (pp. 2234-2242).<br />
# Karras, T., Aila, T., Laine, S., & Lehtinen, J. (2017). Progressive Growing of GANs for Improved Quality, Stability, and Variation. arXiv preprint [https://arxiv.org/abs/1710.10196 arXiv:1710.10196].<br />
# Mirza, M., & Osindero, S. (2014). Conditional generative adversarial nets. arXiv preprint [https://arxiv.org/abs/1411.1784 arXiv:1411.1784].<br />
# van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O., & Graves, A. (2016). Conditional image generation with pixelcnn decoders. In Advances in Neural Information Processing Systems (pp. 4790-4798).<br />
# Reed, S. E., Akata, Z., Mohan, S., Tenka, S., Schiele, B., & Lee, H. (2016). Learning what and where to draw. In Advances in Neural Information Processing Systems (pp. 217-225).<br />
# Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., ... & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1-9).<br />
# Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in neural information processing systems (pp. 3104-3112).<br />
# Ramsundar, B., Kearnes, S., Riley, P., Webster, D., Konerding, D., & Pande, V. (2015). Massively multitask networks for drug discovery. arXiv preprint [https://arxiv.org/abs/1502.02072 arXiv:1502.02072]<br />
# Odena, A. (2016). Semi-supervised learning with generative adversarial networks. arXiv preprint [https://arxiv.org/abs/1606.01583 arXiv:1606.01583].<br />
# Theis, L., Oord, A. V. D., & Bethge, M. (2015). A note on the evaluation of generative models. arXiv preprint [https://arxiv.org/abs/1511.01844 arXiv:1511.01844].<br />
# Wang, Z., Simoncelli, E. P., & Bovik, A. C. (2003). Multiscale structural similarity for image quality assessment. In Conference Record of the Thirty-Seventh Asilomar Conference on Signals, Systems and Computers (Vol. 2, pp. 1398-1402). IEEE.<br />
# Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., ... & Berg, A. C. (2015). Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211-252.</div>
<hr />
<div>'''Abstract:''' "In this paper we introduce new methods for the improved training of generative adversarial networks (GANs) for image synthesis. We construct a variant of GANs employing label conditioning that results in 128×128 resolution image samples exhibiting global coherence. We expand on previous work for image quality assessment to provide two new analyses for assessing the discriminability and diversity of samples from class-conditional image synthesis models. These analyses demonstrate that high resolution samples provide class information not present in low resolution samples. Across 1000 ImageNet classes, 128×128 samples are more than twice as discriminable as artificially resized 32×32 samples. In addition, 84.7% of the classes have samples exhibiting diversity comparable to real ImageNet data." [[#References | (Odena et al., 2016)]]<br />
<br />
= Introduction =<br />
<br />
=== Motivation ===<br />
The authors introduce a GAN architecture for generating high resolution images from the ImageNet dataset. They show that this architecture makes it possible to split the generation process into many sub-models. They further suggest that GANs have trouble generating globally coherent images, and that this architecture is responsible for the coherence of their samples. They experimentally demonstrate that generating higher resolution images allows the model to encode more class-specific information, making them more visually discriminable than lower resolution images even after they have been resized to the same resolution.<br />
<br />
The second half of the paper introduces metrics for assessing visual discriminability and diversity of synthesized images. The discussion of image diversity in particular is important due to the tendency for GANs to 'collapse' to only produce one image that best fools the discriminator [[#References|(Goodfellow et al., 2014)]].<br />
<br />
=== Previous Work ===<br />
<br />
Of all image synthesis methods (e.g. variational autoencoders, autoregressive models, invertible density estimators), GANs have become one of the most popular and successful due to their flexibility and the ease with which they can be sampled from. A standard GAN framework pits a generative model $G$ against a discriminative adversary $D$. The goal of $G$ is to learn a mapping from a latent space $Z$ to a real space $X$ to produce examples (generally images) indistinguishable from training data. The goal of $D$ is to iteratively learn to predict whether a given input image is from the training set or a synthesized image from $G$. Jointly the models are trained to solve the game-theoretical minimax problem, as defined by [[#References|Goodfellow et al. (2014)]]: $$\underset{G}{\text{min }}\underset{D}{\text{max }}V(G,D)=\mathbb{E}_{X\sim p_{data}(x)}[log(D(X))]+\mathbb{E}_{Z\sim p_{Z}(z)}[log(1-D(G(Z)))]$$<br />
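As an illustrative sketch (not from any GAN library), the value function can be estimated by averaging discriminator outputs over a batch; at the theoretical optimum against a perfect generator, $D$ outputs 1/2 everywhere and $V = -\log 4$:

```python
import math

def gan_value(d_real, d_fake):
    """Monte-Carlo estimate of V(G, D): average log D(x) over real samples
    plus average log(1 - D(G(z))) over generated samples."""
    term_real = sum(math.log(p) for p in d_real) / len(d_real)
    term_fake = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return term_real + term_fake

# A discriminator at the optimum against a perfect generator outputs 1/2
# everywhere, giving V = log(1/4).
v = gan_value([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
print(round(v, 3))  # -> -1.386
```

$D$ is trained to push this value up (confident, correct outputs), while $G$ is trained to push it back down toward the optimum.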
<br />
While this initial framework has clearly demonstrated great potential, other authors have proposed changes to the method to improve it. Many such papers propose changes to the training process [[#References|(Salimans et al., 2016)]][[#References|(Karras et al., 2017)]], which is notoriously difficult for some problems. Others propose changes to the model itself. [[#References|Mirza & Osindero (2014)]] augment the model by supplying the class of observations to both the generator and discriminator to produce class-conditional samples. According to [[STAT946F17/Conditional Image Generation with PixelCNN Decoders|van den Oord et al. (2016)]], conditioning image generation on classes can greatly improve their quality. Other authors have explored using even richer side information in the generation process with good results [[Learning What and Where to Draw|(Reed et al., 2016)]].<br />
<br />
Another model modification relevant to this paper is to force the discriminator network to reconstruct side information by adding an auxiliary network to classify generated (and real) images. The authors make the claim that forcing a model to perform additional tasks is known to improve performance on the original task [[#References|(Szegedy et al., 2014)]][[#References|(Sutskever et al., 2014)]][[#References|(Ramsundar et al., 2016)]]. They further suggest that using pre-trained image classifiers (rather than classifiers trained on both real and generated images) could improve results over and above what is shown in this paper.<br />
<br />
= Contributions =<br />
<br />
The contributions of this paper are in three main areas. First, the authors propose slight changes to previously existing GAN architectures, resulting in a model capable of generating samples of impressive quality. Second, the authors propose two metrics to assess the quality of samples generated from a GAN. Lastly, they present empirical results on GANs which are of some interest.<br />
<br />
== Model ==<br />
<br />
The authors propose an auxiliary classifier GAN (AC-GAN) which is a slight variation on previous architectures. Like [[#References|Mirza & Osindero (2014)]], the generator takes the image class to be generated as input in addition to the latent encoding $Z$. Like [[#References|Odena (2016)]] and [[#References|Salimans et al. (2016)]], the discriminator is trained to predict not only whether an observation is real or fake, but to classify each observation as well. The marginal contribution of this paper is to combine these in one model.<br />
<br />
Formally, let $C\sim p_c$ represent the target class label of each generated observation and $Z$ represent the usual noise vector from the latent space. Then the generator function takes both as arguments to produce image samples: $X_{fake}=G(c,z)$. The discriminator gives a probability distribution over the source $S$ (real or fake) of the image as well as the class label $C$ being generated. $$D(X)=<P(S|X),P(C|X)>$$<br />
<br />
The objective function for the model thus has two parts, one corresponding to the source $L_S$ and the other to the class $L_C$. $D$ is trained to maximize $L_S + L_C$, while $G$ is trained to maximize $L_C-L_S$. Using the notation of Goodfellow et al. (2014), the loss terms are defined as:<br />
$$L_S=\mathbb{E}_{X\sim p_{data}(x)}[log(P(S=real|X))]+\mathbb{E}_{C,Z\sim p_{C,Z}(c,z)}[log(P(S=fake|G(C,Z)))]$$<br />
$$L_C=\mathbb{E}_{X\sim p_{data}(x)}[log(P(C|X))]+\mathbb{E}_{C,Z\sim p_{C,Z}(c,z)}[log(P(C|G(C,Z)))]$$<br />
<br />
Because G accepts both $C$ and $Z$ as arguments, it is able to learn a mapping $Z\rightarrow X$ that is independent of $C$. The authors argue that all class-specific information should be represented by $C$, allowing $Z$ to represent other factors such as pose, background, etc.<br />
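A minimal numeric sketch of the two objective terms (function names and the toy probabilities are illustrative; in the real model these probabilities come from the discriminator's two output heads over minibatches):

```python
import math

def mean_log(ps):
    """Average log-likelihood over a batch of probabilities."""
    return sum(math.log(p) for p in ps) / len(ps)

def ac_gan_losses(p_src_real, p_src_fake, p_cls_real, p_cls_fake):
    """L_S and L_C from the probabilities the discriminator assigns to the
    *correct* source and the *correct* class, on real and generated batches."""
    L_S = mean_log(p_src_real) + mean_log(p_src_fake)  # source term
    L_C = mean_log(p_cls_real) + mean_log(p_cls_fake)  # class term
    return L_S, L_C

L_S, L_C = ac_gan_losses([0.9, 0.8], [0.7, 0.6], [0.95, 0.9], [0.85, 0.8])
# D is trained to maximise L_S + L_C; G is trained to maximise L_C - L_S.
print(round(L_S, 3), round(L_C, 3))  # -> -0.598 -0.271
```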
<br />
Lastly the authors split the generation process into many class-specific submodels. They point out that the structure of their model permits this split, though it should technically be possible for even the standard GAN framework by dividing the training data into groups according to their known class labels.<br />
<br />
The changes above result in a model capable of generating (some) image samples with both high resolution and global coherence.<br />
<br />
== GAN Quality Metrics ==<br />
<br />
A much larger part of the paper is spent on measuring the quality of a GAN's output. As the authors say, evaluating a generative model's quality is difficult due to a large number of probabilistic measures (such as average log-likelihood, Parzen window estimates, and visual fidelity [[#References| (Theis et al., 2015)]]) and "a lack of a perceptually meaningful image similarity metric".<br />
<br />
=== Image Discriminability Metric ===<br />
The authors develop two metrics in this paper to address these shortcomings. The first of these is a discriminability metric, the goal of which is to assess the degree to which generated images are identifiable as the class they are meant to represent. Ideally a team of non-expert humans could handle this, but the difficulty of such an approach makes the need for an automated metric apparent. The metric proposed by the authors is to measure the accuracy of a pre-trained image classifier trained on the pristine training data. For this they select a [https://github.com/openai/improved-gan/ modified version of Inception-v3] [[#References|(Szegedy et al., 2015)]].<br />
<br />
Other metrics already exist for assessing image quality, the most popular of which is probably the Inception Score [[#References| (Salimans et al., 2016)]]. The authors list two main advantages of their approach over the Inception Score. The first is that accuracy figures are easier to interpret than the Inception Score, which is fairly self-evident. The second advantage of using Inception accuracy instead of Inception Score is that Inception accuracy may be calculated for individual classes, giving a better picture of where the model is strong and where it is weak.<br />
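As a sketch, the per-class form of the metric is ordinary classification accuracy grouped by the intended label; the toy labels below stand in for the ImageNet classes and the pre-trained Inception-v3 predictions (all names are illustrative):

```python
def per_class_accuracy(predictions, labels):
    """Inception-accuracy style metric: fraction of generated images whose
    intended class label the pre-trained classifier recovers, per class."""
    totals, hits = {}, {}
    for pred, lab in zip(predictions, labels):
        totals[lab] = totals.get(lab, 0) + 1
        hits[lab] = hits.get(lab, 0) + (pred == lab)
    return {lab: hits[lab] / totals[lab] for lab in totals}

preds  = ["dog", "dog", "cat", "dog"]   # classifier outputs on generated images
labels = ["dog", "cat", "cat", "dog"]   # classes the generator was asked for
print(per_class_accuracy(preds, labels))  # -> {'dog': 1.0, 'cat': 0.5}
```

Reporting the dictionary rather than its average is exactly what lets the metric show where the generator is strong and where it is weak.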
<br />
=== Image Diversity Metric ===<br />
<br />
The second metric proposed by the authors measures the diversity of the generated images. As mentioned above, image diversity is an important quality in a GAN, a common failure they suffer from is 'collapsing', where the generator learns to only output one image that is good at fooling the discriminator [[#References|(Goodfellow et al., 2014)]][[#References|(Salimans et al., 2016)]]. The metric proposed in this section is intended to be complementary to the Inception accuracy, as Inception accuracy would not detect generator collapse.<br />
<br />
For their diversity metric, the authors co-opt an existing metric used to measure the similarity between two images: multi-scale structural similarity (MS-SSIM) [[#References|(Wang et al., 2003)]]. The authors do not go into detail about how the MS-SSIM measure is calculated, except to say that it is one of the more successful ways to predict human perceptual similarity judgements. It can take values on the interval $[0,1]$, and higher values indicate that the two images being compared are perceptually more similar. For images generated from a GAN then, the metric should ideally be low, as diversity is desired.<br />
<br />
The authors' contribution is to use this metric to assess the diversity of a GAN's output. It is a pairwise comparison between two images, so their solution is to compare 100 images (that is, $100\cdot 99$ paired comparisons) from each class and take the mean MS-SSIM score.<br />
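The pairwise-averaging machinery is easy to sketch; since implementing MS-SSIM itself is out of scope here, the snippet below plugs in a deliberately trivial similarity function (not MS-SSIM) to show how a collapsed class scores high and a diverse class scores low:

```python
import itertools

def mean_pairwise_score(images, similarity):
    """Mean similarity over all ordered pairs of distinct images, mirroring
    the paper's 100 * 99 comparisons per class. `similarity` is any
    symmetric image-similarity function (MS-SSIM in the paper)."""
    pairs = list(itertools.permutations(range(len(images)), 2))
    return sum(similarity(images[i], images[j]) for i, j in pairs) / len(pairs)

def toy_similarity(a, b):
    """Placeholder stand-in for MS-SSIM: fraction of matching pixel values."""
    return sum(1.0 for x, y in zip(a, b) if x == y) / len(a)

collapsed = [[1, 1, 0, 0]] * 4  # identical "images", as in generator collapse
diverse = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
print(mean_pairwise_score(collapsed, toy_similarity))  # -> 1.0
print(mean_pairwise_score(diverse, toy_similarity))    # -> 0.5
```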
<br />
The authors make two points about their use of this metric. First, the way they apply the metric is different from how it was originally intended to be used. It is possible that it will not behave as desired because of this. As evidence to the contrary, they state that:<br />
# Visually the metric seems to work. Pairs with high MS-SSIM seem more similar.<br />
# Comparisons are only made between images in the same class, keeping their application of the metric closer to its original use case of measuring the quality of compression algorithms.<br />
# The metric is not saturated. Scores on their generated data vary across the unit interval. If scores were all very close to zero the metric would not be much use.<br />
<br />
The second point they raise is that the mean MS-SSIM metric is not intended as a proxy for entropy of the generator distribution in pixel space. That measure is hard to compute, and in any case is sensitive to trivial changes in the pixels, whereas the true intention of this metric is to measure perceptual similarity.<br />
<br />
== Experimental Results on GAN Properties ==<br />
<br />
[[File:Figure_2_(Bottom).JPG|thumb|500px|right|alt=(Odena et al., 2016) Figure 2: (Left) Inception accuracy (y-axis) of two generators with resolution 128 x 128 (red) and 64 x 64 (blue). Images are resized to the same spatial resolution (x-axis). (Right) Class accuracies from the 128 x 128 AC-GAN at full resolution (x-axis) and downsized to 32 x 32 (y-axis).|(Odena et al., 2016) Figure 2: (Left) Inception accuracy (y-axis) of two generators with resolution 128 x 128 (red) and 64 x 64 (blue). Images are resized to the same spatial resolution (x-axis). (Right) Class accuracies from the 128 x 128 AC-GAN at full resolution (x-axis) and downsized to 32 x 32 (y-axis).]]<br />
<br />
The authors conduct several experiments on their model and proposed metrics. These are summarized in this section.<br />
<br />
=== Higher Resolution Images are more Discriminable ===<br />
<br />
As it is one of the main attractions of this paper, the authors investigate how generating samples at different resolutions affects their discriminability. To achieve this two models are trained, one that generates 64 x 64 resolution images and one that generates 128 x 128 resolution images. These images can be rescaled using bilinear interpolation to make them directly comparable. The authors find that the 128 x 128 AC-GAN (on average) achieves higher discriminability (per the Inception accuracy metric [[#Image Discriminability Metric|introduced above]]) at all resized resolutions. The authors claim that theirs is the first work to investigate how much an image generator is 'making use of its given output resolution'.<br />
<br />
=== Generated Images are both Diverse and Discriminable ===<br />
<br />
A second experiment conducted in the paper aims to investigate how the two metrics they propose interact with each other. They simply calculate the Inception accuracy and mean MS-SSIM score for a batch of generated images from every class in their data and report the correlation between the scores. They find the scores are anti-correlated with $\rho=-0.16$. Because the mean MS-SSIM metric is low for diverse samples, they conclude that accuracy and diversity are actually positively correlated, which is contradictory to the hypothesis that GANs that collapse achieve better sample quality.<br />
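For reference, the correlation they report is an ordinary Pearson coefficient over per-class scores; this self-contained sketch uses made-up numbers (the real inputs are the per-class Inception accuracies and mean MS-SSIM scores, and the paper's actual result is a much weaker $\rho=-0.16$):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy per-class scores with a negative trend: higher accuracy where the
# mean MS-SSIM (i.e. sample similarity) is lower.
accuracy = [0.30, 0.55, 0.40, 0.70, 0.65]
ms_ssim  = [0.45, 0.20, 0.50, 0.25, 0.35]
print(round(pearson(accuracy, ms_ssim), 2))  # -> -0.73
```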
<br />
[[File:Figure_10.JPG|thumb|300px|right|alt=(Odena et al., 2016) Figure 10: Mean MS-SSIM scores for 10 ImageNet classes (y-axis) plotted against the number of classes handled by each sub-model (x-axis).|(Odena et al., 2016) Figure 10: Mean MS-SSIM scores for 10 ImageNet classes (y-axis) plotted against the number of classes handled by each sub-model (x-axis).]]<br />
<br />
=== Effect of Class Splits on Image Sample Quality ===<br />
<br />
The final experiment conducted in the paper investigates how the number of classes a fixed GAN model has to generate impacts the diversity of the generated images. In their main model the authors split the data such that each AC-GAN only had to generate images from 10 classes. Here they experimented with changing the number of classes each sub-model has to generate while holding the architecture of the sub-models fixed. They report the mean MS-SSIM score of the generated images from the original classes (when the model had only 10 to generate) at each split level. Perhaps unsurprisingly, the diversity of the outputs drops as the number of classes the model has to handle increases. Giving the model more parameters to handle the larger (and more diverse) set of classes might possibly eliminate this effect.<br />
<br />
Finally, they state that they were unable to get a regular GAN to converge on the task of generating images from one class per model. This could be due to the difficulty of training GANs and the limited amount of training data available for each class rather than any theoretical property of class splits.<br />
<br />
= Results =<br />
<br />
=== Model ===<br />
The authors apply their AC-GAN model to the ImageNet [[#References|(Russakovsky et al., 2015)]] dataset. The data have 1000 classes which the authors split into 100 groups of 10. An AC-GAN model is trained on each group of 10 to give results reported for the paper. The authors give some examples of the images generated from this setup. They note that these are selected to show the success of the model, not give a balanced representation of how good it is:<br />
<br />
[[File:Figure_1_AC-GAN.JPG|thumb|1000px|center|alt=(Odena et al., 2016) Figure 1: Selected images generated by the AC-GAN model for the ImageNet dataset.|(Odena et al., 2016) Figure 1: Selected images generated by the AC-GAN model for the ImageNet dataset.]]<br />
<br />
The apparent value of the model is its ability to generate high-resolution samples with global coherence. From the images given above one must acknowledge they are impressively realistic. The authors link to a [https://photos.google.com/share/AF1QipPzTToH3wxrKoF8l5nWvSNz7D_oS-KB3YMQ6Ji-4XK3AtgJmlb5QCdqRQLqLSjfkw?key=YnhJb2UyMnZwb01oZ2xhUjBraE9fdWU1VVpNZTVB full set] of images generated from every ImageNet class as well. As one would expect from their acknowledgement that images above are selected to be most impressive, not all samples at the linked page are quite so coherent.<br />
<br />
[[File:Figure_9_(Bottom).JPG|thumb|400px|right|alt=(Odena et al., 2016) Figure 9 (Bottom): Each column is a different class. Each row is generated by a different latent encoding $z$.|(Odena et al., 2016) Figure 9 (Bottom): Each column is a different class. Each row is generated by a different latent encoding $z$.]]<br />
<br />
The authors compare their model with state-of-the-art results from [[#References| Salimans et al. (2016)]] on the CIFAR-10 dataset at a 32 x 32 resolution. To score the two models they use Inception Score instead of log-likelihood, which they claim is too inaccurate to be reported. Their model achieves a score of $8.25 \pm 0.07$ versus the previous state-of-the-art of $8.09 \pm 0.07$.<br />
<br />
[[#References|Odena et al. (2016)]] argue that the class conditional generator allows $G$ to learn a representation of $Z$ independent of $C$ in section 3, and give evidence of the claim later in section 4.5 by showing that images generated with a fixed latent vector $z$ but different class labels $c$ have similar global structure (e.g. orientation of the subject) but the subjects (bird species) vary according to the label. Interestingly, the background (especially in the top row) also varies with the class label. This can possibly be attributed to the bird species coming from different areas, hence a seagull might be expected to have an ocean background. Clearly here the model benefits from the fact that the authors grouped similar classes together. A more interesting analysis might show the same comparison between different classes, such as birds and forklifts, to see how global structure is encoded across them.<br />
<br />
The authors also include a discussion of whether their model is overfitting the training data. Their first test is to find the nearest neighbour in the training set of a generated image by the L1 measure in pixel space and visually compare the two images. This is a fairly naive approach, since the L1 loss in pixel space is extremely unlikely to identify whether two images are perceptually similar. Here would have been a good place to use the MS-SSIM metric to identify the nearest neighbours, since it is intended to measure perceptual similarity. The images they provide from this analysis are below.<br />
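The nearest-neighbour check criticized above can be sketched as follows; the `l1_nearest_neighbour` helper and the toy 2x2 images are hypothetical stand-ins for the ImageNet data the authors actually used.<br />

```python
import numpy as np

def l1_nearest_neighbour(generated, training_set):
    """Return the index of the training image closest to `generated`
    under the L1 distance in raw pixel space."""
    generated = np.asarray(generated, dtype=float).ravel()
    dists = [np.abs(np.asarray(x, dtype=float).ravel() - generated).sum()
             for x in training_set]
    return int(np.argmin(dists))

# Two tiny 2x2 "images": the second differs from the query by one pixel.
query = np.array([[0.0, 1.0], [1.0, 0.0]])
train = [np.ones((2, 2)), np.array([[0.0, 1.0], [1.0, 0.5]])]
print(l1_nearest_neighbour(query, train))  # 1
```

As the critique notes, a small L1 distance in pixel space says little about perceptual similarity, which is why a perceptual metric like MS-SSIM would have been the better choice here.<br />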
<br />
[[File:Figure_8.JPG|thumb|500px|center|alt=(Odena et al., 2016) Figure 8: Images generated by the AC-GAN model (top) and their nearest neighbour (L1 measure in pixel space) in the ImageNet dataset (bottom).|(Odena et al., 2016) Figure 8: Images generated by the AC-GAN model (top) and their nearest neighbour (L1 measure in pixel space) in the ImageNet dataset (bottom).]]<br />
<br />
A second check they make is that interpolating between the latent vectors of two generated images does not produce any discrete transitions or holes in the resulting image sequence; such artifacts would be indicative of overfitting. The images they give as evidence that this is not the case are below.<br />
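The interpolation check amounts to walking a straight line between two latent vectors and rendering each intermediate point. A minimal sketch, assuming the generator would then be applied separately to each row of the returned array:<br />

```python
import numpy as np

def interpolate_latents(z_a, z_b, steps=8):
    """Linearly interpolate between two latent vectors.

    Feeding each row to the generator and inspecting the resulting images
    for discrete jumps is the overfitting check described above.
    """
    alphas = np.linspace(0.0, 1.0, steps)[:, None]
    return (1.0 - alphas) * z_a + alphas * z_b

z_a, z_b = np.zeros(3), np.ones(3)
path = interpolate_latents(z_a, z_b, steps=5)
print(path[:, 0])  # 0, 0.25, 0.5, 0.75, 1 along each coordinate
```

Smooth visual transitions along such a path suggest the generator has learned a continuous manifold rather than memorizing isolated training images.<br />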
<br />
[[File:Figure_9_(Top).JPG|thumb|1000px|center|alt=(Odena et al., 2016) Figure 9 (Top): Latent space interpolations of the AC-GAN model.|(Odena et al., 2016) Figure 9 (Top): Latent space interpolations of the AC-GAN model.]]<br />
<br />
Moreover, another way to study the overfitting problem is to exploit the structure of the model and explore the effect of the latent space on the AC-GAN. The information representation in an AC-GAN includes class information and a class-independent latent representation $z$. Sampling the network with $z$ fixed but altering the class label corresponds to generating samples with the same ‘style’ across multiple classes. In Figure 9 (Bottom) the class changes for each column, while elements of the global structure (e.g. position, layout, background) are preserved.<br />
<br />
[[File:Figure_9_Bottom.JPG|thumb|1000px|center|alt=(Odena et al., 2016) Figure 9 (Bottom): Class-independent information contains global structure about the synthesized image.|(Odena et al., 2016) Figure 9 (Bottom): Class-independent information contains global structure about the synthesized image.]]<br />
=== Image Diversity Metric ===<br />
<br />
Another result the authors report is the performance of their image diversity metric. It is difficult to evaluate quantitatively, but visually the scores do appear to capture the perceptual diversity of the generated class. For example, the 'artichoke' generator appears to have collapsed and has a high score, while the 'promontory' generator seems fairly diverse and has a low score:<br />
<br />
[[File:Figure_3.JPG|thumb|700px|center|alt=(Odena et al., 2016) Figure 3: Mean MS-SSIM scores and sample images from selected generated (top) and real (bottom) ImageNet classes.|(Odena et al., 2016) Figure 3: Mean MS-SSIM scores and sample images from selected generated (top) and real (bottom) ImageNet classes.]]<br />
<br />
= Critique =<br />
=== Model ===<br />
<br />
A major attraction of this paper is the impressive quality of samples generated by the model. GANs often generate samples that are locally plausible but globally not realistic (e.g. a generated image of a dog has fur but the overall shape is not distinguishable). As we have seen in this critique, and as acknowledged by the authors, the most impressive samples are not representative of the model's overall performance.<br />
<br />
The model itself is not a very big advancement of the field. It combines two ideas that are both already prevalent in the research without any other justification than that it seems like a natural thing to do. As [https://openreview.net/forum?id=BkDDM04Ke other reviewers] have noted, investigating how much value the proposed model adds by comparing it with other models that only implement one (or neither) of the changes would have made this paper a slightly more interesting read.<br />
<br />
Another criticism I have about the paper is how they report their results. To compare with [[#References|Salimans et al. (2016)]] they use the Inception Score rather than log-likelihood, which they claim is the standard. Even if their model performed worse by the log-likelihood measure, it ought to have been reported along with the caveat they mention. The models are also evaluated on a different dataset and at a lower spatial resolution than was used for the rest of the paper. By the Inception Score their results are better on average, but the scores are so close that the difference may not be significant. Finally, they did not apply the mean MS-SSIM score they developed in this paper to evaluate their model against [[#References|Salimans et al. (2016)]]. This would have been a natural point to make; instead they generate four samples from each model as their evidence.<br />
<br />
An analysis the authors could have included that was touched upon but not explored in section 4.6 of the paper, and in the [[#Results|Results]] section of this summary, is how the similarity of the classes grouped in each sub-model impact the quality of generated samples. The example I gave above was to compare generated images with the same latent code but very different classes, such as birds and forklifts, to see how the global structure transferred across dissimilar classes.<br />
<br />
A last point to make about the model section is that the authors make some unsupported claims in their discussion of the model's properties. Specifically, they state that their modification to the standard GAN formulation appears to stabilize training, but offer no evidence. Another example is their claim that "AC-GANs learn a representation for $z$ that is independent of class label". They cite [[#References|Kingma et al. (2014)]] as evidence of this. From my review of that paper it does not appear that its authors gave evidence for such a claim.<br />
<br />
=== GAN Quality Metrics ===<br />
<br />
The Inception accuracy metric proposed in this paper has the drawback that it is only applicable in a conditional GAN setting, since the standard GAN framework has no ground-truth labels. It is also true that a pre-trained classifier is only a proxy for determining how much generated images look like the class they are meant to represent, since classifiers are not perfect; consider the phenomenon of adversarial attacks on classifiers to see this point. However, the advantages the authors list, namely that the Inception accuracy can be computed on a per-class basis and is easier to interpret than the Inception Score, do have merit. The metric makes sense for the task the authors use it for.<br />
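As a sketch of the metric under discussion, Inception accuracy reduces to a per-class agreement rate between a pretrained classifier's predictions and the labels the generator was conditioned on; the helper name and toy labels below are illustrative, not the paper's code.<br />

```python
import numpy as np

def inception_accuracy(pred_labels, cond_labels, num_classes):
    """Per-class fraction of generated images whose predicted label
    matches the class label the generator was conditioned on."""
    pred = np.asarray(pred_labels)
    cond = np.asarray(cond_labels)
    per_class = {}
    for c in range(num_classes):
        mask = cond == c
        if mask.any():  # skip classes with no generated samples
            per_class[c] = float((pred[mask] == c).mean())
    return per_class

pred = [0, 0, 1, 2]   # classifier output on 4 generated images
cond = [0, 1, 1, 2]   # labels the generator was conditioned on
print(inception_accuracy(pred, cond, 3))  # {0: 1.0, 1: 0.5, 2: 1.0}
```

The per-class breakdown is precisely what makes this metric easier to interpret than a single aggregate score.<br />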
<br />
The same can be said for the mean MS-SSIM metric developed in this paper. Visually it appears to be a good indicator of diversity in the GAN's outputs. The authors claim that the mean MS-SSIM is a fast and easy-to-compute metric for perceptual variability and collapsing behaviour in a GAN. It is unclear how fast the metric actually is, since for each class the MS-SSIM has to be computed once for each pair of the 100 sampled images, i.e. 100 * 99 / 2 = 4950 times. The authors do not discuss how quickly this can be done.<br />
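The cost in question comes from evaluating the similarity once per unordered pair of sampled images. The sketch below makes that pair count explicit; since MS-SSIM is not in the standard library, a crude mean-absolute-difference similarity stands in for it here.<br />

```python
from itertools import combinations
import numpy as np

def mean_pairwise_similarity(images, sim):
    """Mean of sim(a, b) over all unordered pairs of images.

    For n images this evaluates sim n*(n-1)/2 times (4950 for n = 100);
    in the paper `sim` would be MS-SSIM.
    """
    pairs = list(combinations(images, 2))
    return sum(sim(a, b) for a, b in pairs) / len(pairs), len(pairs)

# Stand-in similarity: 1 minus the mean absolute pixel difference.
sim = lambda a, b: 1.0 - float(np.abs(a - b).mean())
imgs = [np.full((2, 2), v) for v in (0.0, 0.0, 1.0)]
score, n_pairs = mean_pairwise_similarity(imgs, sim)
print(n_pairs)  # 3
print(score)    # one third: mean of 1.0, 0.0, 0.0
```

A high mean similarity among samples from one class is the signature of a collapsed generator, which is exactly what the authors use the metric to detect.<br />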
<br />
=== Experimental Results on GAN Properties ===<br />
<br />
The authors included three analyses which I have termed experiments. Of these, the first concluded that images generated at a higher resolution are more discriminable than images generated at lower resolutions, even when they have been resized to be comparable. This does not seem like a very revolutionary conclusion. For one thing, the space of lower resolution images is contained in the higher resolution space: the high resolution model could in essence generate lower resolution images by setting blocks of 4 pixels to the same intensity. It seems unsurprising then that the lower resolution is less discriminable on average. Another reason could be that the high resolution model has more parameters and is trained on higher resolution data, so it has more information with which to reconstruct class information. Finally, the authors give a graph of accuracies to show this property; the average line appears compelling, but the standard errors about the lines suggest the curves may not be significantly different.<br />
<br />
The second experiment is on the interaction between the Inception accuracy and the mean MS-SSIM metric. The authors found that they are negatively correlated, and thus that classes that are high quality also tend to be diverse. This is contrary to prevailing wisdom, and since the correlation between them is weak, it may be only a fluke of the metrics.<br />
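The reported relationship is an ordinary correlation between two per-class score vectors; with hypothetical scores it can be checked in a couple of lines with NumPy:<br />

```python
import numpy as np

# Hypothetical per-class scores mimicking the paper's finding: classes
# with higher Inception accuracy tend to have lower mean MS-SSIM
# (i.e. they are more diverse). These numbers are illustrative only.
accuracy = np.array([0.9, 0.8, 0.4, 0.3, 0.7])
mean_ms_ssim = np.array([0.1, 0.2, 0.5, 0.6, 0.25])

# Pearson correlation coefficient between the two score vectors.
r = np.corrcoef(accuracy, mean_ms_ssim)[0, 1]
print(r < 0)  # True: negatively correlated for this toy data
```

Whether such a correlation is a robust property or a fluke of the metrics would require many more classes and a significance test, which is the gap the critique points out.<br />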
<br />
The final experiment is on the effect of class splits on image diversity. The authors found that increasing the number of classes handled by each model reduced the diversity of generated images. They make the claim at the beginning of the paper that they show the number of classes is what makes ImageNet synthesis difficult for GANs. This analysis does point in that direction but is not quite conclusive about the issue. Another analysis they could have included towards showing this is how their Inception accuracy metric and the Inception Score are affected by the number of class splits in their model.<br />
<br />
= Conclusion =<br />
<br />
This paper's main contributions were to introduce a slight variation on previous GAN models, as well as two metrics that can be used to assess the quality of generated images. The modified GAN, dubbed the Auxiliary Classifier GAN, was shown to produce high quality, high resolution samples from ImageNet, though not consistently. The authors could have done more to show why their proposed architecture was an improvement over previous methods.<br />
<br />
The metrics introduced are both fairly straightforward and appear to function as they are intended. This being said, the authors could have used them more consistently throughout the paper (such as using the MS-SSIM to find nearest neighbours instead of the L1 pixel space loss). This paper was accepted to ICML 2017 but rejected by ICLR 2018 due to the incremental nature of the model development and the ad hoc nature of the other analyses presented above.<br />
<br />
= References =<br />
# Odena, A., Olah, C., & Shlens, J. (2016). Conditional image synthesis with auxiliary classifier gans. arXiv preprint [http://proceedings.mlr.press/v70/odena17a.html arXiv:1610.09585].<br />
# Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., ... & Bengio, Y. (2014). Generative adversarial nets. In Advances in neural information processing systems (pp. 2672-2680).<br />
# Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., & Chen, X. (2016). Improved techniques for training gans. In Advances in Neural Information Processing Systems (pp. 2234-2242).<br />
# Karras, T., Aila, T., Laine, S., & Lehtinen, J. (2017). Progressive Growing of GANs for Improved Quality, Stability, and Variation. arXiv preprint [https://arxiv.org/abs/1710.10196 arXiv:1710.10196].<br />
# Mirza, M., & Osindero, S. (2014). Conditional generative adversarial nets. arXiv preprint [https://arxiv.org/abs/1411.1784 arXiv:1411.1784].<br />
# van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O., & Graves, A. (2016). Conditional image generation with pixelcnn decoders. In Advances in Neural Information Processing Systems (pp. 4790-4798).<br />
# Reed, S. E., Akata, Z., Mohan, S., Tenka, S., Schiele, B., & Lee, H. (2016). Learning what and where to draw. In Advances in Neural Information Processing Systems (pp. 217-225).<br />
# Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., ... & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1-9).<br />
# Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in neural information processing systems (pp. 3104-3112).<br />
# Ramsundar, B., Kearnes, S., Riley, P., Webster, D., Konerding, D., & Pande, V. (2015). Massively multitask networks for drug discovery. arXiv preprint [https://arxiv.org/abs/1502.02072 arXiv:1502.02072]<br />
# Odena, A. (2016). Semi-supervised learning with generative adversarial networks. arXiv preprint [https://arxiv.org/abs/1606.01583 arXiv:1606.01583].<br />
# Theis, L., Oord, A. V. D., & Bethge, M. (2015). A note on the evaluation of generative models. arXiv preprint [https://arxiv.org/abs/1511.01844 arXiv:1511.01844].<br />
# Wang, Z., Simoncelli, E. P., & Bovik, A. C. (2003, November). Multiscale structural similarity for image quality assessment. In Signals, Systems and Computers, 2004. Conference Record of the Thirty-Seventh Asilomar Conference on (Vol. 2, pp. 1398-1402). IEEE.<br />
# Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., ... & Berg, A. C. (2015). Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211-252.</div>Jdenghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=f17Stat946PaperSignUp&diff=31192f17Stat946PaperSignUp2017-11-23T06:04:58Z<p>Jdeng: /* Paper presentation */</p>
<hr />
<div>=[https://piazza.com/uwaterloo.ca/fall2017/stat946/resources List of Papers]=<br />
<br />
= Record your contributions [https://docs.google.com/spreadsheets/d/1A_0ej3S6ns3bBMwWLS4pwA6zDLz_0Ivwujj-d1Gr9eo/edit?usp=sharing here:]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
[https://docs.google.com/forms/d/e/1FAIpQLSf9DIuUylcR-HCN_ts-uP-10jE4wDuMuzTA4vg3r2KR_uHRWQ/viewform?vc=0&c=0&w=1J Your feedback on presentations]<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|-<br />
|Oct 12 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [http://wikicoursenote.com/wiki/Stat946f15/Sequence_to_sequence_learning_with_neural_networks#Long_Short-Term_Memory_Recurrent_Neural_Network Summary]<br />
|-<br />
|Oct 26 || Sakif Khan || 1|| Improved Variational Inference with Inverse Autoregressive Flow || [https://papers.nips.cc/paper/6581-improved-variational-inference-with-inverse-autoregressive-flow Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F17/_Improved_Variational_Inference_with_Inverse_Autoregressive_Flow Summary]<br />
|-<br />
|Oct 26 || Amir-Hossein Karimi ||2 || Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling || [https://papers.nips.cc/paper/6096-learning-a-probabilistic-latent-space-of-object-shapes-via-3d-generative-adversarial-modeling Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F17/_Learning_a_Probabilistic_Latent_Space_of_Object_Shapes_via_3D_GAN Summary]<br />
|-<br />
|-<br />
|Oct 26 ||Josh Valchar || 3|| Learning What and Where to Draw ||[https://papers.nips.cc/paper/6111-learning-what-and-where-to-draw] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_What_and_Where_to_Draw Summary] <br />
|-<br />
|Oct 31 ||Jimit Majmudar ||4 ||Incremental Boosting Convolutional Neural Network for Facial Action Unit Recognition || [https://papers.nips.cc/paper/6258-incremental-boosting-convolutional-neural-network-for-facial-action-unit-recognition.pdf Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Incremental_Boosting_Convolutional_Neural_Network_for_Facial_Action_Unit_Recognition Summary]<br />
|-<br />
|Oct 31 || ||6 || || ||<br />
|-<br />
|Nov 2 || Prashanth T.K. || 7|| When can Multi-Site Datasets be Pooled for Regression? Hypothesis Tests, l2-consistency and Neuroscience Applications||[http://proceedings.mlr.press/v70/zhou17c.html Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=When_can_Multi-Site_Datasets_be_Pooled_for_Regression%3F_Hypothesis_Tests,_l2-consistency_and_Neuroscience_Applications:_Summary Summary]<br />
|-<br />
|Nov 2 || |||| || ||<br />
|-<br />
|Nov 2 || Haotian Lyu || 9||Learning Important Features Through Propagating Activation Differences|| [http://proceedings.mlr.press/v70/shrikumar17a/shrikumar17a.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F17/_Learning_Important_Features_Through_Propagating_Activation_Differences summary]<br />
|-<br />
|Nov 7 || Dishant Mittal ||10 ||meProp: Sparsified Back Propagation for Accelerated Deep Learning with Reduced Overfitting|| [https://arxiv.org/pdf/1706.06197.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=meProp:_Sparsified_Back_Propagation_for_Accelerated_Deep_Learning_with_Reduced_Overfitting Summary]<br />
|-<br />
|Nov 7 || Omid Rezai|| 11 || Understanding the Effective Receptive Field in Deep Convolutional Neural Networks || [https://papers.nips.cc/paper/6203-understanding-the-effective-receptive-field-in-deep-convolutional-neural-networks.pdf Paper]|| [[Understanding the Effective Receptive Field in Deep Convolutional Neural Networks | Summary]]<br />
|-<br />
|Nov 7 || Rahul Iyer|| 12|| Convolutional Sequence to Sequence Learning || [https://arxiv.org/pdf/1705.03122.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Convolutional_Sequence_to_Sequence_Learning Summary]<br />
|-<br />
|Nov 9 || ShuoShuo Liu ||13 ||Learning the Number of Neurons in Deep Networks|| [http://papers.nips.cc/paper/6372-learning-the-number-of-neurons-in-deep-networks.pdf Paper] || [[Learning the Number of Neurons in Deep Networks | Summary]]<br />
|-<br />
|Nov 9 || Aravind Balakrishnan ||14 || FeUdal Networks for Hierarchical Reinforcement Learning || [http://proceedings.mlr.press/v70/vezhnevets17a/vezhnevets17a.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=FeUdal_Networks_for_Hierarchical_Reinforcement_Learning Summary]<br />
|-<br />
|Nov 9 || Varshanth R Rao ||15 || Cognitive Psychology for Deep Neural Networks: A Shape Bias Case Study || [http://proceedings.mlr.press/v70/ritter17a/ritter17a.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F17/Cognitive_Psychology_For_Deep_Neural_Networks:_A_Shape_Bias_Case_Study Summary]<br />
|-<br />
|Nov 14 || Avinash Prasad ||16 || Coupled GAN|| [https://papers.nips.cc/paper/6544-coupled-generative-adversarial-networks.pdf] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F17/_Coupled_GAN Summary]<br />
|-<br />
|Nov 14 || Nafseer Kadiyaravida ||17 || Dialog-based Language Learning || [https://papers.nips.cc/paper/6264-dialog-based-language-learning.pdf Paper] || [[Dialog-based Language Learning | Summary]]<br />
|-<br />
|Nov 14 || Ruifan Yu ||18 || Imagination-Augmented Agents for Deep Reinforcement Learning || [https://arxiv.org/pdf/1707.06203.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Imagination-Augmented_Agents_for_Deep_Reinforcement_Learning Summary] <br />
|-<br />
|Nov 16 || Hamidreza Shahidi ||19 || Teaching Machines to Describe Images via Natural Language Feedback || [https://arxiv.org/pdf/1706.00130 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F17/_Teaching_Machines_to_Describe_Images_via_Natural_Language_Feedback Summary]<br />
|-<br />
|Nov 16 || Sachin vernekar ||20 || "Why Should I Trust You?": Explaining the Predictions of Any Classifier || [https://arxiv.org/abs/1602.04938 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=%22Why_Should_I_Trust_You%3F%22:_Explaining_the_Predictions_of_Any_Classifier Summary]<br />
|-<br />
|Nov 16 || Yunqing He ||21 || LightRNN: Memory and Computation-Efficient Recurrent Neural Networks || [https://papers.nips.cc/paper/6512-lightrnn-memory-and-computation-efficient-recurrent-neural-networks Paper] || [[LightRNN: Memory and Computation-Efficient Recurrent Neural Networks | Summary]]<br />
|-<br />
||Nov 21 ||Aman Jhunjhunwala ||22 ||Modular Multitask Reinforcement Learning with Policy Sketches ||[http://proceedings.mlr.press/v70/andreas17a/andreas17a.pdf Paper]||[[Modular Multitask Reinforcement Learning with Policy Sketches | Summary]]<br />
|-<br />
|Nov 21 || Michael Honke ||23 || Universal Style Transfer via Feature Transforms|| [https://arxiv.org/pdf/1705.08086.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Universal_Style_Transfer_via_Feature_Transforms Summary]<br />
|-<br />
|Nov 21 || Venkateshwaran Balasubramanian ||24 || Deep Alternative Neural Network: Exploring Contexts as Early as Possible for Action Recognition || [https://papers.nips.cc/paper/6335-deep-alternative-neural-network-exploring-contexts-as-early-as-possible-for-action-recognition.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Alternative_Neural_Network:_Exploring_Contexts_As_Early_As_Possible_For_Action_Recognition Summary]<br />
|-<br />
|Nov 23 || Ashish Gaurav ||25 || Deep Exploration via Bootstrapped DQN || [https://papers.nips.cc/paper/6501-deep-exploration-via-bootstrapped-dqn.pdf Paper] ||[[Deep Exploration via Bootstrapped DQN | Summary]]<br />
|-<br />
|Nov 23 || Ershad Banijamali||26 ||Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks ||[http://proceedings.mlr.press/v70/finn17a/finn17a.pdf Paper] || [[Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks | Summary]]<br />
|-<br />
|Nov 23 || Dylan Spicker || 27|| Unsupervised Domain Adaptation with Residual Transfer Networks || [https://papers.nips.cc/paper/6110-unsupervised-domain-adaptation-with-residual-transfer-networks.pdf Paper] || [[Unsupervised Domain Adaptation with Residual Transfer Networks | Summary]]<br />
|-<br />
|Nov 28 || Mike Rudd || 28 || Conditional Image Synthesis with Auxiliary Classifier GANs || [http://proceedings.mlr.press/v70/odena17a.html Paper] || [[Conditional Image Synthesis with Auxiliary Classifier GANs | Summary]]<br />
|-<br />
|Nov 28 || Shivam Kalra ||29 || Hierarchical Question-Image Co-Attention for Visual Question Answering || [https://arxiv.org/pdf/1606.00061.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Hierarchical_Question-Image_Co-Attention_for_Visual_Question_Answering Summary]<br />
|-<br />
|Nov 28 || Aditya Sriram ||30 ||Conditional Image Generation with PixelCNN Decoders|| [https://papers.nips.cc/paper/6527-conditional-image-generation-with-pixelcnn-decoders.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F17/Conditional_Image_Generation_with_PixelCNN_Decoders Summary] <br />
|-<br />
|Nov 30 || Congcong Zhi ||31 || Dance Dance Convolution || [http://proceedings.mlr.press/v70/donahue17a/donahue17a.pdf Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F17/_Dance_Dance_Convolution Summary]<br />
|-<br />
|Nov 30 || Jian Deng || 32|| Automated Curriculum Learning for Neural Networks || [http://proceedings.mlr.press/v70/graves17a/graves17a.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F17/_Automated_Curriculum_Learning_for_Neural_Networks Summary]<br />
|-<br />
|Nov 30 ||Elaheh Jalalpour || 33|| || ||<br />
|-<br />
|}<br />
|}</div>
<hr />
<div>=[https://piazza.com/uwaterloo.ca/fall2017/stat946/resources List of Papers]=<br />
<br />
= Record your contributions [https://docs.google.com/spreadsheets/d/1A_0ej3S6ns3bBMwWLS4pwA6zDLz_0Ivwujj-d1Gr9eo/edit?usp=sharing here:]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
[https://docs.google.com/forms/d/e/1FAIpQLSf9DIuUylcR-HCN_ts-uP-10jE4wDuMuzTA4vg3r2KR_uHRWQ/viewform?vc=0&c=0&w=1J Your feedback on presentations]<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|-<br />
|Oct 12 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [http://wikicoursenote.com/wiki/Stat946f15/Sequence_to_sequence_learning_with_neural_networks#Long_Short-Term_Memory_Recurrent_Neural_Network Summary]<br />
|-<br />
|Oct 26 || Sakif Khan || 1|| Improved Variational Inference with Inverse Autoregressive Flow || [https://papers.nips.cc/paper/6581-improved-variational-inference-with-inverse-autoregressive-flow Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F17/_Improved_Variational_Inference_with_Inverse_Autoregressive_Flow Summary]<br />
|-<br />
|Oct 26 || Amir-Hossein Karimi ||2 || Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling || [https://papers.nips.cc/paper/6096-learning-a-probabilistic-latent-space-of-object-shapes-via-3d-generative-adversarial-modeling Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F17/_Learning_a_Probabilistic_Latent_Space_of_Object_Shapes_via_3D_GAN Summary]<br />
|-<br />
|-<br />
|Oct 26 ||Josh Valchar || 3|| Learning What and Where to Draw ||[https://papers.nips.cc/paper/6111-learning-what-and-where-to-draw] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_What_and_Where_to_Draw Summary] <br />
|-<br />
|Oct 31 ||Jimit Majmudar ||4 ||Incremental Boosting Convolutional Neural Network for Facial Action Unit Recognition || [https://papers.nips.cc/paper/6258-incremental-boosting-convolutional-neural-network-for-facial-action-unit-recognition.pdf Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Incremental_Boosting_Convolutional_Neural_Network_for_Facial_Action_Unit_Recognition Summary]<br />
|-<br />
|Oct 31 || ||6 || || ||<br />
|-<br />
|Nov 2 || Prashanth T.K. || 7|| When can Multi-Site Datasets be Pooled for Regression? Hypothesis Tests, l2-consistency and Neuroscience Applications||[http://proceedings.mlr.press/v70/zhou17c.html Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=When_can_Multi-Site_Datasets_be_Pooled_for_Regression%3F_Hypothesis_Tests,_l2-consistency_and_Neuroscience_Applications:_Summary Summary]<br />
|-<br />
|Nov 2 || |||| || ||<br />
|-<br />
|Nov 2 || Haotian Lyu || 9||Learning Important Features Through Propagating Activation Differences|| [http://proceedings.mlr.press/v70/shrikumar17a/shrikumar17a.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F17/_Learning_Important_Features_Through_Propagating_Activation_Differences summary]<br />
|-<br />
|Nov 7 || Dishant Mittal ||10 ||meProp: Sparsified Back Propagation for Accelerated Deep Learning with Reduced Overfitting|| [https://arxiv.org/pdf/1706.06197.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=meProp:_Sparsified_Back_Propagation_for_Accelerated_Deep_Learning_with_Reduced_Overfitting Summary]<br />
|-<br />
|Nov 7 || Omid Rezai|| 11 || Understanding the Effective Receptive Field in Deep Convolutional Neural Networks || [https://papers.nips.cc/paper/6203-understanding-the-effective-receptive-field-in-deep-convolutional-neural-networks.pdf Paper]|| [[Understanding the Effective Receptive Field in Deep Convolutional Neural Networks | Summary]]<br />
|-<br />
|Nov 7 || Rahul Iyer|| 12|| Convolutional Sequence to Sequence Learning || [https://arxiv.org/pdf/1705.03122.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Convolutional_Sequence_to_Sequence_Learning Summary]<br />
|-<br />
|Nov 9 || ShuoShuo Liu ||13 ||Learning the Number of Neurons in Deep Networks|| [http://papers.nips.cc/paper/6372-learning-the-number-of-neurons-in-deep-networks.pdf Paper] || [[Learning the Number of Neurons in Deep Networks | Summary]]<br />
|-<br />
|Nov 9 || Aravind Balakrishnan ||14 || FeUdal Networks for Hierarchical Reinforcement Learning || [http://proceedings.mlr.press/v70/vezhnevets17a/vezhnevets17a.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=FeUdal_Networks_for_Hierarchical_Reinforcement_Learning Summary]<br />
|-<br />
|Nov 9 || Varshanth R Rao ||15 || Cognitive Psychology for Deep Neural Networks: A Shape Bias Case Study || [http://proceedings.mlr.press/v70/ritter17a/ritter17a.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F17/Cognitive_Psychology_For_Deep_Neural_Networks:_A_Shape_Bias_Case_Study Summary]<br />
|-<br />
|Nov 14 || Avinash Prasad ||16 || Coupled GAN|| [https://papers.nips.cc/paper/6544-coupled-generative-adversarial-networks.pdf] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F17/_Coupled_GAN Summary]<br />
|-<br />
|Nov 14 || Nafseer Kadiyaravida ||17 || Dialog-based Language Learning || [https://papers.nips.cc/paper/6264-dialog-based-language-learning.pdf Paper] || [[Dialog-based Language Learning | Summary]]<br />
|-<br />
|Nov 14 || Ruifan Yu ||18 || Imagination-Augmented Agents for Deep Reinforcement Learning || [https://arxiv.org/pdf/1707.06203.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Imagination-Augmented_Agents_for_Deep_Reinforcement_Learning Summary] <br />
|-<br />
|Nov 16 || Hamidreza Shahidi ||19 || Teaching Machines to Describe Images via Natural Language Feedback || [https://arxiv.org/pdf/1706.00130 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F17/_Teaching_Machines_to_Describe_Images_via_Natural_Language_Feedback Summary]<br />
|-<br />
|Nov 16 || Sachin vernekar ||20 || "Why Should I Trust You?": Explaining the Predictions of Any Classifier || [https://arxiv.org/abs/1602.04938 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=%22Why_Should_I_Trust_You%3F%22:_Explaining_the_Predictions_of_Any_Classifier Summary]<br />
|-<br />
|Nov 16 || Yunqing He ||21 || LightRNN: Memory and Computation-Efficient Recurrent Neural Networks || [https://papers.nips.cc/paper/6512-lightrnn-memory-and-computation-efficient-recurrent-neural-networks]<br />
|| [[LightRNN: Memory and Computation-Efficient Recurrent Neural Networks | Summary]]<br />
|-<br />
||Nov 21 ||Aman Jhunjhunwala ||22 ||Modular Multitask Reinforcement Learning with Policy Sketches ||[http://proceedings.mlr.press/v70/andreas17a/andreas17a.pdf Paper]||[[Modular Multitask Reinforcement Learning with Policy Sketches | Summary]]<br />
|-<br />
|Nov 21 || Michael Honke ||23 || Universal Style Transfer via Feature Transforms|| [https://arxiv.org/pdf/1705.08086.pdf Paper] ||| [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Universal_Style_Transfer_via_Feature_Transforms Summary]<br />
|-<br />
|Nov 21 || Venkateshwaran Balasubramanian ||24 || Deep Alternative Neural Network: Exploring Contexts as Early as Possible for Action Recognition || [https://papers.nips.cc/paper/6335-deep-alternative-neural-network-exploring-contexts-as-early-as-possible-for-action-recognition.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Alternative_Neural_Network:_Exploring_Contexts_As_Early_As_Possible_For_Action_Recognition Summary]<br />
|-<br />
|Nov 23 || Ashish Gaurav ||25 || Deep Exploration via Bootstrapped DQN || [https://papers.nips.cc/paper/6501-deep-exploration-via-bootstrapped-dqn.pdf Paper] ||[[Deep Exploration via Bootstrapped DQN | Summary]]<br />
|-<br />
|Nov 23 || Ershad Banijamali||26 ||Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks ||[http://proceedings.mlr.press/v70/finn17a/finn17a.pdf Paper] || [[Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks | Summary]]<br />
|-<br />
|Nov 23 || Dylan Spicker || 27|| Unsupervised Domain Adaptation with Residual Transfer Networks || [https://papers.nips.cc/paper/6110-unsupervised-domain-adaptation-with-residual-transfer-networks.pdf Paper] || [[Unsupervised Domain Adaptation with Residual Transfer Networks | Summary]]<br />
|-<br />
|Nov 28 || Mike Rudd || 28 || Conditional Image Synthesis with Auxiliary Classifier GANs || [http://proceedings.mlr.press/v70/odena17a.html Paper] || [[Conditional Image Synthesis with Auxiliary Classifier GANs | Summary]]<br />
|-<br />
|Nov 28 || Shivam Kalra ||29 || Hierarchical Question-Image Co-Attention for Visual Question Answering || [https://arxiv.org/pdf/1606.00061.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Hierarchical_Question-Image_Co-Attention_for_Visual_Question_Answering Summary]<br />
|-<br />
|Nov 28 || Aditya Sriram ||30 ||Conditional Image Generation with PixelCNN Decoders|| [https://papers.nips.cc/paper/6527-conditional-image-generation-with-pixelcnn-decoders.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F17/Conditional_Image_Generation_with_PixelCNN_Decoders Summary] <br />
|-<br />
|Nov 30 || Congcong Zhi ||31 || Dance Dance Convolution || [http://proceedings.mlr.press/v70/donahue17a/donahue17a.pdf Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F17/_Dance_Dance_Convolution Summary]<br />
|-<br />
|Nov 30 || Jian Deng || 32|| Automated Curriculum Learning for Neural Networks || [http://proceedings.mlr.press/v70/graves17a/graves17a.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F17/_Automated_Curriculum_Learning_for_Neural_Networks Summary]<br />
|-<br />
|Nov 30 ||Elaheh Jalalpour || 33|| || ||<br />
|-<br />
|}<br />
|}</div>Jdenghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F17/_Automated_Curriculum_Learning_for_Neural_Networks&diff=31190STAT946F17/ Automated Curriculum Learning for Neural Networks2017-11-23T06:03:48Z<p>Jdeng: </p>
<hr />
<div>= Introduction =<br />
<br />
Humans and animals learn much better when the examples are not randomly presented but organized in a meaningful order which illustrates gradually more concepts, and gradually more complex ones. Here, we formalize such training strategies in the context of machine learning, and call them “curriculum learning”. The idea of training a learning machine with a curriculum can be traced back<br />
at least to Elman (1993). The basic idea is to start small, learn easier aspects of the task or easier sub-tasks, and then gradually increase the difficulty level. <br />
<br />
However, curriculum learning has only recently become prevalent in the field (e.g., Bengio et al., 2009), due in part to the greater complexity of problems now being considered. In particular,<br />
recent work on learning programs with neural networks has relied on curricula to scale up to longer or more complicated tasks (Reed and de Freitas, 2015, Gui et al. 2017). We expect this trend<br />
to continue as the scope of neural networks widens, with deep reinforcement learning providing fertile ground for structured learning.<br />
<br />
One reason for the slow adoption of curriculum learning is that its effectiveness is highly sensitive to the mode of progression through the tasks. One popular approach is to define a hand-chosen performance threshold for advancement to the next task, along with a fixed probability of returning to earlier tasks, to prevent forgetting (Sutskever and Zaremba, 2014). However, as well as introducing hard-to-tune parameters, this poses problems for curricula where appropriate thresholds may be unknown or variable across tasks.<br />
<br />
The main contribution of the paper is a stochastic policy that is continuously adapted to optimize learning progress. Given a progress signal that can be evaluated for each<br />
training example, the authors use a multi-armed bandit algorithm to find a stochastic policy over the tasks that maximizes overall progress. The bandit is non-stationary because the behaviour of the network, and hence the optimal policy, evolves during training. Moreover, variants of prediction gain, as well as a novel class of progress signals referred to as complexity gain, are considered in this paper.<br />
<br />
= Model =<br />
A task is a distribution $D$ over sequences from $\mathcal{X}$. A curriculum is an ensemble of tasks $D_1, \ldots, D_N$, a sample is an example drawn from one of the tasks of the curriculum,<br />
and a syllabus is a time-varying sequence of distributions over tasks. A neural network is considered as a probabilistic model $p_\theta$ over $\mathcal{X}$, whose parameters are denoted $\theta$.<br />
<br />
The expected loss of the network on the $k$-th task is <br />
\[<br />
\mathcal{L}_k( \theta) := \mathbb{E}_{\mathbf{x} \sim D_k} L(\mathbf{x}, \theta),<br />
\]<br />
where $L(\mathbf{x}, \theta):= -\log p_\theta(\mathbf{x})$ is the sample loss on $\mathbf{x}$. <br />
<br />
A curriculum containing $N$ tasks is treated as an $N$-armed bandit, and a syllabus as an adaptive policy which seeks to maximize payoffs from this bandit. In the bandit setting, an agent selects a sequence of arms (actions) $a_1,\ldots, a_T$ over $T$ rounds of play. After each round, the selected arm yields a payoff $r_t$; the payoffs for the other arms are not observed.<br />
===Adversarial Multi-Armed Bandits===<br />
The classic algorithm for adversarial bandits is Exp3, which minimizes regret with respect to the single best arm evaluated over the whole history. However, when training a neural network, an arm is optimal for a portion of the history, then another arm, and so on; the best strategy is then piecewise stationary. The Fixed Share method addresses this issue; combined with an $\epsilon$-greedy strategy, it is known as the Exp3.S algorithm. <br />
<br />
On round $t$, the agent selects an arm stochastically according to a policy $\pi_t$ . This policy is defined by a set of weights $w_t$,<br />
\[<br />
\pi_t(i) := (1-\epsilon)\frac{e^{w_{t,i}}}{\sum_{j=1}^N e^{w_{t,j}}}+\frac{\epsilon}{N} <br />
\]<br />
\[<br />
w_{t,i}:= \log \big[ (1-\alpha_t)\exp\{ w_{t-1,i} +\eta \bar{r}_{t-1,i}^\beta \} +\frac{\alpha_t}{N-1}\sum_{j \ne i} \exp\{ w_{t-1,j} +\eta \bar{r}_{t-1,j}^\beta \} \big]<br />
\]<br />
\[<br />
w_{1,i} = 0, \quad \alpha_t = t^{-1} , \quad \bar{r}_{s,i}^\beta = \frac{r_s \mathbb{I}_{[a_s = i]}+ \beta}{ \pi_s(i) }<br />
\]<br />
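The Exp3.S equations above can be sketched in NumPy as follows. This is a minimal illustration with my own function names (the summary provides no code); the log-sum-exp shift is added only for numerical stability and does not change the update:<br />

```python
import numpy as np

def exp3s_policy(w, eps):
    """pi_t(i): softmax over the weights, mixed with uniform exploration."""
    p = np.exp(w - w.max())          # subtract max for numerical stability
    p /= p.sum()
    return (1.0 - eps) * p + eps / len(w)

def exp3s_update(w, t, arm, reward, pi, eta=1e-3, beta=0.0):
    """One Exp3.S weight update with an importance-sampled reward."""
    N = len(w)
    alpha = 1.0 / t                        # alpha_t = t^{-1}
    r_bar = np.full(N, beta) / pi          # beta / pi_t(i) for the unplayed arms
    r_bar[arm] = (reward + beta) / pi[arm]
    z = w + eta * r_bar
    m = z.max()                            # log-sum-exp style stabilization
    e = np.exp(z - m)
    # log of the fixed-share mixture in the w_{t,i} equation above
    return m + np.log((1.0 - alpha) * e + alpha * (e.sum() - e) / (N - 1))
```

A positively rewarded arm ends up with a larger weight, and hence a larger selection probability on the next round.<br />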
<br />
===Reward Scaling===<br />
The appropriate step size $\eta$ depends on the magnitudes of the rewards, which may not be known a priori. The magnitude of the reward depends strongly on the gain signal used to measure<br />
learning progress, and also varies over time as the model learns. To address this issue, all rewards are adaptively rescaled to $[-1,1]$ by<br />
\[<br />
r_t = \begin{cases}<br />
-1 &\quad \text{if } \hat{r}_t < q^{l}_t\\<br />
1 &\quad \text{if } \hat{r}_t > q^{h}_t\\<br />
\frac{2(\hat{r}_t-q^l_t)}{q^h_t-q^l_t} -1 , &\quad \text{otherwise.}<br />
\end{cases}<br />
\]<br />
where $q^l_t$ and $q^h_t$ are the quantiles of the history of unscaled rewards up to time $t$. The authors chose them to be $20$-th and $80$-th percentiles respectively.<br />
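A sketch of this adaptive rescaling; the $20$th/$80$th percentile choice follows the text, while the degenerate-history guard is my own addition:<br />

```python
import numpy as np

def rescale_reward(r_hat, history, lo=20, hi=80):
    """Map an unscaled reward into [-1, 1] using quantiles of past rewards."""
    q_lo, q_hi = np.percentile(history, [lo, hi])
    if q_hi == q_lo:              # degenerate history, e.g. very early in training
        return 0.0
    if r_hat < q_lo:
        return -1.0
    if r_hat > q_hi:
        return 1.0
    return 2.0 * (r_hat - q_lo) / (q_hi - q_lo) - 1.0
```

Rewards inside the quantile band are mapped linearly; everything outside is clipped to $\pm 1$, which keeps the bandit step size $\eta$ meaningful regardless of the raw gain scale.<br />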
<br />
===Algorithm===<br />
The automated curriculum learning procedure is summarized as follows,<br />
<br />
where $\tau(\mathbf{x})$ is the length of the longest input sequence, which accounts for the processing time of each task being different.<br />
<center><br />
[[File:alg.png | frame | center |Fig 1. Automated Curriculum Learning Algorithm ]]<br />
</center><br />
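The loop in Fig. 1 can be sketched end-to-end on a toy problem. Everything below — the synthetic symbol-prediction tasks, the softmax model, and the hyperparameters — is my own construction for illustration, it omits the $\tau(\mathbf{x})$ normalization, and it uses prediction gain (defined in the next section) as the progress signal:<br />

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 5, 10                       # 5 tasks over a 10-symbol alphabet
theta = np.zeros(K)                # model: a single softmax over symbols
w = np.zeros(N)                    # Exp3.S weights, one per task
history = []                       # unscaled rewards seen so far
eps, eta, lr = 0.05, 1e-3, 0.1

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for t in range(1, 501):
    pi = (1 - eps) * softmax(w) + eps / N      # policy over tasks
    k = rng.choice(N, p=pi)                    # draw a task (bandit arm)
    x = rng.integers(0, 2 * k + 2)             # task k emits symbols 0 .. 2k+1
    p = softmax(theta)
    loss_before = -np.log(p[x])
    theta = theta - lr * (p - np.eye(K)[x])    # one SGD step on -log p_theta(x)
    loss_after = -np.log(softmax(theta)[x])
    gain = loss_before - loss_after            # prediction gain v_PG
    history.append(gain)
    q_lo, q_hi = np.percentile(history, [20, 80])
    r = 0.0 if q_hi == q_lo else np.clip(2 * (gain - q_lo) / (q_hi - q_lo) - 1, -1, 1)
    r_bar = np.zeros(N)
    r_bar[k] = r / pi[k]                       # importance-sampled reward (beta = 0)
    e = np.exp(w + eta * r_bar)
    alpha = 1.0 / t
    w = np.log((1 - alpha) * e + alpha * (e.sum() - e) / (N - 1))
```

Tasks with more symbols yield larger loss decreases for longer, so the syllabus gradually shifts probability mass towards them, mimicking the behaviour reported in the experiments.<br />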
<br />
= Learning Progress Signals =<br />
Learning progress measures the effect of a training sample on the target objective. It is usually computationally prohibitive, or even impossible, to obtain directly. Therefore the authors consider a range of signals derived from two distinct indicators of learning progress: 1) loss-driven, in the sense that they equate progress with a decrease in some loss; or 2) complexity-driven, in that they equate progress with an increase in model complexity.<br />
<br />
=== Loss-driven Progress===<br />
The loss-driven progress signals compare the predictions made by the model before and after training on some sample $\mathbf{x}$. <br />
<br />
'''Prediction gain (PG)'''<br />
Prediction gain is defined as the instantaneous change in loss for a sample $\mathbf{x}$, where $\theta'$ denotes the network parameters after training on $\mathbf{x}$<br />
\[<br />
v_{PG}:=L(\mathbf{x},\theta)-L(\mathbf{x},\theta')<br />
\]<br />
<br />
'''Gradient prediction gain (GPG)'''<br />
This measures the magnitude of the gradient vector, which has been used as an indicator of salience in active learning<br />
\[<br />
v_{GPG}:= || \triangledown L(\mathbf{x},\theta)||^2_2<br />
\]<br />
<br />
'''Self prediction gain (SPG)'''<br />
Self prediction gain samples a second time from the same task to address the bias problem of PG<br />
\[<br />
v_{SPG}:=L(\mathbf{x}',\theta)-L(\mathbf{x}',\theta'), \quad \mathbf{x}' \sim D_k<br />
\]<br />
<br />
<br />
'''Target prediction gain (TPG)'''<br />
\[<br />
v_{TPG}:=L(\mathbf{x}',\theta)-L(\mathbf{x}',\theta'), \quad \mathbf{x}' \sim D_N<br />
\]<br />
Although this estimator might seem like the most accurate measure so far, it tends to suffer from high variance. <br />
<br />
'''Mean prediction gain (MPG)'''<br />
\[<br />
v_{MPG}:=L(\mathbf{x}',\theta)-L(\mathbf{x}',\theta'), \quad \mathbf{x}' \sim D_k, k \sim U_N,<br />
\]<br />
where $U_N$ denotes the uniform distribution on $\{1,\ldots,N\}$.<br />
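The loss-driven gains can be illustrated on a toy categorical model. The model, task distribution, and step size below are my own constructions; $\theta'$ is the parameter vector after one SGD step on $\mathbf{x}$, as in the definitions above:<br />

```python
import numpy as np

rng = np.random.default_rng(1)
K = 4
theta = np.zeros(K)                      # toy model: softmax over K symbols

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss(x, th):                         # L(x, theta) = -log p_theta(x)
    return -np.log(softmax(th)[x])

task = np.array([0.7, 0.1, 0.1, 0.1])    # a toy task distribution D_k
x = rng.choice(K, p=task)
grad = softmax(theta) - np.eye(K)[x]     # gradient of the sample loss
theta_new = theta - 0.5 * grad           # one SGD step on x, giving theta'

v_pg  = loss(x, theta) - loss(x, theta_new)    # prediction gain (biased upward)
x2 = rng.choice(K, p=task)                     # fresh sample from the same task
v_spg = loss(x2, theta) - loss(x2, theta_new)  # self prediction gain
v_gpg = np.sum(grad ** 2)                      # gradient prediction gain
```

Note that $v_{PG}$ is always positive here, since the loss on the trained sample itself necessarily decreases after a sufficiently small step — exactly the bias that SPG's independent resample is meant to remove.<br />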
<br />
===Complexity-driven Progress===<br />
In the stochastic variational inference framework, a variational posterior $P_\phi(\theta)$ over the network weights is maintained during training, with a single weight sample drawn for each training example. An adaptive prior $Q_\psi(\theta)$ is reused for every network weight. In this paper, both $P$ and $Q$ are set as diagonal Gaussian distributions, such that the complexity cost can be computed analytically <br />
\[<br />
KL(P_\phi|| Q_\psi) = \frac{(\mu_\phi-\mu_\psi)^2+\sigma^2_\phi-\sigma^2_\psi}{2\sigma^2_\psi}+\ln\Big( \frac{\sigma_\psi}{\sigma_\phi} \Big)<br />
\]<br />
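A sketch of this closed-form KL for diagonal Gaussians, and of the variational complexity gain defined below as the change in complexity cost; the parameter values and the posterior update here are toy numbers of my own choosing:<br />

```python
import numpy as np

def kl_gauss(mu_p, sig_p, mu_q, sig_q):
    """KL(P || Q) for diagonal Gaussians, summed over all weights."""
    return np.sum(((mu_p - mu_q) ** 2 + sig_p ** 2 - sig_q ** 2) / (2 * sig_q ** 2)
                  + np.log(sig_q / sig_p))

mu_q, sig_q = np.zeros(2), np.ones(2)            # adaptive prior Q_psi
mu_p, sig_p = np.array([0.0, 0.5]), np.ones(2)   # variational posterior P_phi
kl_before = kl_gauss(mu_p, sig_p, mu_q, sig_q)
mu_p2 = mu_p + 0.1                               # hypothetical posterior update
kl_after = kl_gauss(mu_p2, sig_p, mu_q, sig_q)
v_vcg = kl_after - kl_before                     # variational complexity gain
```

Moving the posterior mean away from the prior increases the complexity cost, so this toy update yields a positive $v_{VCG}$.<br />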
<br />
'''Variational complexity gain (VCG)'''<br />
\[v_{VCG}:= KL(P_{\phi'}|| Q_{\psi'}) - KL(P_\phi|| Q_\psi)\]<br />
<br />
'''Gradient variational complexity gain (GVCG)'''<br />
\[<br />
v_{GVCG}:= [\triangledown_{\phi,\psi} KL(P_\phi|| Q_\psi)]^T \triangledown_\phi \mathbb{E}_{\theta \sim P_\phi} L(\mathbf{x},\theta)<br />
\]<br />
<br />
'''L2 gain (L2G)'''<br />
\[<br />
v_{L2G}:=|| \theta' ||^2_2 -|| \theta ||^2_2, \quad v_{GL2G}:=\theta^T [\triangledown_\theta L(\mathbf{x}, \theta)]<br />
\]<br />
<br />
= Experiments =<br />
To test the proposed approach, the authors applied all the gains to three task suites: $n$-gram models, repeat copy, and the bAbI tasks.<br />
<br />
A unidirectional LSTM network architecture was used for all experiments, with cross-entropy as the loss function. The network was optimized by RMSProp with momentum of $0.9$<br />
and a learning rate of $10^{-5}$. The parameters for the Exp3.S algorithm were $\eta = 10^{-3}, \beta = 0, \epsilon = 0.05$. All experiments were repeated $10$ times with different random initializations of the network weights. The two performance benchmarks are 1) a fixed uniform policy over all the tasks and 2) directly training on the target task (where applicable).<br />
<br />
===N-Gram Language Modelling===<br />
The first experiment uses character-level Kneser-Ney $n$-gram models (Kneser and Ney, 1995) on the King James Bible data from the Canterbury corpus, with the maximum depth parameter $n$ ranging<br />
from $0$ to $10$. Note that the amount of linguistic structure increases monotonically with $n$. <br />
<br />
Fig. 2 shows that most of the complexity-based gain signals (L2G, GL2G, GVCG) progress rapidly through the curriculum before focusing strongly on the $10$-gram task. The loss-driven progress<br />
(PG, GPG, SPG, TPG) also tend to move towards higher $n$, although more slowly and with less certainty.<br />
<br />
<center><br />
[[File:ngram.png | frame | center |Fig 2. N-gram policies for different gain signals, truncated at $2 \times 10^8$ steps. All curves are averages over $10$ runs ]]<br />
</center><br />
<br />
===Repeat Copy===<br />
In the repeat copy task (Graves et al., 2014), the network is asked to repeat a random sequence a given number of times. Fig. 3 shows that GVCG solves the target task about twice<br />
as fast as uniform sampling for VI training, and that the PG, SPG and TPG gains are somewhat faster than uniform, especially in the early stages.<br />
<br />
<center><br />
[[File:rcode.png | frame | center |Fig 3. Target task loss (per output), truncated at $1.1 \times 10^9$ steps. All curves are averages over $10$ runs ]]<br />
</center><br />
<br />
===BAbI===<br />
The bAbI dataset (Weston et al., 2015) includes $20$ synthetic question-answering problems designed to test the basic reasoning capabilities of machine learning models. BAbI was not specifically designed for curriculum learning, but some of the tasks follow a natural ordering, such as ‘Two Arg Relations’ and ‘Three Arg Relations’. The authors hoped that an efficient syllabus could be<br />
found for learning the whole set.<br />
<br />
Fig. 4 shows that prediction gain (PG) clearly improved on uniform sampling in terms of both learning speed and number of tasks completed; for SPG the same benefits were visible, though less pronounced. The other gains were either roughly equal to or worse than uniform.<br />
<br />
<center><br />
[[File:babi.png | frame | center |Fig 4. Completion curves for the bAbI curriculum, truncated at $3.5\times10^8$ steps. All curves are averages over $10$ runs ]]<br />
</center><br />
<br />
= Conclusion =<br />
1. The experiments suggest that a stochastic syllabus can yield significant gains over uniform sampling when a suitable progress signal is used.<br />
<br />
2. A uniform random policy over tasks is a surprisingly strong benchmark.<br />
<br />
3. Learning progress is best evaluated on a local, rather than global, basis. In maximum likelihood training, prediction gain is the most consistent signal, while in variational inference<br />
training, gradient variational complexity gain performed best.<br />
<br />
= Critique =<br />
In curriculum learning, a popular approach is to define a hand-chosen performance threshold for advancement to the next task, along with a fixed probability of returning to earlier tasks to prevent forgetting. It would be interesting to compare the performance of this approach with the proposed automated curriculum learning methods.<br />
<br />
= Source =<br />
Graves, Alex, et al. "Automated Curriculum Learning for Neural Networks." arXiv preprint arXiv:1704.03003 (2017).<br />
<br />
Elman, Jeffrey L. "Learning and development in neural networks: The importance of starting small." Cognition 48.1 (1993): 71-99.<br />
<br />
Bengio, Yoshua, et al. "Curriculum learning." Proceedings of the 26th annual international conference on machine learning. ACM, 2009.<br />
<br />
Reed, Scott, and Nando De Freitas. "Neural programmer-interpreters." arXiv preprint arXiv:1511.06279 (2015).<br />
<br />
Gui, Liangke, Tadas Baltrušaitis, and Louis-Philippe Morency. "Curriculum Learning for Facial Expression Recognition." Automatic Face & Gesture Recognition (FG 2017), 2017 12th IEEE International Conference on. IEEE, 2017.<br />
<br />
Zaremba, Wojciech, and Ilya Sutskever. "Learning to execute." arXiv preprint arXiv:1410.4615 (2014).<br />
<br />
Kneser, Reinhard, and Hermann Ney. "Improved backing-off for m-gram language modeling." Acoustics, Speech, and Signal Processing, 1995. ICASSP-95., 1995 International Conference on. Vol. 1. IEEE, 1995.<br />
<br />
Weston, Jason, et al. "Towards ai-complete question answering: A set of prerequisite toy tasks." arXiv preprint arXiv:1502.05698 (2015).<br />
<br />
Graves, Alex, Greg Wayne, and Ivo Danihelka. "Neural turing machines." arXiv preprint arXiv:1410.5401 (2014).</div>
<hr />
<div>= Introduction =<br />
<br />
Humans and animals learn much better when the examples are not randomly presented but organized in a meaningful order which illustrates gradually more concepts, and gradually more complex ones. Here, we formalize such training strategies in the context of machine learning, and call them “curriculum learning”. The idea of training a learning machine with a curriculum can be traced back<br />
at least to Elman (1993). The basic idea is to start small, learn easier aspects of the task or easier sub-tasks, and then gradually increase the difficulty level. <br />
<br />
However curriculum learning has only recently become prevalent in the field (e.g., Bengio et al., 2009), due in part to the greater complexity of problems now being considered. In particular,<br />
recent work on learning programs with neural networks has relied on curricula to scale up to longer or more complicated tasks (Reed and de Freitas, 2015, Gui et al. 2017). We expect this trend<br />
to continue as the scope of neural networks widens, with deep reinforcement learning providing fertile ground for structured learning.<br />
<br />
One reason for the slow adoption of curriculum learning is that it’s effectiveness is highly sensitive to the mode of progression through the tasks. One popular approach is to define a hand-chosen performance threshold for advancement to the next task, along with a fixed probability of returning to earlier tasks, to prevent forgetting (Sutskever and Zaremba, 2014). However, as well as introducing hard-to-tune parameters, this poses problems for curricula where appropriate thresholds may be unknown or variable across tasks.<br />
<br />
The main contribution of the paper is that a stochastic policy, continuously adapted to optimize learning progress is proposed. Given a progress signal that can be evaluated for each<br />
training example, we use a multi-armed bandit algorithm to find a stochastic policy over the tasks that maximizes overall progress. The bandit is non-stationary because the behaviour of the network, and hence the optimal policy, evolves during training. Moreover variants of prediction gain, and also a novel class of progress signals which we refer to as complexity gain are considered in this paper.<br />
<br />
= Model =<br />
A task is a distribution $D$ over sequences from $\mathcal{X}$ . A curriculum is an ensemble of tasks $D_1, \ldots D_N$ , a sample is an example drawn from one of the tasks of the curriculum,<br />
and a syllabus is a time-varying sequence of distributions over tasks. A neural network is considered as a probabilistic model $p_\theta$ over $\mathcal{X}$, whose parameters are denoted $\theta$.<br />
<br />
The expected loss of the network on the $k$-th task is <br />
\[<br />
\mathcal{L}_k( \theta) := \mathbb{E}_{\mathbf{x} \sim D_k} L(\mathbf{x}, \theta),<br />
\]<br />
where $L(\mathbf{x}, \theta):= -\log p_\theta(\mathbf{x})$ is the sample loss on $\mathbf{x}$. <br />
<br />
The paper treats a curriculum containing $N$ tasks as an $N$-armed bandit, and a syllabus as an adaptive policy which seeks to maximize payoffs from this bandit. In the bandit setting, an agent selects a sequence of arms (actions) $a_1,\ldots, a_T$ over $T$ rounds of play. After each round, the selected arm yields a payoff $r_t$; the payoffs for the other arms are not observed.<br />
===Adversarial Multi-Armed Bandits===<br />
The classic algorithm for adversarial bandits is Exp3, which minimizes regret with respect to the single best arm evaluated over the whole history. However, when training a neural network, one arm may be optimal for a portion of the history, then another arm, and so on; the best strategy is then piecewise stationary. The Fixed Share method addresses this by mixing the weights with a uniform distribution at each step (the $\alpha_t$ term below); combined with Exp3's $\epsilon$-greedy exploration, this gives the Exp3.S algorithm. <br />
<br />
On round $t$, the agent selects an arm stochastically according to a policy $\pi_t$ . This policy is defined by a set of weights $w_t$,<br />
\[<br />
\pi_t(i) := (1-\epsilon)\frac{e^{w_{t,i}}}{\sum_{j=1}^N e^{w_{t,j}}}+\frac{\epsilon}{N} <br />
\]<br />
\[<br />
w_{t,i}:= \log \big[ (1-\alpha_t)\exp\{ w_{t-1,i} +\eta \bar{r}_{t-1,i}^\beta \} +\frac{\alpha_t}{N-1}\sum_{j \ne i} \exp\{ w_{t-1,j} +\eta \bar{r}_{t-1,j}^\beta \} \big]<br />
\]<br />
\[<br />
w_{1,i} = 0, \quad \alpha_t = t^{-1} , \quad \bar{r}_{s,i}^\beta = \frac{r_s \mathbb{I}_{[a_s = i]}+ \beta}{ \pi_s(i) }<br />
\]<br />
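To make the update concrete, here is a minimal Python sketch of the Exp3.S policy and weight update. The function names and pure-Python representation are illustrative choices, not from the paper; the arithmetic follows the equations above.<br />

```python
import math

def exp3s_policy(w, eps):
    """pi_t(i): softmax of the weights, mixed with uniform exploration."""
    m = max(w)                                  # subtract max for numerical stability
    exps = [math.exp(v - m) for v in w]
    z = sum(exps)
    n = len(w)
    return [(1 - eps) * e / z + eps / n for e in exps]

def exp3s_update(w, chosen, reward, pi, t, eta, beta):
    """Exp3.S weight update after observing `reward` for arm `chosen` at round t >= 1."""
    n = len(w)
    alpha = 1.0 / t                             # alpha_t = t^{-1}
    # importance-sampled reward estimate: rbar_i = (r * 1[a_t = i] + beta) / pi_t(i)
    rbar = [(reward * (i == chosen) + beta) / pi[i] for i in range(n)]
    e = [math.exp(w[i] + eta * rbar[i]) for i in range(n)]
    total = sum(e)
    # w_{t,i} = log[(1 - alpha) * e_i + alpha/(N-1) * sum_{j != i} e_j]
    return [math.log((1 - alpha) * e[i] + alpha / (n - 1) * (total - e[i]))
            for i in range(n)]
```

Note that the chosen arm's weight grows relative to the others when it yields a positive reward, while the $\alpha_t$ mixing keeps all weights from collapsing, which is what allows the policy to track a piecewise-stationary best arm.<br />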
<br />
===Reward Scaling===<br />
The appropriate step size $\eta$ depends on the magnitudes of the rewards, which may not be known. The magnitude of reward depends strongly on the gain signal used to measure<br />
learning progress, as well as varying over time as the model learns. To address this issue, all rewards are adaptively rescaled to $[-1,1]$ by<br />
\[<br />
r_t = \begin{cases}<br />
-1 &\quad \text{if } \hat{r}_t < q^{l}_t\\<br />
1 &\quad \text{if } \hat{r}_t > q^{h}_t\\<br />
\frac{2(\hat{r}_t-q^l_t)}{q^h_t-q^l_t} -1 , &\quad \text{otherwise.}<br />
\end{cases}<br />
\]<br />
where $q^l_t$ and $q^h_t$ are quantiles of the history of unscaled rewards up to time $t$. The authors chose them to be the $20$th and $80$th percentiles, respectively.<br />
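The rescaling can be sketched as follows; the nearest-rank percentile computation and the handling of a degenerate history are simplifying assumptions.<br />

```python
def rescale_reward(r_hat, history, lo=20, hi=80):
    """Map a raw gain r_hat into [-1, 1] using low/high percentiles of the
    (non-empty) reward history, per the piecewise formula above."""
    s = sorted(history)

    def pct(p):                      # nearest-rank percentile of the history
        return s[int(round(p / 100 * (len(s) - 1)))]

    ql, qh = pct(lo), pct(hi)
    if r_hat < ql:
        return -1.0
    if r_hat > qh:
        return 1.0
    if qh == ql:                     # degenerate history: avoid division by zero
        return 0.0
    return 2.0 * (r_hat - ql) / (qh - ql) - 1.0
```
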
<br />
===Algorithm===<br />
The automated curriculum learning procedure is summarized in Fig. 1,<br />
<br />
where $\tau(\mathbf{x})$ is the length of the longest input sequence, which is used to account for the fact that the processing time of each task may differ.<br />
<center><br />
[[File:alg.png | frame | center |Fig 1. Automated Curriculum Learning Algorithm ]]<br />
</center><br />
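As an illustration of how the pieces fit together, the following self-contained toy sketches the loop of Fig. 1: the "network" is just a scalar mean estimator, the gain signal is prediction gain, and the bandit update is a simplified importance-weighted exponential-weights rule rather than full Exp3.S. All names and constants here are illustrative.<br />

```python
import math
import random

def curriculum_loop(tasks, steps=200, eta=1e-3, eps=0.05, seed=0):
    """Toy automated-curriculum loop: tasks are sampling functions, the model
    is N(mu, 1) fit by SGD on the loss 0.5*(x - mu)^2, and rewards are
    prediction gains rescaled by history percentiles."""
    rng = random.Random(seed)
    mu, lr = 0.0, 0.05              # model parameter and SGD step size
    w = [0.0] * len(tasks)          # bandit weights, one per task
    history = []                    # unscaled rewards, for quantile rescaling
    loss = lambda x, m: 0.5 * (x - m) ** 2
    for _ in range(steps):
        z = sum(math.exp(v) for v in w)
        pi = [(1 - eps) * math.exp(v) / z + eps / len(tasks) for v in w]
        k = rng.choices(range(len(tasks)), weights=pi)[0]
        x = tasks[k](rng)           # draw one sample from the chosen task
        before = loss(x, mu)
        mu += lr * (x - mu)         # one gradient step on the sample loss
        gain = before - loss(x, mu)  # prediction gain v_PG
        history.append(gain)
        s = sorted(history)         # rescale into [-1, 1] via ~20th/80th pct
        ql, qh = s[len(s) // 5], s[(4 * len(s)) // 5]
        r = 0.0 if qh <= ql else max(-1.0, min(1.0, 2 * (gain - ql) / (qh - ql) - 1))
        w[k] += eta * r / pi[k]     # simplified importance-weighted update
    return mu, pi
```
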
<br />
= Learning Progress Signals =<br />
Learning progress measures the effect of a training sample on the target objective. It is usually computationally expensive, or even impossible, to obtain directly. The authors therefore consider a range of signals derived from two distinct indicators of learning progress: 1) loss-driven, in the sense that they equate progress with a decrease in some loss; or 2) complexity-driven, when they equate progress with an increase in model complexity.<br />
<br />
=== Loss-driven Progress===<br />
The loss-driven progress signals compare the predictions made by the model before and after training on some sample $\mathbf{x}$. <br />
<br />
'''Prediction gain (PG)'''<br />
Prediction gain is defined as the instantaneous change in loss for a sample $\mathbf{x}$, where $\theta'$ denotes the parameters after training on $\mathbf{x}$:<br />
\[<br />
v_{PG}:=L(\mathbf{x},\theta)-L(\mathbf{x},\theta')<br />
\]<br />
<br />
'''Gradient prediction gain (GPG)'''<br />
This measures the magnitude of the gradient vector, which has been used as an indicator of salience in the active learning literature:<br />
\[<br />
v_{GPG}:= || \triangledown L(\mathbf{x},\theta)||^2_2<br />
\]<br />
<br />
'''Self prediction gain (SPG)'''<br />
Self prediction gain samples a second example from the same task to address the bias problem of PG:<br />
\[<br />
v_{SPG}:=L(\mathbf{x}',\theta)-L(\mathbf{x}',\theta'), \quad \mathbf{x}' \sim D_k<br />
\]<br />
<br />
<br />
'''Target prediction gain (TPG)'''<br />
Target prediction gain evaluates the loss change on a sample drawn from the target task $D_N$ rather than the current task:<br />
\[<br />
v_{TPG}:=L(\mathbf{x}',\theta)-L(\mathbf{x}',\theta'), \quad \mathbf{x}' \sim D_N<br />
\]<br />
Although this estimator might seem like the most accurate measure so far, it tends to suffer from high variance. <br />
<br />
'''Mean prediction gain (MPG)'''<br />
\[<br />
v_{MPG}:=L(\mathbf{x}',\theta)-L(\mathbf{x}',\theta'), \quad \mathbf{x}' \sim D_k, k \sim U_N,<br />
\]<br />
where $U_N$ denotes the uniform distribution on $\{1,\ldots,N\}$.<br />
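For intuition, the loss-driven gains can be computed exactly for a toy quadratic loss standing in for the network loss; this sketch is purely illustrative and not the paper's implementation.<br />

```python
def loss_driven_gains(theta, x, x_prime, lr=0.1):
    """v_PG, v_GPG, v_SPG for the toy loss L(x, theta) = 0.5 * (x - theta)^2.
    theta' is the parameter after one gradient step on x; x_prime is a
    second sample (from the same task for SPG)."""
    loss = lambda x, th: 0.5 * (x - th) ** 2
    grad = theta - x                       # dL/dtheta at (x, theta)
    theta_new = theta - lr * grad          # theta' after training on x
    v_pg = loss(x, theta) - loss(x, theta_new)               # prediction gain
    v_gpg = grad ** 2                      # gradient prediction gain ||grad||^2
    v_spg = loss(x_prime, theta) - loss(x_prime, theta_new)  # self prediction gain
    return v_pg, v_gpg, v_spg
```

Swapping in a sample from $D_N$ or a uniformly chosen $D_k$ for `x_prime` gives TPG and MPG respectively.<br />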
<br />
===Complexity-driven Progress===<br />
In the stochastic variational inference framework, a variational posterior $P_\phi(\theta)$ over the network weights is maintained during training, with a single weight sample drawn for each training example. An adaptive prior $Q_\psi(\theta)$ is reused for every network weight. In this paper, both $P$ and $Q$ are diagonal Gaussian distributions, so that the complexity cost can be computed analytically <br />
\[<br />
KL(P_\phi|| Q_\psi) = \frac{(\mu_\phi-\mu_\psi)^2+\sigma^2_\phi-\sigma^2_\psi}{2\sigma^2_\psi}+\ln\Big( \frac{\sigma_\psi}{\sigma_\phi} \Big)<br />
\]<br />
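The per-weight complexity cost can be checked with a direct transcription of this formula (scalar case shown; for diagonal Gaussians it is summed over weights):<br />

```python
import math

def kl_diag_gauss(mu_p, sig_p, mu_q, sig_q):
    """KL(P || Q) between scalar Gaussians P = N(mu_p, sig_p^2) and
    Q = N(mu_q, sig_q^2), as in the closed-form expression above."""
    return ((mu_p - mu_q) ** 2 + sig_p ** 2 - sig_q ** 2) / (2 * sig_q ** 2) \
        + math.log(sig_q / sig_p)
```
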
<br />
'''Variational complexity gain (VCG)'''<br />
\[v_{VCG}:= KL(P_{\phi'}|| Q_{\psi'}) - KL(P_\phi|| Q_\psi)\]<br />
<br />
'''Gradient variational complexity gain (GVCG)'''<br />
\[<br />
v_{GVCG}:= [\triangledown_{\phi,\psi} KL(P_\phi|| Q_\psi)]^T \triangledown_\phi \mathbb{E}_{\theta \sim P_\phi} L(\mathbf{x},\theta)<br />
\]<br />
<br />
'''L2 gain (L2G)'''<br />
\[<br />
v_{L2G}:=|| \theta' ||^2_2 -|| \theta ||^2_2, \quad v_{GL2G}:=\theta^T [\triangledown_\theta L(\mathbf{x}, \theta)]<br />
\]<br />
<br />
= Experiments =<br />
To test the proposed approach, the authors applied all the gains to three task suites: $n$-gram models, repeat copy, and the bAbI tasks.<br />
<br />
A unidirectional LSTM architecture was used for all experiments, with cross-entropy as the loss function. The network was optimized by RMSProp with momentum $0.9$<br />
and a learning rate of $10^{-5}$. The parameters for the Exp3.S algorithm were $\eta = 10^{-3}, \beta = 0, \epsilon = 0.05$. All experiments were repeated $10$ times with different random initializations of the network weights. Two performance benchmarks were used: 1) a fixed uniform policy over all the tasks and 2) directly training on the target task (where applicable).<br />
<br />
===N-Gram Language Modelling===<br />
The first experiment uses character-level Kneser-Ney $n$-gram models (Kneser and Ney, 1995) on the King James Bible data from the Canterbury corpus, with the maximum depth parameter $n$ ranging<br />
from $0$ to $10$. Note that the amount of linguistic structure increases monotonically with $n$. <br />
<br />
Fig. 2 shows that most of the complexity-based gain signals (L2G, GL2G, GVCG) progress rapidly through the curriculum before focusing strongly on the $10$-gram task. The loss-driven progress signals<br />
(PG, GPG, SPG, TPG) also tend to move towards higher $n$, although more slowly and with less certainty.<br />
<br />
<center><br />
[[File:ngram.png | frame | center |Fig 2. N-gram policies for different gain signals, truncated at $2 \times 10^8$ steps. All curves are averages over $10$ runs ]]<br />
</center><br />
<br />
===Repeat Copy===<br />
In the repeat copy task, the network is asked to repeat a random sequence a given number of times. Fig. 3 shows that GVCG solves the target task about twice<br />
as fast as uniform sampling for VI training, and that the PG, SPG and TPG gains are somewhat faster than uniform, especially in the early stages.<br />
<br />
<center><br />
[[File:rcode.png | frame | center |Fig 3. Target task loss (per output), truncated at $1.1 \times 10^9$ steps. All curves are averages over $10$ runs ]]<br />
</center><br />
<br />
===BAbI===<br />
The bAbI dataset (Weston et al., 2015) includes $20$ synthetic question-answering problems designed to test the basic reasoning capabilities of machine learning models. BAbI was not specifically designed for curriculum learning, but some of the tasks follow a natural ordering, such as ‘Two Arg Relations’, ‘Three Arg Relations’. The authors hope that an efficient syllabus could be<br />
found for learning the whole set.<br />
<br />
Fig. 4 shows that prediction gain (PG) clearly improved on uniform sampling in terms of both learning speed and number of tasks completed; for SPG the same benefits were visible, though less pronounced. The other gains were either roughly equal to or worse than uniform.<br />
<br />
<br />
= Conclusion =<br />
1. The experiments suggest that a stochastic syllabus can yield significant gains when a suitable progress signal is used.<br />
<br />
2. Uniform random sampling over tasks is a surprisingly strong benchmark.<br />
<br />
3. Learning progress is best evaluated on a local, rather than global, basis. In maximum likelihood training, prediction gain was the most consistent signal, while in variational inference<br />
training, gradient variational complexity gain performed best.<br />
<br />
= Critique =<br />
In curriculum learning, a popular approach is to define a hand-chosen performance threshold for advancement to the next task, along with a fixed probability of returning to earlier tasks to prevent forgetting. It would be interesting to compare the performance of that approach with the proposed automated curriculum learning methods.</div>
<div>= Introduction =<br />
<br />
Humans and animals learn much better when the examples are not randomly presented but organized in a meaningful order which illustrates gradually more concepts, and gradually more complex ones. Here, we formalize such training strategies in the context of machine learning, and call them “curriculum learning”. The idea of training a learning machine with a curriculum can be traced back<br />
at least to Elman (1993). The basic idea is to start small, learn easier aspects of the task or easier sub-tasks, and then gradually increase the difficulty level. <br />
<br />
However curriculum learning has only recently become prevalent in the field (e.g., Bengio et al., 2009), due in part to the greater complexity of problems now being considered. In particular,<br />
recent work on learning programs with neural networks has relied on curricula to scale up to longer or more complicated tasks (Reed and de Freitas, 2015, Gui et al. 2017). We expect this trend<br />
to continue as the scope of neural networks widens, with deep reinforcement learning providing fertile ground for structured learning.<br />
<br />
One reason for the slow adoption of curriculum learning is that it’s effectiveness is highly sensitive to the mode of progression through the tasks. One popular approach is to define a hand-chosen performance threshold for advancement to the next task, along with a fixed probability of returning to earlier tasks, to prevent forgetting (Sutskever and Zaremba, 2014). However, as well as introducing hard-to-tune parameters, this poses problems for curricula where appropriate thresholds may be unknown or variable across tasks.<br />
<br />
The main contribution of the paper is that a stochastic policy, continuously adapted to optimize learning progress is proposed. Given a progress signal that can be evaluated for each<br />
training example, we use a multi-armed bandit algorithm to find a stochastic policy over the tasks that maximizes overall progress. The bandit is non-stationary because the behaviour of the network, and hence the optimal policy, evolves during training. Moreover variants of prediction gain, and also a novel class of progress signals which we refer to as complexity gain are considered in this paper.<br />
<br />
= Model =<br />
A task is a distribution $D$ over sequences from $\mathcal{X}$ . A curriculum is an ensemble of tasks $D_1, \ldots D_N$ , a sample is an example drawn from one of the tasks of the curriculum,<br />
and a syllabus is a time-varying sequence of distributions over tasks. A neural network is considered as a probabilistic model $p_\theta$ over $\mathcal{X}$, whose parameters are denoted $\theta$.<br />
<br />
The expected loss of the network on the $k$-th task is <br />
\[<br />
\mathcal{L}_k( \theta) := \mathbb{E}_{\mathbf{x} \sim D_k} L(\mathbf{x}, \theta),<br />
\]<br />
where $L(\mathbf{x}, \theta):= -\log_{p_\theta}(\mathbf{x})$ is the sample loss on $\mathbf{x}$. <br />
<br />
A curriculum containing $N$ tasks as an $N$-armed bandit, and a syllabus as an adaptive policy which seeks to maximize payoffs from this bandit. In the bandit setting, an agent selects a sequence of arms (actions) $a_1,\ldots, a_T$ over T rounds of play. After each round, the selected arm yields a payoff $r_t$; the payoffs for the other arms are not observed.<br />
===Adversarial Multi-Armed Bandits===<br />
The classic algorithm for adversarial bandits is Exp3, which minimize regret with respect to the single best arm evaluated over the whole history. However, in the case of training neural network, an arm is optimal for a portion of the history, then another arm, and so on; the best strategy is then piecewise stationary. The Fixed Share method addresses this issue by using an $\epsilon$-greedy strategy. It is known as the Exp3.S algorithm. <br />
<br />
On round $t$, the agent selects an arm stochastically according to a policy $\pi_t$ . This policy is defined by a set of weights $w_t$,<br />
\[<br />
\pi_t(i) := (1-\epsilon)\frac{e^{w_{t,i}}}{\sum_{j=1}^N e^{w_{t,j}}}+\frac{\epsilon}{N} <br />
\]<br />
\[<br />
w_{t,i}:= \log \big[ (1-\alpha_t)\exp\{ w_{t-1,i} +\eta \bar{r}_{t-1,i}^\beta \} +\frac{\alpha_t}{N-1}\sum_{j \ne i} \exp\{ w_{t-1,j} +\eta \bar{r}_{t-1,j}^\beta \} \big]<br />
\]<br />
\[<br />
w_{1,i} = 0, \quad \alpha_t = t^{-1} , \quad \bar{r}_{s,i}^\beta = \frac{r_s \mathbb{I}_{[a_s = i]}+ \beta}{ \pi_s(i) }<br />
\]<br />
<br />
===Reward Scaling===<br />
The appropriate step size $\eta$ depends on the magnitudes of the rewards, which may not be known. The magnitude of reward depends strongly on the gain signal used to measure<br />
learning progress, as well as varying over time as the model learns. To address this issue, all rewards are adaptively rescale to $[-1,1]$ by<br />
\[<br />
r_t = \begin{cases}<br />
-1 &\quad \text{if } \hat{r}_t < q^{l}_t\\<br />
1 &\quad \text{if } \hat{r}_t > q^{h}_t\\<br />
\frac{2(\hat{r}_t-q^l_t)}{q^h_t-q^l_t} -1 , &\quad \text{otherwise.}<br />
\end{cases}<br />
\]<br />
where $q^l_t$ and $q^h_t$ are the quantiles of the history of unscaled rewards up to time $t$. The authors chose them to be $20$-th and $80$-th percentiles respectively.<br />
<br />
===Algorithm===<br />
The automated curriculum learning is summarized as followed,<br />
<br />
where $\tau(\mathbf{x})$ is the length of the longest input sequence. Since the processing time of each task may be different.<br />
<center><br />
[[File:alg.png | frame | center |Fig 1. Automated Curriculum Learning Algorithm ]]<br />
</center><br />
<br />
= Learning Progress Signals =<br />
The learning progress is the measurement of effect of a training sample on the target objective. It usually is computationally undesirable or even impossible to obtain. Therefore the authors consider a range of signals derived from two distinct indicators of learning progress: 1) loss-driven, in the sense that they equate progress with a decrease in some loss; or 2) complexity-driven, when they equate progress with an increase in model complexity.<br />
<br />
=== Loss-driven Progress===<br />
The loss-driven progress signals compare the predictions made by the model before and after training on some sample $\mathbf{x}$. <br />
<br />
'''Prediction gain(PG)'''<br />
Prediction gain is defined as the instantaneous change in loss for a sample $\mathbf{x}$<br />
\[<br />
v_{PG}:=L(\mathbf{x},\theta)-L(\mathbf{x},\theta')<br />
\]<br />
<br />
'''Gradient prediction gain (GPG)'''<br />
This measures the magnitude of the gradient vector, which has been used an indicator of salience in the active learning samples<br />
\[<br />
v_{GPG}:= || \triangledown L(\mathbf{x},\theta)||^2_2<br />
\]<br />
<br />
'''Self prediction gain (SPG)'''<br />
Self prediction gain samples a second time from the same task to address the bias problem of PG<br />
\[<br />
v_{SPG}:=L(\mathbf{x}',\theta)-L(\mathbf{x}',\theta'), \quad \mathbf{x}' \sim D_k<br />
\]<br />
<br />
<br />
'''Target prediction gain (TPG)'''<br />
\[<br />
v_{TPG}:=L(\mathbf{x}',\theta)-L(\mathbf{x}',\theta'), \quad \mathbf{x}' \sim D_N<br />
\]<br />
Although this estimator might seem like the most accurate measure so far, it tends to suffer from high variance. <br />
<br />
'''Mean prediction gain (MPG)'''<br />
\[<br />
v_{MPG}:=L(\mathbf{x}',\theta)-L(\mathbf{x}',\theta'), \quad \mathbf{x}' \sim D_k, k \sim U_N,<br />
\]<br />
where $U_N$ denotes the uniform distribution on $\{1,\ldots,N\}$.<br />
<br />
===Complexity-driven Progress===<br />
In the stochastic variational inference framework, a variational posterior $P_\phi(\theta)$ over the network weights is maintained during training, with a single weight sample drawn for each training example. An adaptive prior $Q_\psi(\theta)$ are reused for every network weight. In this paper, both of $P$ and $Q$ are set as a diagonal Gaussian distribution, such the complexity cost can be computed analytically <br />
\[<br />
KL(P_\phi|| Q_\psi) = \frac{(\mu_\phi-\mu_\psi)^2+\sigma^2_\phi-\sigma^2_\psi}{２\sigma^2_\psi}+\ln\Big( \frac{\sigma_\psi}{\sigma_\phi} \Big)<br />
\]<br />
<br />
'''Variational complexity gain (VCG)'''<br />
\[v_{VCG}:= KL(P_{\phi'}|| Q_{\psi'}) - KL(P_\phi|| Q_\psi)\]<br />
<br />
'''Gradient variational complexity gain (GVCG)'''<br />
\[<br />
v_{GVCG}:= [\triangledown_{\phi,\psi} KL(P_\phi|| Q_\psi)]^T \triangledown_\phi \mathbb{E}_{\theta \sim P_\phi} L(\mathbf{x},\theta)<br />
\]<br />
<br />
'''L2 gain (L2G)'''<br />
\[<br />
v_{L2G}:=|| \theta' ||^2_2 -|| \theta ||^2_2, \quad v_{GL2G}:=\theta^T [\triangledown_\theta L(\mathbf{x}, \theta)]<br />
\]<br />
<br />
= Experiments =<br />
To test the proposed approach, the authors applied all the gains to three tasks suites: $n$-gram models, repeat copy, and the bAbI tasks.<br />
<br />
A unidirectional LSTM network architecture was used for all experiments, with cross-entropy as the loss function. The network was optimized by RMSProp with momentum of $0.9$<br />
and a learning rate of $10^{-5}$. The parameters for the Exp3.S algorithm were $\eta = 10^{-3}, \beta = 0, \epsilon = 0.05$. All experiments were repeated $10$ times with different random initializations of the network weights. The two performance benchmarks are 1) a fixed uniform policy over all the tasks and 2) directly training on the target task (where applicable).<br />
<br />
===N-Gram Language Modelling===<br />
The first experiment uses character-level Kneser-Ney $n$-gram models (Kneser and Ney, 1995) on the King James Bible data from the Canterbury corpus, with the maximum depth parameter $n$ ranging<br />
from $0$ to $10$. It should be noted that the amount of linguistic structure increases monotonically with $n$. <br />
<br />
Fig. 2 shows that most of the complexity-based gain signals (L2G, GL2G, GVCG) progress rapidly through the curriculum before focusing strongly on the $10$-gram task. The loss-driven progress signals<br />
(PG, GPG, SPG, TPG) also tend to move towards higher $n$, although more slowly and with less certainty.<br />
<br />
<br />
===Repeat Copy===<br />
In the repeat copy task, the network is asked to repeat a random sequence a given number of times. Fig. 3 shows that GVCG solves the target task about twice<br />
as fast as uniform sampling for VI training, and that the PG, SPG and TPG gains are somewhat faster than uniform, especially in the early stages.<br />
<br />
===bAbI===<br />
The bAbI dataset (Weston et al., 2015) includes $20$ synthetic question-answering problems designed to test the basic reasoning capabilities of machine learning models. bAbI was not specifically designed for curriculum learning, but some of the tasks follow a natural ordering, such as ‘Two Arg Relations’ and ‘Three Arg Relations’. The authors hope that an efficient syllabus can be<br />
found for learning the whole set.<br />
<br />
Fig. 4 shows that prediction gain (PG) clearly improved on uniform sampling in terms of both learning speed and number of tasks completed; for SPG the same benefits were visible, though less pronounced. The other gains were either roughly equal to or worse than uniform.<br />
<br />
<br />
= Conclusion =<br />
1. The experiments suggest that a stochastic syllabus can yield significant gains over uniform sampling, when a suitable progress signal is used.<br />
<br />
2. Uniformly random task selection is a surprisingly strong benchmark.<br />
<br />
3. Learning progress is best evaluated on a local, rather than global, basis. In maximum likelihood training, prediction gain is the most consistent signal, while in variational inference<br />
training, gradient variational complexity gain performed best.<br />
<br />
= Critique =<br />
In curriculum learning, a popular approach is to define a hand-chosen performance threshold for advancement to the next task, along with a fixed probability of returning to earlier tasks, to prevent forgetting. It would be interesting to compare the performance of this approach against the proposed automated curriculum learning methods.</div>Jdenghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:alg.png&diff=31177File:alg.png2017-11-23T05:26:07Z<p>Jdeng: </p>
<hr />
<div></div>Jdenghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F17/_Automated_Curriculum_Learning_for_Neural_Networks&diff=31176STAT946F17/ Automated Curriculum Learning for Neural Networks2017-11-23T05:25:48Z<p>Jdeng: Created page with "= Introduction = Humans and animals learn much better when the examples are not randomly presented but organized in a meaningful order which illustrates gradually more co..."</p>
<hr />
<div>= Introduction =<br />
<br />
Humans and animals learn much better when the examples are not randomly presented but organized in a meaningful order which illustrates gradually more concepts, and gradually more complex ones. Here, we formalize such training strategies in the context of machine learning, and call them “curriculum learning”. The idea of training a learning machine with a curriculum can be traced back<br />
at least to Elman (1993). The basic idea is to start small, learn easier aspects of the task or easier sub-tasks, and then gradually increase the difficulty level. <br />
<br />
However curriculum learning has only recently become prevalent in the field (e.g., Bengio et al., 2009), due in part to the greater complexity of problems now being considered. In particular,<br />
recent work on learning programs with neural networks has relied on curricula to scale up to longer or more complicated tasks (Reed and de Freitas, 2015, Gui et al. 2017). We expect this trend<br />
to continue as the scope of neural networks widens, with deep reinforcement learning providing fertile ground for structured learning.<br />
<br />
One reason for the slow adoption of curriculum learning is that its effectiveness is highly sensitive to the mode of progression through the tasks. One popular approach is to define a hand-chosen performance threshold for advancement to the next task, along with a fixed probability of returning to earlier tasks, to prevent forgetting (Sutskever and Zaremba, 2014). However, as well as introducing hard-to-tune parameters, this poses problems for curricula where appropriate thresholds may be unknown or variable across tasks.<br />
<br />
The main contribution of the paper is a stochastic policy that is continuously adapted to optimize learning progress. Given a progress signal that can be evaluated for each<br />
training example, a multi-armed bandit algorithm is used to find a stochastic policy over the tasks that maximizes overall progress. The bandit is non-stationary because the behaviour of the network, and hence the optimal policy, evolves during training. The paper considers variants of prediction gain, as well as a novel class of progress signals referred to as complexity gain.<br />
<br />
= Model =<br />
A task is a distribution $D$ over sequences from $\mathcal{X}$ . A curriculum is an ensemble of tasks $D_1, \ldots D_N$ , a sample is an example drawn from one of the tasks of the curriculum,<br />
and a syllabus is a time-varying sequence of distributions over tasks. A neural network is considered as a probabilistic model $p_\theta$ over $\mathcal{X}$, whose parameters are denoted $\theta$.<br />
<br />
The expected loss of the network on the $k$-th task is <br />
\[<br />
\mathcal{L}_k( \theta) := \mathbb{E}_{\mathbf{x} \sim D_k} L(\mathbf{x}, \theta),<br />
\]<br />
where $L(\mathbf{x}, \theta):= -\log p_\theta(\mathbf{x})$ is the sample loss on $\mathbf{x}$. <br />
<br />
A curriculum containing $N$ tasks is treated as an $N$-armed bandit, and a syllabus as an adaptive policy which seeks to maximize payoffs from this bandit. In the bandit setting, an agent selects a sequence of arms (actions) $a_1,\ldots, a_T$ over $T$ rounds of play. After each round, the selected arm yields a payoff $r_t$; the payoffs for the other arms are not observed.<br />
===Adversarial Multi-Armed Bandits===<br />
The classic algorithm for adversarial bandits is Exp3, which minimizes regret with respect to the single best arm evaluated over the whole history. However, when training a neural network, one arm is optimal for a portion of the history, then another arm, and so on; the best strategy is then piecewise stationary. The Fixed Share method addresses this issue by using an $\epsilon$-greedy strategy; the resulting algorithm is known as Exp3.S. <br />
<br />
On round $t$, the agent selects an arm stochastically according to a policy $\pi_t$ . This policy is defined by a set of weights $w_t$,<br />
\[<br />
\pi_t(i) := (1-\epsilon)\frac{e^{w_{t,i}}}{\sum_{j=1}^N e^{w_{t,j}}}+\frac{\epsilon}{N} <br />
\]<br />
\[<br />
w_{t,i}:= \log \big[ (1-\alpha_t)\exp\{ w_{t-1,i} +\eta \bar{r}_{t-1,i}^\beta \} +\frac{\alpha_t}{N-1}\sum_{j \ne i} \exp\{ w_{t-1,j} +\eta \bar{r}_{t-1,j}^\beta \} \big]<br />
\]<br />
\[<br />
w_{1,i} = 0, \quad \alpha_t = t^{-1} , \quad \bar{r}_{s,i}^\beta = \frac{r_s \mathbb{I}_{[a_s = i]}+ \beta}{ \pi_s(i) }<br />
\]<br />
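The Exp3.S update above can be sketched in a few lines (the class and method names are ours; $\bar{r}^\beta$ is the importance-weighted reward estimate, which is nonzero only for the pulled arm apart from the $\beta$ bonus):<br />

```python
import numpy as np

class Exp3S:
    """Sketch of the Exp3.S bandit used to pick the next task.

    eta:  step size on the reward estimates
    beta: exploration bonus in the reward estimate
    eps:  uniform smoothing of the policy
    """

    def __init__(self, n_tasks, eta=1e-3, beta=0.0, eps=0.05):
        self.n, self.eta, self.beta, self.eps = n_tasks, eta, beta, eps
        self.w = np.zeros(n_tasks)  # w_{1,i} = 0
        self.t = 1

    def policy(self):
        # pi_t(i) = (1 - eps) * softmax(w_t)_i + eps / N
        p = np.exp(self.w - self.w.max())
        return (1 - self.eps) * p / p.sum() + self.eps / self.n

    def sample_task(self):
        return np.random.choice(self.n, p=self.policy())

    def update(self, arm, reward):
        pi = self.policy()
        # importance-weighted estimate r_bar^beta
        r_hat = np.full(self.n, self.beta) / pi
        r_hat[arm] += reward / pi[arm]
        self.t += 1
        alpha = 1.0 / self.t  # alpha_t = t^{-1}
        e = np.exp(self.w + self.eta * r_hat)
        # Fixed Share mixing of the exponentiated weights
        self.w = np.log((1 - alpha) * e + alpha / (self.n - 1) * (e.sum() - e))
```

The Fixed Share mixing in the last line is what lets the policy track a piecewise-stationary best arm rather than the single best arm over the whole history.<br />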
<br />
===Reward Scaling===<br />
The appropriate step size $\eta$ depends on the magnitudes of the rewards, which may not be known in advance. The magnitude of the reward depends strongly on the gain signal used to measure<br />
learning progress, and also varies over time as the model learns. To address this issue, all rewards are adaptively rescaled to $[-1,1]$ by<br />
\[<br />
r_t = \begin{cases}<br />
-1 &\quad \text{if } \hat{r}_t < q^{l}_t\\<br />
1 &\quad \text{if } \hat{r}_t > q^{h}_t\\<br />
\frac{2(\hat{r}_t-q^l_t)}{q^h_t-q^l_t} -1 , &\quad \text{otherwise.}<br />
\end{cases}<br />
\]<br />
where $q^l_t$ and $q^h_t$ are the quantiles of the history of unscaled rewards up to time $t$. The authors chose them to be $20$-th and $80$-th percentiles respectively.<br />
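A minimal sketch of this quantile-based rescaling (the function name and the explicit history list are ours; the degenerate return of $0$ when the two quantiles coincide is our choice for the first few rewards):<br />

```python
import numpy as np

def rescale_reward(r_raw, history, lo=20, hi=80):
    """Map an unscaled progress signal into [-1, 1] using the 20th/80th
    percentiles of all unscaled rewards observed so far."""
    history.append(r_raw)
    q_l, q_h = np.percentile(history, [lo, hi])
    if r_raw < q_l:
        return -1.0
    if r_raw > q_h:
        return 1.0
    if q_h == q_l:  # degenerate history, e.g. the first reward seen
        return 0.0
    return 2.0 * (r_raw - q_l) / (q_h - q_l) - 1.0
```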
<br />
===Algorithm===<br />
The automated curriculum learning procedure is summarized below, where $\tau(\mathbf{x})$ is the length of the longest input sequence, which accounts for the differing processing times of the tasks.<br />
<center><br />
[[File:alg.png | frame | center |Fig 1. Automated Curriculum Learning Algorithm ]]<br />
</center></div>Jdenghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Model-Agnostic_Meta-Learning_for_Fast_Adaptation_of_Deep_Networks&diff=31024Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks2017-11-21T19:15:55Z<p>Jdeng: </p>
<hr />
<div>='''Introduction & Background'''=<br />
Learning quickly is a hallmark of human intelligence, whether it involves recognizing objects from a few examples or quickly learning new skills after just minutes of experience. In this work, we propose a meta-learning algorithm that is general and model-agnostic, in the sense that it can be directly applied to any learning problem and model that is trained with a gradient descent procedure. Our focus is on deep neural network models, but we illustrate how our approach can easily handle different architectures and different problem settings, including classification, regression, and policy gradient reinforcement learning, with minimal modification. Unlike prior meta-learning methods that learn an update function or learning rule (Schmidhuber, 1987; Bengio et al., 1992; Andrychowicz et al., 2016; Ravi & Larochelle, 2017), this algorithm does not expand the number of learned parameters nor place constraints on the model architecture (e.g. by requiring a recurrent model (Santoro et al., 2016) or a Siamese network (Koch, 2015)), and it can be readily combined with fully connected, convolutional, or recurrent neural networks. It can also be used with a variety of loss functions, including differentiable supervised losses and nondifferentiable reinforcement learning objectives.<br />
<br />
The primary contribution of this work is a simple model and task-agnostic algorithm for meta-learning that trains a model’s parameters such that a small number of gradient updates will lead to fast learning on a new task. The paper shows the effectiveness of the proposed algorithm in different domains, including classification, regression, and reinforcement learning problems.<br />
<br />
='''Model-Agnostic Meta Learning (MAML)'''=<br />
The goal of the proposed model is rapid adaptation. This setting is usually formalized as few-shot learning.<br />
<br />
=== Problem set-up ===<br />
The goal of few-shot meta-learning is to train a model that can quickly adapt to a new task using only a few data points and training iterations. To do so, the model is trained during a meta-learning phase on a set of tasks, such that it can then be adapted to a new task using only a small number of parameter updates. In effect, the meta-learning problem treats entire tasks as training examples. <br />
<br />
Let us consider a model denoted by $f$, that maps the observation $\mathbf{x}$ into the output variable $a$. During meta-learning, the model is trained to be able to adapt to a large or infinite number of tasks. <br />
<br />
Let us consider a generic notion of a task. Each task $\mathcal{T} = \{\mathcal{L}(\mathbf{x}_1,a_1,\mathbf{x}_2,a_2,\ldots, \mathbf{x}_H,a_H), q(\mathbf{x}_1),q(\mathbf{x}_{t+1}|\mathbf{x}_t,a_t),H \}$ consists of a loss function $\mathcal{L}$, a distribution over initial observations $q(\mathbf{x}_1)$, a transition distribution $q(\mathbf{x}_{t+1}|\mathbf{x}_t,a_t)$, and an episode length $H$. In i.i.d. supervised learning problems,<br />
the length is $H = 1$. The model may generate samples of length $H$ by choosing an output $a_t$ at each time $t$. The loss $\mathcal{L}$ provides task-specific feedback, which is defined based on the nature of the problem. <br />
<br />
A distribution over tasks is denoted by $p(\mathcal{T})$. In the K-shot learning setting, the model is trained to learn a new task $\mathcal{T}_i$ drawn from $p(\mathcal{T})$ from only K samples drawn from $q_i$ and feedback $\mathcal{L}_{\mathcal{T}_i}$ generated by $\mathcal{T}_i$. During meta-training, a task $\mathcal{T}_i$ is sampled from $p(\mathcal{T})$, the model is trained with K samples and feedback from the corresponding loss $\mathcal{L}_{\mathcal{T}_i}$, and then tested on new samples from $\mathcal{T}_i$. The model $f$ is then improved by considering how the test error on new data from $q_i$ changes with respect to the parameters. In effect, the test error on sampled tasks $\mathcal{T}_i$ serves as the training error of the meta-learning process. At the end of meta-training, new tasks are sampled from $p(\mathcal{T})$, and meta-performance is measured by the model’s performance after learning from K samples.<br />
<br />
=== MAML Algorithm ===<br />
[[File:model.png|200px|right|thumb|Figure 1: Diagram of the MAML algorithm]]<br />
The paper proposes a method that can learn the parameters of any standard model via meta-learning in such a way as to prepare that model for fast adaptation. The intuition behind this approach is that some internal representations are more transferable than others. Since the model will be fine-tuned using a gradient-based learning rule on a new task, the aim is to learn a model in such a way that this gradient-based learning rule can make rapid progress on new tasks drawn from $p(\mathcal{T})$, without overfitting. In effect, the aim is to find model parameters that are sensitive to changes in the task, such that small changes in the parameters will produce large improvements on the loss function of any task drawn from $p(\mathcal{T})$ (see Fig 1).<br />
<br />
Note that no assumption is made about the form of the model. The only assumptions are that it is parameterized by a vector of parameters $\theta$, and that the loss is smooth enough in $\theta$ that the parameters can be learned using gradient-based techniques. Formally, assume the model is denoted by $f_{\theta}$. When adapting<br />
to a new task $\mathcal{T}_i $, the model’s parameters $\theta$ become $\theta_i'$. In this method, the updated parameter vector $\theta_i'$ is computed using one or more gradient descent updates on task $\mathcal{T}_i $. For example, when using one gradient update:<br />
<br />
$$<br />
\theta_i' = \theta - \alpha \nabla_{\theta} \mathcal{L}_{\mathcal{T}_i}(f_{\theta}).<br />
$$<br />
<br />
Here $\alpha$ is the learning rate for each task, treated as a hyperparameter. For the rest of the paper, a single update step is considered for simplicity. <br />
<br />
The model parameters are trained by optimizing for the performance<br />
of $f_{\theta_i'}$ with respect to $\theta$ across tasks sampled from $p(\mathcal{T})$. More concretely, the meta-objective is as follows: <br />
<br />
$$<br />
\min_{\theta} \sum \limits_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i} (f_{\theta_i'}) = \sum \limits_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i} (f_{\theta - \alpha \nabla_{\theta} \mathcal{L}_{\mathcal{T}_i}(f_{\theta})})<br />
$$<br />
<br />
Note that the meta-optimization is performed over the model parameters $\theta$, whereas the objective is computed using the updated model parameters $\theta'$. The model aims to optimize the model parameters such that one or a small number of gradient step on a new task will produce maximally effective behavior on that task. <br />
<br />
Therefore the meta-learning across the tasks is performed via stochastic gradient descent (SGD), such that the model parameters $\theta$ are updated as:<br />
<br />
$$<br />
\theta \gets \theta - \beta \nabla_{\theta } \sum \limits_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i} (f_{\theta_i'})<br />
$$<br />
where $\beta$ is the meta step size. Outline of the algorithm is shown in Algorithm 1. <br />
[[File:ershad_alg1.png|500px|center|thumb]]<br />
<br />
The MAML meta-gradient update involves a gradient through a gradient. Computationally, this requires an additional backward pass through f to compute Hessian-vector products, which is supported by standard deep learning libraries such as TensorFlow.<br />
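To make the two-level update concrete, here is a deliberately tiny MAML loop on a one-parameter toy problem (our own construction, not the paper's experiments; the name `maml_1d` and the task setup are ours). Each task with slope $a$ asks the model $f_\theta(x) = \theta x$ to match $y = a x$ under MSE; the gradients are analytic, so the gradient-through-a-gradient in the meta-update reduces to the scalar factor $d\theta_i'/d\theta$:<br />

```python
import numpy as np

def maml_1d(task_slopes, alpha=0.1, beta=0.1, meta_steps=1000, k=10, seed=0):
    """Toy one-parameter MAML: meta-learn an initialization theta from
    which one inner gradient step adapts well to any task slope a."""
    rng = np.random.default_rng(seed)
    theta = 0.0
    for _ in range(meta_steps):
        meta_grad = 0.0
        for a in task_slopes:
            # inner (task) update on K support samples
            x = rng.uniform(-1.0, 1.0, k)
            m = np.mean(x ** 2)
            inner_grad = 2.0 * (theta - a) * m        # grad of E[(theta*x - a*x)^2]
            theta_i = theta - alpha * inner_grad      # adapted parameter theta_i'
            # outer (meta) gradient on K fresh query samples
            xq = rng.uniform(-1.0, 1.0, k)
            mq = np.mean(xq ** 2)
            d_theta_i = 1.0 - 2.0 * alpha * m         # d(theta_i') / d(theta)
            meta_grad += 2.0 * (theta_i - a) * mq * d_theta_i
        theta -= beta * meta_grad / len(task_slopes)  # SGD on the meta-objective
    return theta
```

With a symmetric pair of task slopes, the meta-parameter settles near their midpoint, the initialization from which one inner gradient step moves closest to either task.<br />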
<br />
='''Different Types of MAML'''=<br />
In this section the MAML algorithm is discussed for different supervised learning and reinforcement learning tasks. The differences between each of these tasks are in their loss function and the way the data is generated. In general, this method does not require additional model parameters nor using any additional meta-learner to learn the update of parameters. Compared to other approaches that tend to “learn to compare new examples in a learned metric space using e.g. Siamese networks or recurrence with attention mechanisms”, the proposed method can be generalized to any other problems including classification, regression and reinforcement learning. <br />
<br />
=== Supervised Regression and Classification ===<br />
Few-shot learning is well-studied in this field. For these two types of tasks the horizon $H$ is equal to 1, since the data points are generated i.i.d. <br />
<br />
Although any common classification and regression objectives can be used as the loss, the paper uses the following losses for these two tasks. <br />
<br />
Regression : For regression we use the mean-square error (MSE):<br />
<br />
$$<br />
\mathcal{L}_{\mathcal{T}_i} (f_{\theta}) = \sum \limits_{\mathbf{x}^{(j)}, \mathbf{y}^{(j)} \sim \mathcal{T}_i} \parallel f_{\theta} (\mathbf{x}^{(j)}) - \mathbf{y}^{(j)}\parallel_2^2<br />
$$<br />
<br />
where $\mathbf{x}^{(j)}$ and $\mathbf{y}^{(j)}$ are the input/output pair sampled from task $\mathcal{T}_i$. In K-shot regression tasks, K input/output pairs are provided for learning each task. <br />
<br />
Classification: For classification we use the cross entropy loss:<br />
<br />
$$<br />
\mathcal{L}_{\mathcal{T}_i} (f_{\theta}) = \sum \limits_{\mathbf{x}^{(j)}, \mathbf{y}^{(j)} \sim \mathcal{T}_i} \mathbf{y}^{(j)} \log f_{\theta}(\mathbf{x}^{(j)}) + (1-\mathbf{y}^{(j)}) \log (1-f_{\theta}(\mathbf{x}^{(j)}))<br />
$$<br />
<br />
According to the conventional terminology, K-shot classification tasks use K input/output pairs from each class, for a total of $NK$ data points for N-way classification.<br />
<br />
Given a distribution over tasks, these loss functions can be directly inserted into the equations in the previous section to perform meta-learning, as detailed in Algorithm 2.<br />
[[File:ershad_alg2.png|500px|center|thumb]]<br />
<br />
=== Reinforcement Learning ===<br />
In reinforcement learning (RL), the goal of few-shot meta learning is to enable an agent to quickly acquire a policy for a new test task using only a small amount of experience in the test setting. A new task might involve achieving a new goal or succeeding on a previously trained goal in a new environment. For example an agent may learn how to navigate mazes very quickly so that, when faced with a new maze, it can determine how to reliably reach the exit with only a few samples.<br />
<br />
Each RL task $\mathcal{T}_i$ contains an initial state distribution $q_i(\mathbf{x}_1)$ and a transition distribution $q_i(\mathbf{x}_{t+1}|\mathbf{x}_t,a_t)$ , and the loss $\mathcal{L}_{\mathcal{T}_i}$ corresponds to the (negative) reward function $R$. The entire task is therefore a Markov decision process (MDP) with horizon H, where the learner is allowed to query a limited number of sample trajectories for few-shot learning. Any aspect of the MDP may change across tasks in $p(\mathcal{T})$. The model being learned, $f_{\theta}$, is a policy that maps from states $\mathbf{x}_t$ to a distribution over actions $a_t$ at each timestep $t \in \{1,...,H\}$. The loss for task $\mathcal{T}_i$ and model $f_{\theta}$ takes the form<br />
<br />
$$<br />
\mathcal{L}_{\mathcal{T}_i}(f_{\theta}) = -\mathbb{E}_{\mathbf{x}_t,a_t \sim f_{\theta},q_{\mathcal{T}_i}} \big [\sum_{t=1}^H R_i(\mathbf{x}_t,a_t)\big ]<br />
$$<br />
<br />
<br />
In K-shot reinforcement learning, K rollouts from $f_{\theta}$ and task $\mathcal{T}_i$, $(\mathbf{x}_1,a_1,...,\mathbf{x}_H)$, and the corresponding rewards $ R(\mathbf{x}_t,a_t)$, may be used for adaptation on a new task $\mathcal{T}_i$.<br />
<br />
Since the expected reward is generally not differentiable due to unknown dynamics, policy gradient methods are used to estimate the gradient both for the model gradient update(s) and the meta-optimization. Since policy gradients are an on-policy algorithm, each additional gradient step during the adaptation of $f_{\theta}$ requires new samples from the current policy $f_{\theta_i'}$. The algorithm is detailed in Algorithm 3, which has the same structure as Algorithm 2 but which also requires sampling trajectories from the environment corresponding to task $\mathcal{T}_i$ in steps 5 and 8.<br />
[[File:ershad_alg3.png|500px|center|thumb]]<br />
<br />
='''Experiments'''=<br />
<br />
=== Regression ===<br />
We start with a simple regression problem that illustrates the basic principles of MAML. Each task involves regressing from the input to the output of a sine wave, where the amplitude and phase of the sinusoid are varied between tasks. Thus, $p(\mathcal{T})$ is continuous, and the input and output both have a dimensionality of 1. During training and testing, datapoints are sampled uniformly. The loss is the mean-squared error between the prediction and true value. The regressor is a neural network model with 2 hidden layers of size 40 with ReLU nonlinearities. When training with MAML, we use one gradient update with K = 10 examples with a fixed step size 0.01, and use Adam as the meta-optimizer [2]. The baselines are likewise trained with Adam. To evaluate performance, we fine-tune a single meta-learned model on varying numbers of K examples, and compare performance to two baselines: (a) pre-training on all of the tasks, which entails training a network to regress to random sinusoid functions and then, at test-time, fine-tuning with gradient descent on the K provided points, using an automatically tuned step size, and (b) an oracle which receives the true amplitude and phase as input.<br />
<br />
We evaluate performance by fine-tuning the model learned by MAML and the pre-trained model on $K = \{ 5,10,20 \}$ datapoints. During fine-tuning, each gradient step is computed using the same $K$ datapoints. Results are shown in Fig 2.<br />
<br />
<br />
[[File:ershad_results1.png|500px|center|thumb|Figure 2: Few-shot adaptation for the simple regression task. Left: Note that MAML is able to estimate parts of the curve where there are no datapoints, indicating that the model has learned about the periodic structure of sine waves. Right: Fine-tuning of a model pre-trained on the same distribution of tasks without MAML, with a tuned step size. Due to the often contradictory outputs on the pre-training tasks, this model is unable to recover a suitable representation and fails to extrapolate from the small number of test-time samples.]]<br />
<br />
=== Classification ===<br />
<br />
For classification evaluation, Omniglot and MiniImagenet datasets are used. The Omniglot dataset consists of 20 instances of 1623 characters from 50 different alphabets. <br />
<br />
The experiment involves fast learning of N-way classification with 1 or 5 shots. The problem of N-way classification is set up as follows: select N unseen classes, provide the model with K different instances of each of the N classes, and evaluate the model’s ability to classify new instances within the N classes. For Omniglot, 1200 characters are selected randomly for training, irrespective of alphabet, and the remaining are used for testing. The Omniglot dataset is augmented with rotations by multiples of 90 degrees.<br />
<br />
The model follows the same architecture as the embedding function used by Vinyals et al. [7]: 4 modules with 3-by-3 convolutions and 64 filters, each followed by batch normalization, a ReLU nonlinearity, and 2-by-2 max-pooling. The Omniglot images are downsampled to 28-by-28, so the dimensionality of the last hidden layer is 64. The last layer is fed into a softmax. For Omniglot, strided convolutions are used instead of max-pooling. For MiniImagenet, 32 filters per layer are used to reduce overfitting. In order to provide a fair comparison against memory-augmented neural networks [3] and to test the flexibility of MAML, results for a non-convolutional network are also provided. <br />
<br />
[[File:ershad_results2.png|500px|center|thumb|Table 1: Few-shot classification on held-out Omniglot characters (top) and the MiniImagenet test set (bottom). MAML achieves results that are comparable to or outperform state-of-the-art convolutional and recurrent models. Siamese nets, matching nets, and the memory module approaches are all specific to classification, and are not directly applicable to regression or RL scenarios. The $\pm$ shows 95% confidence intervals over tasks. ]]<br />
<br />
=== Reinforcement Learning ===<br />
Several simulated continuous control environments are used for RL evaluation. In all domains, the MAML model is a neural network policy with two hidden layers of size 100 and ReLU activations. The gradient updates are computed using vanilla policy gradient, and trust-region policy optimization (TRPO) is used as the meta-optimizer.<br />
<br />
In order to avoid computing third derivatives, finite differences are used to<br />
approximate the Hessian-vector products for TRPO. For both learning and meta-learning updates, the standard linear feature baseline proposed by [4] is used, which is fitted separately at each iteration for each sampled task in the batch. <br />
<br />
Three baseline models for the comparison are: <br />
(a) pretraining one policy on all of the tasks and then fine-tuning<br />
(b) training a policy from randomly initialized weights<br />
(c) an oracle policy which receives the parameters of the task as input, which for the tasks below corresponds to a goal position, goal direction, or goal velocity for the agent. <br />
<br />
The baseline models of (a) and (b) are fine-tuned with gradient descent with a manually tuned step size.<br />
<br />
2D Navigation: In the first meta-RL experiment, the authors study a set of tasks where a point agent must move to different goal positions in 2D, randomly chosen for each task within a unit square. The observation is the current 2D position, and actions correspond to velocity commands clipped to the range $[-0.1, 0.1]$. The reward is the negative squared distance to the goal, and episodes terminate when the agent is within $0.01$ of the goal or at the horizon of $H = 100$. The policy was trained with MAML <br />
to maximize performance after 1 policy gradient update using 20 trajectories. The authors compare adaptation to a new task with up to 4 gradient updates, each with 40 samples. Results are shown in Fig. 3.<br />
<br />
[[File:ershad_results3.png|500px|center|thumb|Figure 3: Top: quantitative results from 2D navigation task, Bottom: qualitative comparison between model learned with MAML and with fine-tuning from a pretrained network ]]<br />
<br />
Locomotion: To study how well MAML can scale to more complex deep RL problems, adaptation is also studied on high-dimensional locomotion tasks with the MuJoCo simulator [5]. The tasks require two simulated robots – a planar cheetah and a 3D quadruped (the “ant”) – to run in a particular direction or at a particular velocity. In the goal velocity experiments, the reward is the negative absolute value of the difference between the current velocity of the agent and a goal, which is chosen uniformly at random between 0 and 2 for the cheetah and between 0 and 3 for the ant. In the goal direction experiments, the reward is the magnitude of the velocity in either the forward or backward direction, chosen at random for each task in $p(\mathcal{T})$. The horizon is $H = 200$, with 20 rollouts per gradient step for all problems except the ant forward/backward task, which used 40 rollouts per step. The results in Figure 4 show that MAML learns a model that can quickly adapt its velocity and direction with even <br />
just a single gradient update, and continues to improve with more gradient steps. The results also show that, on these challenging tasks, the MAML initialization substantially outperforms random initialization and pretraining.<br />
[[File:ershad_results4.png|500px|center|thumb|Figure 4: Reinforcement learning results for the half-cheetah and ant locomotion tasks, with the tasks shown on the far right. ]]<br />
<br />
A conceptual way to achieve fast adaptation in language modeling tasks (not experimented on by the authors) would be to attach an attention kernel, which results in a simple and differentiable loss. This has been implemented in one-shot language modeling, along with state-of-the-art improvements in one-shot learning on ImageNet and Omniglot [7].<br />
<br />
='''Conclusion'''=<br />
<br />
The paper introduced a meta-learning method based on learning easily adaptable model parameters through gradient descent. The approach has a number of benefits. It is simple and does not introduce any learned parameters for meta-learning. It can be combined with any model representation that is amenable to gradient-based training, and any differentiable objective, including classification, regression, and reinforcement learning. Lastly, since the method merely produces a weight initialization, adaptation can be performed with any amount of data and any number of gradient steps, though it demonstrates state-of-the-art results on classification with only one or five examples per class. The authors also show that the method can adapt an RL agent using policy gradients and a very modest amount of experience.<br />
<br />
='''Critique'''=<br />
In my opinion, Model-Agnostic Meta-Learning looks like a simplified form of curriculum learning. It treats all tasks the same over the whole training history, and does not consider the difficulty of the tasks or the adaptation of the neural network to each task. Curriculum learning could be a good way to speed up the training.<br />
<br />
<br />
='''References'''=<br />
# Schmidhuber, Jürgen. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 1992.<br />
# Lake, Brenden M, Salakhutdinov, Ruslan, Gross, Jason, and Tenenbaum, Joshua B. One shot learning of simple visual concepts. In Conference of the Cognitive Science Society (CogSci), 2011.<br />
# Santoro, Adam, Bartunov, Sergey, Botvinick, Matthew, Wierstra, Daan, and Lillicrap, Timothy. Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning (ICML), 2016.<br />
# Duan, Yan, Chen, Xi, Houthooft, Rein, Schulman, John, and Abbeel, Pieter. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning (ICML), 2016.<br />
# Todorov, Emanuel, Erez, Tom, and Tassa, Yuval. Mujoco: A physics engine for model-based control. In International Conference on Intelligent Robots and Systems (IROS), 2012.<br />
# Videos the learned policies can be found in https://sites.google.com/view/maml.<br />
# Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, Daan Wierstra. "Matching Networks for One Shot Learning". arXiv:1606.04080 [cs.LG]<br />
<br />
Implementation Example: https://github.com/cbfinn/maml</div>Jdenghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Alternative_Neural_Network:_Exploring_Contexts_As_Early_As_Possible_For_Action_Recognition&diff=31020Deep Alternative Neural Network: Exploring Contexts As Early As Possible For Action Recognition2017-11-21T18:56:20Z<p>Jdeng: /* Alternative Layer */</p>
<hr />
<div>==Introduction==<br />
<br />
Action recognition deals with recognizing and classifying the actions or activities performed by humans or other agents in a video clip. In action recognition, contexts contribute semantic clues (see the figure below [8]). Convolutional Neural Networks (CNNs) [1,2,3] and their spatio-temporal extension, 3D CNNs [4,5,6], have been employed for action recognition, but they identify and aggregate the contexts only at later stages. <br />
[[File:ActionRecognition1.jpg|center|400px|border|context and action region]]<br />
<br />
The authors propose a strategy to identify contexts in videos as early as possible and to leverage their evolution for action recognition. These networks involve many layers, and the first layer typically has a receptive field (RF) that captures only local features. As we go deeper into the network, the receptive fields expand and we start capturing contexts. The authors observe that simply stacking more layers adds a parameter burden, and that contexts can instead be obtained at earlier stages. They also cite work [9,10] relating CNNs to the visual system of the brain; one remarkable difference is the abundance of recurrent connections in the brain compared to the purely feed-forward connections in CNNs. In summary, this paper proposes a novel neural network for action recognition, called the deep alternative neural network (DANN). The novel component is an "alternative layer", composed of a volumetric convolutional layer followed by a recurrent layer. In addition, the authors propose a new approach to select the network input based on optical flow. DANN is validated on the HMDB51 and UCF101 datasets, where it achieves performance comparable to state-of-the-art methods.<br />
<br />
The main contributions in the paper can be summarized as follows: <br />
* A Deep Alternative Neural Network (DANN) is proposed for action recognition. <br />
* DANN consists of alternative volumetric convolutional and recurrent layers. <br />
* An adaptive method to determine the temporal size of the video clip <br />
* A volumetric pyramid pooling layer to resize the output before fully connected layers.<br />
<br />
===Related Work===<br />
There already exists a closely related paper ([11]) in the literature which proposed a similar alternating architecture. In particular, both works propose alternating CNN-RNN architectures. This similarity was noted by Reviewer 1 in the NIPS review process.<br />
<br />
=== Optic Flow ===<br />
Optical flow or optic flow is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer and a scene.<br />
It can be used for affordance perception, the ability to discern possibilities for action within the environment.<br />
<br />
==Deep Alternative Neural Network:==<br />
===Adaptive Network Input===<br />
The input size of the video clip is generally determined empirically, and past approaches have used varying numbers of frames. For instance, many previous papers suggested using short intervals of 1 to 16 frames. More recent work [9] recognized that human actions often "span tens or hundreds of frames" and that longer intervals, such as 60 frames, outperform shorter ones. However, there is still no systematic way of determining the temporal input size of the network, which motivated the authors to develop their adaptive method. Past research shows that the motion energy intensity induced by human activity exhibits a regular periodicity. This signal can be approximately estimated by optical flow computation, as shown in Figure 1, and is particularly suitable for temporal estimation because: <br />
* the local minima and maxima landmarks probably correspond to characteristic gesture and motion <br />
* it is relatively robust to changes in camera viewpoint.<br />
<br />
The authors propose an adaptive method to automatically select the most discriminative video fragments using the density of optical flow energy, which exhibits regular periodicity. The optical flow energy of a flow field $(v_{x},v_{y})$ is defined as follows <br />
<br />
:<math>e(I)=\underset{(x,y)\in\mathbb{P}}{\operatorname{\Sigma}} ||v_{x}(x,y),v_{y}(x,y)||_{2}</math><br />
<br />
Here, $\mathbb{P}$ is the pixel-level set of selected interest points. The local minima and maxima landmarks $\{t\}$ of $\epsilon = \{e(I_1),\dots,e(I_t)\}$ are located, and for each pair of consecutive landmarks a video fragment $s$ is created by extracting the frames between them, $s = \{I_{t-1},\dots,I_t\}$.<br />
<br />
[[File:golfswing.png]]<br />
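The fragment-selection step above can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation: landmark detection here is a plain neighbor comparison, and the flow field is given as parallel coordinate lists.<br />

```python
import math

def flow_energy(vx, vy):
    # e(I): sum over selected interest points of the L2 norm of the flow vector.
    return sum(math.sqrt(x * x + y * y) for x, y in zip(vx, vy))

def select_fragments(energies):
    """Split a video into fragments at local extrema of per-frame flow energy.

    `energies` is the sequence e(I_1), ..., e(I_T); returns lists of frame
    indices, one per fragment between consecutive landmarks.
    """
    # Landmarks: frame indices that are local minima or maxima of e.
    landmarks = [0]
    for t in range(1, len(energies) - 1):
        prev, cur, nxt = energies[t - 1], energies[t], energies[t + 1]
        if (cur > prev and cur > nxt) or (cur < prev and cur < nxt):
            landmarks.append(t)
    landmarks.append(len(energies) - 1)

    # Each pair of consecutive landmarks delimits one video fragment.
    return [list(range(a, b + 1)) for a, b in zip(landmarks, landmarks[1:])]
```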
<br />
To deal with the varying lengths of video clips, the authors adopt the idea of spatial pyramid pooling (SPP) [12] and extend it to the temporal domain, developing a volumetric pyramid pooling (VPP) layer that transfers a video clip of arbitrary size into a fixed-length representation in the last alternative layer before the fully connected layers.<br />
<br />
===Alternative Layer===<br />
This key layer consists of a standard volumetric convolutional layer followed by a recurrent layer. The volumetric convolution extracts features from local neighborhoods, and the recurrent layer is applied to its output iteratively for T steps. The input of a unit at position (x,y,z) in the jth feature map of the ith AL at time t, $u_{ij}^{xyz}(t)$, is given by<br />
<br />
:<math>u_{ij}^{xyz}(t) = u_{ij}^{xyz}(0) + f(w_{ij}^{r}u_{ij}^{xyz}(t-1)) + b_{ij} \\ <br />
u_{ij}^{xyz}(0) = f(w_{i-1}^{c}u_{(i-1)j}^{xyz}) <br />
</math><br />
<br />
* $u_{ij}^{xyz}(0)$: feed-forward output of the volumetric convolutional layer <br />
* $u_{ij}^{xyz}(t-1)$: recurrent input from the previous time step <br />
* $w_{k}^{c}$ and $w_{k}^{r}$: vectorized feed-forward and recurrent kernels, respectively <br />
* $f$: ReLU followed by a local response normalization (LRN), which mimics the lateral inhibition in the cortex where different features compete for large responses. <br />
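As a toy illustration of this recurrence, here is a scalar version of a single AL unit in Python; real AL units are volumetric convolutions over neighborhoods, and LRN is omitted in this sketch.<br />

```python
def alternative_layer_unit(x, w_c, w_r, b, T):
    """Scalar toy version of one AL unit.

    u(0) = f(w_c * x);  u(t) = u(0) + f(w_r * u(t-1)) + b,  iterated T times,
    where f is ReLU. The recurrent weight w_r is shared across all T steps.
    """
    relu = lambda v: max(0.0, v)
    u0 = relu(w_c * x)         # feed-forward output of the conv layer
    u = u0
    for _ in range(T):         # recurrent refinement with shared weights
        u = u0 + relu(w_r * u) + b
    return u
```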
<br />
Figure 3 depicts this structure:<br />
[[File:unfolded.PNG|1000px]]<br />
<br />
The recurrent connections in the AL provide two advantages. First, they enable every unit to incorporate contexts from an arbitrarily large region in the current layer; the drawback is that, without top-down connections, the states of units in the current layer cannot be influenced by the context seen by higher-level units. Second, the recurrent connections increase the network depth while keeping the number of adjustable parameters constant through weight sharing, since the AL only adds the constant number of parameters of a recurrent kernel.<br />
<br />
===Volumetric Pyramid Pooling Layer===<br />
<br />
[[File:Volumetric Pyramid Pooling Layer.png|thumb|550px|Figure 2: Volumetric Pyramid Pooling Layer]]<br />
The authors replace the last pooling layer with a volumetric pyramid pooling layer (VPPL), since the fully connected layers require fixed-length vectors while the AL accepts video clips of arbitrary size and produces outputs of variable size. Figure 2 illustrates the structure of the VPPL. Max pooling is used to pool the responses of each kernel in each volumetric bin. The outputs are $kM$-dimensional vectors, where:<br />
<br />
$M$: number of bins <br />
<br />
$k$: number of kernels in the last alternative layer.<br />
<br />
This layer structure allows not only for arbitrary-length videos, but also arbitrary aspect ratios and scales.<br />
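A one-dimensional sketch of the pyramid pooling idea follows; the volumetric version pools 3-D bins per kernel, and the even-split binning here is an illustrative choice rather than the authors' exact scheme.<br />

```python
def pyramid_pool(responses, levels=(1, 2, 4)):
    """Max-pool a variable-length response sequence into fixed-size bins.

    Each pyramid level splits the sequence into `n` bins and takes the max
    of each, so the output length sum(levels) is independent of the input
    length -- the property the VPPL needs before the fully connected layers.
    """
    out = []
    L = len(responses)
    for n in levels:
        for i in range(n):
            lo = i * L // n
            hi = max((i + 1) * L // n, lo + 1)  # every bin is non-empty
            out.append(max(responses[lo:hi]))
    return out
```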
<br />
This is reminiscent of spatial pyramid pooling in deep convolutional networks: in a plain CNN, all training inputs must share the same dimensions so that the subsequent classifier can be trained effectively, and spatial pyramid pooling was introduced to remove this fixed-size constraint.<br />
<br />
==Overall Architecture== <br />
[[File:DANN Architecture.png|thumb|550px|Figure 3:DANN Architecture]]<br />
The following are the components of the DANN (as shown in Figure 3)<br />
* 6 Alternative layers with 64, 128, 256, 256, 512 and 512 kernel response maps <br />
* 5 ReLU and volumetric pooling layers <br />
* 1 volumetric pyramid pooling layer <br />
* 3 fully connected layers of size 2048 each <br />
* A softmax layer<br />
<br />
==Implementation details==<br />
The authors used the Torch toolbox for the implementations of volumetric convolutions, recurrent layers, and optimization. For data augmentation they used a technique called random clipping, in which a point is selected randomly from the input video to produce a fixed-size 80x80xt crop after determining the temporal size t. This technique is preferred to the common alternative of pre-processing the data into pre-segmented clips with a sliding window, which limits the amount of data when the windows do not overlap. For training, the authors used SGD on mini-batches of size 30 with a negative log likelihood criterion. Training minimizes the cross-entropy loss using the backpropagation through time (BPTT) algorithm. During testing, a video is divided into 80x80xt clips with a stride of 4 frames, followed by testing with 10 crops. The final score is the average of all clip-level and crop-level scores.<br />
Data augmentation techniques such as multi-scale cropping have been evaluated due to the recent success of Very Deep Two-Stream ConvNets. Intuitively, the corner cropping strategy could provide better results, since the receptive fields can then focus on the central regions of the video frames [7].<br />
<br />
==Evaluations==<br />
===Datasets:===<br />
* The datasets used in the evaluation are UCF101 and HMDB51 <br />
* UCF101 – 13K videos annotated into 101 classes <br />
* HMDB51 – 6.8K videos with 51 actions. <br />
* Three training and test splits are provided <br />
* Performance measured by mean classification accuracy across the splits. <br />
* UCF101 split – 9.5K videos; HMDB51 – 3.7K training videos.<br />
<br />
===Quantitative Results===<br />
The authors compared three input modalities, namely RGB, sparse optical flow, and TVL1 optical flow, and found TVL1 most suitable, since actions are easier to learn from motion information than from raw pixel values. The influence of data augmentation was also studied. With sliding windows at 75% overlap as the baseline, random clipping and multi-scale clipping both outperformed the baseline on the UCF101 split-1 dataset. The adaptive temporal length gave a boost of 4.2% over architectures with a fixed temporal length. Experiments were also conducted to see whether learning on one dataset could improve accuracy on another: fine-tuning HMDB51 from UCF101 boosted performance from 56.4% to 62.5%. The authors also observed that increasing the number of AL layers improves performance, as larger contexts are embedded into the DANN. The DANN achieved an overall accuracy of 65.9% on HMDB51 and 91.6% on UCF101.<br />
<br />
<br />
[[File:Performance Comparison of different input modalities.png]]<br />
<br />
===Qualitative Analysis===<br />
The authors have discussed the quality of the prediction in the video clips taking examples of two different scenes involving bowling and haircut. In the bowling scene, the adaptive temporal choice used by DANN could aggregate more reasonable semantic structures and hence it leveraged reasonable video clips as input. On the other hand, the performance on the haircut video clip was not up to the mark as the rich contexts provided by the DANN was not helpful in a setting with simple actions performed in a simple background.<br />
<br />
==Conclusions and Critique==<br />
* Deep alternative neural network is introduced for action recognition.<br />
* The key new component is an "alternative layer" which is composed of a convolutional layer followed by a recurrent layer. As the paper targets action recognition in video, the convolutional layer acts on a 3D spatio-temporal volume.<br />
* DANN consists of volumetric convolutional layer and a recurrent layer. <br />
* A preprocessing stage based on optical flow is used to select video fragments to feed to the neural network.<br />
* The authors experimented with the HMDB51 and UCF101 datasets under different scenarios and compared the performance of DANN with other approaches. <br />
* The spatial size is still chosen in an ad hoc manner and this can be an area of improvement. <br />
* There are prospects for studying action tube which is a more compact input.<br />
* The paper uses volumetric convolutional layer, but it doesn't say how it is better than recurrent neural networks in exploring temporal information.<br />
* There is no experimental evidence to compare the proposed method with long-term recurrent convolutional network. Also there is no analysis of time complexity of the approach used.<br />
<br />
Github code: https://github.com/wangjinzhuo/DANN<br />
<br />
In the formal review of the paper [https://media.nips.cc/nipsbooks/nipspapers/paper_files/nips29/reviews/480.html], some interesting criticisms of the paper are surfaced. For starters, one reviewer notes that a similar architecture was proposed in [https://arxiv.org/abs/1511.06432], limiting the novelty of the approach somewhat. The reviewers question the validity of the approach in even slightly more complicated settings (i.e. any non-static camera, which brings in the issue of optical flow). Other criticisms come from a lack of clear motivation for choices that the authors have made, for instance, the use of Local Response Normalization has fallen slightly out-of-favour, or the benefit of using a sliding window approach during testing (and random clips during training).<br />
<br />
Quantitatively, the benefits of the authors' approach are not readily apparent. In comparisons with the state of the art, the proposed model performs worse on HMDB51, and while they claim the highest performance on UCF101, the increase is merely 0.1% over previous best efforts.<br />
<br />
==References==<br />
<br />
[1] Andrej Karpathy, George Toderici, Sachin Shetty, Tommy Leung, Rahul Sukthankar, and Li FeiFei. Large-scale video classification with convolutional neural networks. In CVPR, pages 1725–1732, 2014 <br />
<br />
[2] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, pages 568–576, 2014. <br />
<br />
[3]Limin Wang, Yu Qiao, and Xiaoou Tang. Action recognition with trajectory-pooled deepconvolutional descriptors. In CVPR, pages 4305–4314, 2015. <br />
<br />
[4] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3d convolutional neural networks for human action recognition. TPAMI, 35(1):221–231, 2013. <br />
<br />
[5] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In ICCV, pages 4489–4497, 2015. <br />
<br />
[6]Gül Varol, Ivan Laptev, and Cordelia Schmid. Long-term temporal convolutions for action recognition. arXiv preprint arXiv:1604.04494, 2016. <br />
<br />
[7]Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao. Towards Good Practices for Very Deep Two-Stream ConvNets. arXiv preprint arXiv:1507.02159 , 2015. <br />
<br />
[8] IEEE International Symposium on Multimedia 2013 <br />
<br />
[9] Gül Varol, Ivan Laptev, and Cordelia Schmid. Long-term temporal convolutions for action<br />
recognition. arXiv preprint arXiv:1604.04494, 2016<br />
<br />
[10] https://en.wikipedia.org/wiki/Optical_flow<br />
<br />
[11] Delving Deeper into Convolutional Networks for Learning Video Representations Nicolas Ballas, Li Yao, Chris Pal, Aaron Courville, ICLR 2016 <br />
<br />
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. TPAMI, 37(9):1904–1916, 2015.<br />
<br />
[13] Christopher Zach, Thomas Pock, and Horst Bischof. A duality based approach for realtime TV-L1 optical flow. In Pattern Recognition, pages 214–223. 2007.<br />
<br />
A list of expert reviews: http://media.nips.cc/nipsbooks/nipspapers/paper_files/nips29/reviews/480.html</div>Jdenghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Exploration_via_Bootstrapped_DQN&diff=31019Deep Exploration via Bootstrapped DQN2017-11-21T18:38:46Z<p>Jdeng: /* Why use Bernoulli? */</p>
<hr />
<div>== Details ==<br />
<br />
'''Title''': Deep Exploration via Bootstrapped DQN<br />
<br />
'''Authors''': Ian Osband {1,2}, Charles Blundell {2}, Alexander Pritzel {2}, Benjamin Van Roy {1}<br />
<br />
'''Organisations''':<br />
# Stanford University<br />
# Google Deepmind<br />
<br />
'''Conference''': NIPS 2016<br />
<br />
'''URL''': [https://papers.nips.cc/paper/6501-deep-exploration-via-bootstrapped-dqn papers.nips.cc]<br />
<br />
'''Online code sources'''<br />
* [https://github.com/iassael/torch-bootstrapped-dqn github.com/iassael/torch-bootstrapped-dqn]<br />
<br />
This summary contains background knowledge from Section 2-7 (except Section 5). Feel free to skip if you already know.<br />
<br />
== Intro to Reinforcement Learning ==<br />
<br />
In reinforcement learning, an agent interacts with an environment with the goal of maximizing its long-term reward. A classic illustration of the exploration problem is the [https://en.wikipedia.org/wiki/Multi-armed_bandit multi armed bandit problem]. In a multi armed bandit problem, there is a gambler and $n$ slot machines, and the gambler can choose to play any specific slot machine at any time. Each slot machine churns out rewards according to its own probability distribution, which is unknown to the gambler. So the question is: how can the gambler learn to get the maximum long-term reward?<br />
<br />
There are two things the gambler can do at any instance: either he can try a new slot machine, or he can play the slot machine he has tried before (and he knows he will get some reward). However, even though trying a new slot machine feels like it would bring less reward to the gambler, it is possible that the gambler finds out a new slot machine that gives a better reward than the current best slot machine. This is the dilemma of '''exploration vs exploitation'''. Trying out a new slot machine is '''exploration''', while redoing the best move so far is '''exploiting''' the currently understood perception of the reward.<br />
<br />
[[File:multiarmedbandit.jpg|thumb|Source: [https://blogs.mathworks.com/images/loren/2016/multiarmedbandit.jpg blogs.mathworks.com]]]<br />
<br />
There are many strategies to approach this '''exploration-exploitation dilemma'''. Some [https://web.stanford.edu/class/msande338/lec9.pdf common strategies] for optimizing in an exploration-exploitation setting are Random Walk, Curiosity-Driven Exploration, and Thompson Sampling. Many of these approaches are provably efficient, but assume that the state space is not very large. For instance, Curiosity-Driven Exploration aims to take actions that lead to immediate additional information, which requires the model to search "every possible cell in the grid"; this is not desirable if the state space is very large. Strategies for large state spaces often either ignore exploration or do something naive like $\epsilon$-greedy, where you exploit with probability $1-\epsilon$ and explore randomly in the rest of the cases. The general idea for tackling large or continuous state spaces is value function approximation. An empirically tested strategy is value function approximation using the Fourier basis [16], which has also been shown to perform well compared to radial basis functions and the polynomial basis, the two most popular fixed bases for linear value function approximation. <br />
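The $\epsilon$-greedy rule mentioned above fits in a few lines; the `q_values` list here is a stand-in for whatever value estimates the agent maintains.<br />

```python
import random

def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """With probability epsilon pick a random action (explore);
    otherwise pick the action with the highest value estimate (exploit)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                     # explore
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit
```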
<br />
This paper tries to use a Thompson sampling like approach to make decisions.<br />
<br />
== Thompson Sampling<sup>[[#References|[1]]]</sup> ==<br />
<br />
In Thompson sampling, our goal is to reach a belief that resembles the truth. Consider the case of coin tosses (a Bernoulli bandit). Suppose we want to reach a satisfactory posterior for $\mathbb{P}_h$, the probability of heads. Since the rewards are $0$ or $1$, we can start off with $\mathbb{P}_h^{(0)}=\beta(1,1)$. The $\beta(x,y)$ distribution is a natural choice of prior because it is conjugate to Bernoulli rewards; further, $\beta(1,1)$ is the uniform distribution on $[0,1]$.<br />
<br />
Now, at every iteration $t$, we observe the reward $R^{(t)}$ and try to make our belief close to the truth by doing a Bayesian computation. Assuming $p$ is the probability of getting a heads,<br />
<br />
$$<br />
\begin{align*}<br />
\mathbb{P}(R|D) &\propto \mathbb{P}(D|R) \cdot \mathbb{P}(R) \\<br />
\mathbb{P}_h^{(t+1)}&\propto \mbox{likelihood}\cdot\mbox{prior} \\<br />
&\propto p^{R^{(t)}}(1-p)^{1-R^{(t)}} \cdot \mathbb{P}_h^{(t)} \\<br />
&\propto p^{R^{(t)}}(1-p)^{1-R^{(t)}} \cdot \beta(x_t, y_t) \\<br />
&\propto p^{R^{(t)}}(1-p)^{1-R^{(t)}} \cdot p^{x_t-1}(1-p)^{y_t-1} \\<br />
&\propto p^{x_t+R^{(t)}-1}(1-p)^{(y_t+1-R^{(t)})-1} \\<br />
&\propto \beta(x_t+R^{(t)}, y_t+1-R^{(t)})<br />
\end{align*}<br />
$$<br />
<br />
[[File:thompson sampling coin example.png|thumb||||600px|Source: [https://www.quora.com/What-is-Thompson-sampling-in-laymans-terms Quora]]]<br />
<br />
This means that with successive sampling, our belief can become better at approximating the truth. There are similar update rules if we use a non Bernoulli setting, say, Gaussian. In the Gaussian case, we start with $\mathbb{P}_h^{(0)}=\mathbb{N}(0,1)$ and given that $\mathbb{P}_h^{(t)}\propto\mathbb{N}(\mu, \sigma)$ it is possible to show that the update rule looks like<br />
<br />
$$<br />
\mathbb{P}_h^{(t+1)} \propto \mathbb{N}\bigg(\frac{t\mu+R^{(t)}}{t+1},\frac{\sigma}{\sigma+1}\bigg)<br />
$$<br />
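Putting the Beta-Bernoulli updates together, here is a small self-contained Thompson sampling loop on a toy bandit; the arm reward probabilities and step count are made up for illustration.<br />

```python
import random

def thompson_bandit(true_probs, steps, seed=0):
    """Beta-Bernoulli Thompson sampling on a toy multi-armed bandit.

    Keeps a Beta(x, y) posterior per arm, starting from Beta(1, 1); each
    step samples from every posterior, pulls the arm with the highest
    sample, and applies the conjugate update x += r, y += 1 - r.
    Returns the pull count per arm.
    """
    rng = random.Random(seed)
    n = len(true_probs)
    x = [1] * n                 # successes + 1
    y = [1] * n                 # failures + 1
    pulls = [0] * n
    for _ in range(steps):
        samples = [rng.betavariate(x[i], y[i]) for i in range(n)]
        arm = max(range(n), key=samples.__getitem__)
        r = 1 if rng.random() < true_probs[arm] else 0
        x[arm] += r
        y[arm] += 1 - r
        pulls[arm] += 1
    return pulls
```

With enough steps, the posterior of the better arm concentrates and that arm dominates the pull counts.<br />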
<br />
=== How can we use this in reinforcement learning? ===<br />
<br />
We can use this idea to decide when to explore and when to exploit. We start with an initial belief, choose an action, observe the reward and based on the kind of reward, we update our belief about what action to choose next.<br />
<br />
== Bootstrapping <sup>[[#References|[2,3]]]</sup> ==<br />
<br />
This idea may be unfamiliar to some people, so I thought it would be a good idea to include this. In statistics, bootstrapping is a method to generate new samples from a given sample. Suppose that we have a given population, and we want to study a property $\theta$ of the population. So, we just find $n$ sample points (sample $\{D_i\}_{i=1}^n$), calculate the estimator of the property, $\hat{\theta}$, for these $n$ points, and make our inference. <br />
<br />
If we later wish to find some property related to the estimator $\hat{\theta}$ itself, e.g. we want a bound of $\hat{\theta}$ such that $\delta_1 \leq \hat{\theta} \leq \delta_2$ with a confidence of $c=95\%$, then we can use bootstrapping for this.<br />
<br />
Using bootstrapping, we can create a new sample $\{D'_i\}_{i=1}^{n'}$ by '''randomly sampling $n'$ times from $D$, with replacement'''. So, if $D=\{1,2,3,4\}$, a $D'$ of size $n'=10$ could be $\{1,4,4,3,2,2,2,1,3,4\}$. We do this a sufficiently large number $k$ of times, calculate $\hat{\theta}$ each time, and thus get a distribution $\{\hat{\theta}_i\}_{i=1}^k$. Now, we can choose the $100\cdot\frac{1-c}{2}$<sup>th</sup> and $100\cdot\frac{1+c}{2}$<sup>th</sup> percentiles of this distribution (let them be $\hat{\theta}_\alpha$ and $\hat{\theta}_\beta$ respectively) and say<br />
<br />
$$\hat{\theta}_\alpha \leq \hat{\theta} \leq \hat{\theta}_\beta, \mbox{with confidence }c$$<br />
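The percentile-bootstrap recipe above can be sketched as follows; the choice of estimator, $k$, and confidence level are illustrative defaults.<br />

```python
import random
import statistics

def bootstrap_ci(data, estimator=statistics.mean, k=1000, c=0.95, seed=0):
    """Percentile bootstrap confidence interval for an estimator.

    Draws k resamples of size len(data) with replacement, computes the
    estimator on each, and returns the (1-c)/2 and (1+c)/2 quantiles of
    the resulting distribution of estimates.
    """
    rng = random.Random(seed)
    n = len(data)
    stats = sorted(estimator([rng.choice(data) for _ in range(n)])
                   for _ in range(k))
    lo = stats[int((1 - c) / 2 * (k - 1))]
    hi = stats[int((1 + c) / 2 * (k - 1))]
    return lo, hi
```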
<br />
== Why choose bootstrap and not dropout? ==<br />
<br />
There is previous work<sup>[[#References|[4]]]</sup> that establishes dropout as a good way to train NNs on a posterior such that the trained NN works like a function approximator that is close to the actual posterior. But, there are several problems with the predictions of this trained NN. The figures below are from the appendix of this paper. The left image is the NN trained by the authors of this paper on a sample noisy distribution and the right image is from the accompanying web demo from [[#References|[4]]], where the authors of [[#References|[4]]] show that their NN converges around the mean with a good confidence.<br />
<br />
[[File:dropout_results.png|thumb||center||700px|Source: this paper's appendix]]<br />
<br />
According to the authors of this paper,<br />
# Even though [[#References|[4]]] says that dropout converges around the mean, their experiment actually behaves strangely around a reasonable point like $x=0.75$. The authors think this happens because dropout only affects the region local to the original data.<br />
# Samples from the NN trained on the original data do not look like a reasonable posterior (very spiky).<br />
# The trained NN collapses to zero uncertainty at the data points from the original data.<br />
<br />
== Q Learning and Deep Q Networks <sup>[[#References|[5]]]</sup> ==<br />
<br />
At any point of time, our rewards dictate what our actions should be. Also, in general, we want good long term rewards. For example, if we are playing a first person shooter game, it is a good idea to go out of cover to kill an enemy, even if some health is lost. Similarly, in reinforcement learning, we want to maximize our long term reward. So if at each time $t$, the reward is $r_t$, then a naive way is to say we want to maximise<br />
<br />
$$<br />
R_t = \sum_{i=0}^{\infty}r_{t+i}<br />
$$<br />
<br />
But, this reward is unbounded. So technically it could tend to $\infty$ in a lot of the cases. This is why we use a '''discounted reward'''.<br />
<br />
$$<br />
R_t = \sum_{i=0}^{\infty}\gamma^i r_{t+i}<br />
$$<br />
<br />
Here, we take $0\leq \gamma \lt 1$. So, what this means is that we value our current reward the most ($r_0$ has a coefficient of $1$), but we also consider the future possible rewards. So if we had two choices: get $+4$ now and $0$ at all other timesteps, or get $-2$ now and $+2$ after $3$ timesteps for $20$ timesteps, we choose the latter ($\gamma=0.9$). This is because $(+4) < (-2)+0.9^3(2+0.9\cdot2+\cdots+0.9^{19}\cdot2)$.<br />
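The comparison in this example can be checked directly; the reward sequences below encode the two options from the text.<br />

```python
def discounted_return(rewards, gamma=0.9):
    """R_t = sum_i gamma**i * r_{t+i} for a finite reward sequence."""
    return sum(gamma ** i * r for i, r in enumerate(rewards))

# Option A: +4 now, nothing afterwards.
option_a = discounted_return([4.0])

# Option B: -2 now, then +2 per step for 20 steps starting 3 steps from now.
option_b = discounted_return([-2.0, 0.0, 0.0] + [2.0] * 20)
```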
<br />
<br />
A '''policy''' $\pi: \mathbb{S} \rightarrow \mathbb{A}$ is just a function that tells us what action to take in a given state $s\in \mathbb{S}$. Our goal is to find the best policy $\pi^*$ that maximises the reward from a given state $s$. The '''value function''' of a state $s$ (which the agent is in at timestep $t$) under policy $\pi$ is defined as $V^\pi(s) = \mathbb{E}[R_t]$. The optimal value function is then simply<br />
<br />
$$<br />
V^*(s)=\displaystyle\max_{\pi}V^\pi(s)<br />
$$<br />
<br />
For convenience however, it is better to work with the '''Q function''' $Q: \mathbb{S}\times\mathbb{A} \rightarrow \mathbb{R}$. $Q$ is defined similarly as $V$. It is the expected return after taking an action $a$ in the given state $s$. So, $Q^\pi(s,a)=\mathbb{E}[R_t|s,a]$. The optimal $Q$ function is<br />
<br />
$$<br />
Q^*(s,a)=\displaystyle\max_{\pi}Q^\pi(s,a)<br />
$$<br />
<br />
Suppose that we know $Q^*$. Then, if we know that we are supposed to start at $s$ and take an action $a$ right now, what is the best course of action from the next time step? We just choose the optimal action $a'$ at the next state $s'$ that we reach. The optimal action $a'$ at state $s'$ is simply the argument $a_x$ that maximises our $Q^*(s',\cdot)$.<br />
<br />
$$<br />
a'=\displaystyle\arg\max_{a_x} Q^*(s',a_x)<br />
$$<br />
<br />
So, our best expected reward from $s$ taking action $a$ is $\mathbb{E}[r_t+\gamma\mathbb{E}[R_{t+1}]]$. This is known as the '''Bellman equation''' in optimal control. (Its continuous-time form is the '''Hamilton-Jacobi-Bellman equation''', or HJB equation, a very important partial differential equation.)<br />
<br />
$$<br />
Q^*(s,a)=\mathbb{E}[r_t+\gamma \displaystyle\max_{a_x} Q^*(s',a_x)]<br />
$$<br />
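A tabular sketch of this Bellman backup as a Q-learning update follows; the learning rate `alpha` is an added illustration parameter, and with function approximation this step becomes a gradient update instead.<br />

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9):
    """One tabular Q-learning step toward the Bellman target
    r + gamma * max_a' Q(s', a'). Q is a table: Q[state][action]."""
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])
    return Q[s][a]
```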
<br />
In Q learning, we use a deep neural network with weights $\theta$ as a function approximator for $Q^*$, since the Bellman equation is very difficult to solve exactly for large state spaces (its continuous-time form, the HJB equation, is a non-linear PDE). The '''naive way''' to do this is to design a deep neural network that takes as input the state $s$ and action $a$, and produces an approximation to $Q^*$. <br />
<br />
* Suppose our neural net weights are $\theta_i$ at iteration $i$.<br />
* We want to train our neural net on the case when we are at $s$, take action $a$, get reward $r$, and reach $s'$.<br />
* To find out what action is best from $s'$, i.e. $a'$, we have to simulate all actions from $s'$. We can do this after we complete this iteration, then run $s',a_x$ for all $a_x\in\mathbb{A}$. But, we don't know how to complete this iteration without knowing this $a'$. So, another way is to simulate all actions from $s'$ using last known set of weights $\theta_{i-1}$. We just simulate state $s'$, action $a_x$ for all $a_x\in\mathbb{A}$ from the previous state and get $Q^*(s',a_x;\theta_{i-1})$. ('''Note''' that some papers do not use the set of weights from the previous iteration $\theta_{i-1}$. Instead they fix the weights for finding the best action for every $\tau$ steps to $\theta^-$, and do $Q^*(s',a_x;\theta^-)$ for $a_x\in\mathbb{A}$ and use this for the target value.)<br />
* Now we can compute our loss function using the Bellman equation, and backpropagate.<br />
$$<br />
\mbox{loss}=(\mbox{target}-\mbox{prediction})^2=\Big(\big(r+\gamma\displaystyle\max_{a_x}Q^*(s',a_x;\theta_{i-1})\big)-Q^*(s,a;\theta_i)\Big)^2<br />
$$<br />
<br />
The '''problem''' with this approach is that at every iteration $i$, we have to do $|\mathbb{A}|$ forward passes on the previous set of weights $\theta_{i-1}$ to find out the best action $a'$ at $s'$. This becomes infeasible quickly with more possible actions.<br />
<br />
Authors of [[#References|[5]]] therefore use another kind of architecture. This architecture takes as input the state $s$, and computes the values $Q^*(s,a_x)$ for $a_x\in\mathbb{A}$. So there are $|\mathbb{A}|$ outputs. This basically parallelizes the forward passes so that $r+\displaystyle\max_{a_x}Q^*(s',a_x;\theta_{i-1})$ can be done with just a single pass through the outputs.<br />
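The point about the multi-output architecture can be sketched as follows, where `q_next_fn` stands in for a single forward pass of the previous (or target) network that returns the whole vector $Q(s',\cdot)$, so the max over actions needs no extra passes.<br />

```python
def td_targets(q_next_fn, batch, gamma=0.99):
    """Build DQN regression targets from transitions (s, a, r, s2).

    q_next_fn(s2) returns the full action-value vector Q(s2, .) in one
    call, mirroring the |A|-output architecture described above.
    """
    return [r + gamma * max(q_next_fn(s2)) for (s, a, r, s2) in batch]
```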
<br />
<br />
[[File:DQN_arch.png|thumb||||600px|Source: [https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/image_folder_7/DQNBreakoutBlocks.png leonardoaraujosantos.gitbooks.io]]]<br />
<br />
'''Note:''' When I say state $s$ as an input, I mean some representation of $s$. Since the environment is a partially observable MDP, it is hard to know $s$ exactly. So we can, for example, apply a CNN to the frames to get an estimate of the current state, and pass this output to the input of the DNN (the DNN then acts as the fully connected layers on top of the CNN).<br />
<br />
=== Experience Replay ===<br />
<br />
Authors of this paper borrow the concept of experience replay from [[#References|[5,6]]]. In experience replay, we do training in episodes. In each episode, we play and store consecutive $(s,a,r,s')$ tuples in the experience replay buffer. Then after the play, we choose random samples from this buffer and do our training.<br />
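A minimal sketch of such a buffer (the class and method names are my own, not from the paper):<br />

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s') transitions with uniform sampling."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # oldest transitions fall off the front

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform sampling breaks the correlation between consecutive transitions.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=1000)
for t in range(100):
    buf.push(t, t % 4, 0.0, t + 1)   # dummy transitions recorded during play
batch = buf.sample(32)               # after the play, train on a random minibatch
```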
<br />
<br />
Advantages of experience replay over simple online Q learning<sup>[[#References|[5]]]</sup>:<br />
* '''Better data efficiency''': each transition can be reused for many updates, rather than being discarded after a single one.<br />
* Learning from consecutive samples is difficult because the data are correlated. Experience replay breaks this correlation.<br />
* In online learning the next input is determined by the previous action. So, if the maximising action is to go left in some game, the next inputs would all be about what happens when we go left. This can cause the optimiser to get stuck in a feedback loop, or even diverge, as [[#References|[7]]] points out.<br />
<br />
== Double Q Learning ==<br />
<br />
=== Problem with Q Learning<sup>[[#References|[8]]]</sup> ===<br />
<br />
For a simple neural network, each update tries to shift the current $Q^*$ estimate to a new value:<br />
<br />
$$<br />
Q^*(s,a) \leftarrow r+\gamma\displaystyle\max_{a_x}Q^*(s',a_x)<br />
$$<br />
<br />
Suppose the neural net has some inherent noise $\epsilon$. So, the neural net actually stores a value $\mathbb{Q}^*$ given by<br />
<br />
$$<br />
\mathbb{Q}^* = Q^*+\epsilon<br />
$$<br />
<br />
Even if $\epsilon$ has zero mean in the beginning, using the $\max$ operator at the update steps will start propagating $\gamma\cdot\max \mathbb{Q}^*$. This leads to a non zero mean subsequently. The problem is that "max causes overestimation because it does not preserve the zero-mean property of the errors of its operands." ([[#References|[8]]]) Thus, Q learning is more likely to choose overoptimistic values.<br />
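A quick toy experiment (my own, not from [8]) makes this bias visible: even when every true value is zero and the noise is zero-mean, the max of the noisy estimates is positive on average.<br />

```python
import numpy as np

rng = np.random.default_rng(0)

true_q = np.zeros(4)                             # every action is truly worth 0
noise = rng.normal(0.0, 1.0, size=(100_000, 4))  # zero-mean estimation error eps
noisy_q = true_q + noise                         # what the network actually stores

mean_error = noisy_q.mean()            # close to 0: the errors themselves are unbiased
mean_max = noisy_q.max(axis=1).mean()  # clearly positive: max breaks the zero mean
```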
<br />
=== How does Double Q Learning work? <sup>[[#References|[9]]]</sup> ===<br />
<br />
The problem can be solved by using two sets of weights $\theta$ and $\Theta$. The $\mbox{target}$ can be broken up as<br />
<br />
$$<br />
\mbox{target} = r+\gamma\displaystyle\max_{a_x}Q^*(s',a_x;\theta) = r+\gamma Q^*(s',\displaystyle\arg\max_{a_x}Q^*(s',a_x;\theta);\theta) = r+\gamma Q^*(s',a';\theta)<br />
$$<br />
<br />
Using double Q learning, we '''select''' the best action using current weights $\theta$ and '''evaluate''' the $Q^*$ value to decide the target value using $\Theta$.<br />
<br />
$$<br />
\mbox{target} = r+\gamma Q^*(s',\displaystyle\arg\max_{a_x}Q^*(s',a_x;\theta);\Theta) = r+\gamma Q^*(s',a';\Theta)<br />
$$<br />
<br />
This makes the evaluation fairer.<br />
<br />
=== Double Deep Q Learning ===<br />
<br />
[[#References|[9]]] further talks about how to use this for deep learning without much additional overhead. The suggestion is to use $\theta^-$ as $\Theta$.<br />
<br />
$$<br />
\mbox{target} = r+\gamma Q^*(s',\displaystyle\arg\max_{a_x}Q^*(s',a_x;\theta);\theta^-) = r+\gamma Q^*(s',a';\theta^-)<br />
$$<br />
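The select/evaluate split can be sketched side by side with the plain DQN target (hypothetical linear Q-networks; only the decoupling is the point):<br />

```python
import numpy as np

rng = np.random.default_rng(1)
n_actions, state_dim, gamma = 4, 8, 0.99

theta = rng.normal(size=(n_actions, state_dim))        # online weights: select a'
theta_minus = rng.normal(size=(n_actions, state_dim))  # frozen weights: evaluate a'

def q_values(state, weights):
    return weights @ state

s_next, r = rng.normal(size=state_dim), 1.0

# Double DQN: select with theta, evaluate with theta^-.
a_best = int(np.argmax(q_values(s_next, theta)))
double_target = r + gamma * q_values(s_next, theta_minus)[a_best]

# Plain DQN: select and evaluate with the same (frozen) weights.
plain_target = r + gamma * q_values(s_next, theta_minus).max()
```

Since the plain target takes a max over the same vector that double DQN merely indexes into, plain_target is always at least double_target; that systematic upward pull is exactly what the decoupling removes.<br />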
<br />
== Bootstrapped DQN ==<br />
<br />
The authors propose an architecture that has a shared network and $K$ bootstrap heads. So, suppose our experience buffer $E$ has $n$ data points, where each datapoint is a $(s,a,r,s')$ tuple. Each bootstrap head trains on a different buffer $E_i$, where each $E_i$ has been constructed by sampling $n$ data points from the original experience buffer $E$ with replacement ('''bootstrap method''').<br />
<br />
<br />
Because each head trains on a different buffer, each models a different $Q^*$ function (say $Q^*_k$). Now, for each episode, we first choose a specific $Q^*_k=Q^*_s$. This $Q^*_s$ is used to generate the experience for that episode: from any state $s_t$, we populate the experience buffer by choosing the next action $a_t$ that maximises $Q^*_s$ (similar to '''Thompson Sampling''').<br />
<br />
$$<br />
a_t = \displaystyle\arg\max_a Q^*_s(s_t,a)<br />
$$<br />
<br />
Also, along with $s_t,a_t,r_t,s_{t+1}$, they push a bootstrap mask $m_t$. This mask is a binary vector of size $K$ that tells which $Q_k$ should be affected by this datapoint, if it is chosen as a training point. For example, if $K=5$ and there is an experience tuple $(s_t,a_t,r_t,s_{t+1},m_t)$ where $m_t=(0,1,1,0,1)$, then $(s_t,a_t,r_t,s_{t+1})$ should only affect $Q_2,Q_3$ and $Q_5$.<br />
<br />
<br />
So, at each iteration, we just choose a few points from this buffer and train the respective $Q_{(\cdot)}$ heads based on the bootstrap masks.<br />
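A toy tabular stand-in (my own simplification of the shared-network-plus-heads architecture; the one-step target omits the bootstrap term for brevity) shows how the mask gates the updates:<br />

```python
import numpy as np

rng = np.random.default_rng(2)
K = 5   # number of bootstrap heads

# Toy per-head tabular Q-functions standing in for the K network heads.
Q = [dict() for _ in range(K)]

def train_on(transition, lr=0.5):
    """Update only the heads that the bootstrap mask switches on."""
    s, a, r, s_next, mask = transition
    for k in range(K):
        if mask[k] == 0:
            continue                      # this datapoint must not affect head k
        q = Q[k].get((s, a), 0.0)
        Q[k][(s, a)] = q + lr * (r - q)   # one-step target, no bootstrap term here

# Mask (0,1,1,0,1): the tuple affects Q_2, Q_3 and Q_5 only (1-indexed, as above).
train_on((0, 1, 1.0, 2, np.array([0, 1, 1, 0, 1])))
```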
<br />
=== How to generate masks? ===<br />
<br />
Masks are created by sampling from the '''masking distribution'''. Now, there are many ways to choose this masking distribution:<br />
<br />
* If for each datapoint $D_i$ ($i=1$ to $n$) we sample a mask from $\mbox{Bernoulli}(0.5)$, roughly half the points from the original buffer are kept. To recover an effective size of $n$, we double the weight of each kept datapoint. This gives a '''double or nothing''' bootstrap<sup>[[#References|[10]]]</sup>.<br />
* If the mask is $(1, 1 \cdots 1)$, then this becomes an '''ensemble learning''' method.<br />
* $m_t[k]\sim\mbox{Poi}(1)$ (Poisson distribution)<br />
* $m_t[k]\sim\mbox{Exp}(1)$ (exponential distribution)<br />
<br />
For this paper's results, the authors used a $\mbox{Bernoulli}(p)$ distribution.<br />
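The masking distributions above can each be sampled in a line (illustrative sketch):<br />

```python
import numpy as np

rng = np.random.default_rng(3)
K = 5   # one mask entry per bootstrap head

bernoulli_mask = rng.binomial(1, 0.5, size=K)      # double-or-nothing bootstrap
ensemble_mask = np.ones(K, dtype=int)              # every head sees every datapoint
poisson_mask = rng.poisson(1.0, size=K)            # classic bootstrap in expectation
exponential_mask = rng.exponential(1.0, size=K)    # Bayesian-bootstrap-style weights

p = 1.0                                            # the paper's final choice
paper_mask = rng.binomial(1, p, size=K)            # Bernoulli(1) is all ones
```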
<br />
== Related Work ==<br />
<br />
The authors mention the method described in [[#References|[11]]]. The authors of [[#References|[11]]] talk about the principle of "optimism in the face of uncertainty" and modify the reward function to encourage state-action pairs that have not been seen often:<br />
<br />
$$<br />
R(s,a) \leftarrow R(s,a)+\beta\cdot\mbox{novelty}(s,a)<br />
$$<br />
<br />
According to the authors, [[#References|[11]]]'s DQN algorithm relies on a lot of hand tuning and is only suited to non-stochastic problems. The authors further compare their results to [[#References|[11]]]'s results on Atari.<br />
<br />
<br />
The authors also mention an existing algorithm, PSRL<sup>[[#References|[12,13]]]</sup>, or posterior sampling based RL. However, this algorithm requires solving the sampled MDP, which is not feasible for large systems. Bootstrapped DQN approximates this idea by sampling from approximate $Q^*$ functions.<br />
<br />
<br />
Further, the authors mention that the work in [[#References|[12,13]]] has been followed by RLSVI<sup>[[#References|[14]]]</sup>, which solves the problem for linear cases.<br />
<br />
== Deep Exploration: Why is Bootstrapped DQN so good at it? ==<br />
<br />
The authors consider a simple example to demonstrate the effectiveness of bootstrapped DQN at deep exploration.<br />
<br />
[[File:deep_exploration_example.png|thumb||center||700px|Source: this paper, section 5.1]]<br />
<br />
<br />
<br />
In this example, the agent starts at $s_2$. There are $N$ steps, and $N+9$ timesteps to generate the experience buffer. The agent is said to have learned the optimal policy if it achieves the best possible reward of $10$ (go to the rightmost state in $N-1$ timesteps, then stay there for $10$ timesteps), for at least $100$ such episodes. The results they got:<br />
<br />
[[File:deep_exploration_results.png|thumb||center||700px|Source: this paper, section 5.1]]<br />
<br />
<br />
<br />
The blue dots indicate when the agent learnt the optimal policy. If this took more than $2000$ episodes, they indicate it with a red dot. Thompson DQN is DQN with posterior sampling at every timestep. Ensemble DQN is the same as bootstrapped DQN except that every mask is $(1,1 \cdots 1)$. It is evident from the graphs that bootstrapped DQN achieves deep exploration better than these two methods and vanilla DQN.<br />
<br />
=== But why is it better? ===<br />
<br />
The authors say that this is because bootstrapped DQN constructs different approximations to the posterior over $Q^*$ from the same initial data. This diversity of approximations comes from the random initialization of the weights of the $Q^*_k$ heads. It means that the heads start out trying random actions (because of the diverse random initial $Q^*_k$), but when some head finds a good state and generalises to it, some (but not all) of the heads will learn from it, because of the bootstrapping. Eventually, the other heads will either find other good states, or end up learning the best states found by the other heads.<br />
<br />
<br />
So, the architecture explores well and once a head achieves the optimal policy, eventually, all heads achieve the policy.<br />
<br />
== Results ==<br />
<br />
The authors test their architecture on 49 Atari games. They mention that there has been recent work to improve the performance of DDQNs, but those are tweaks whose intentions are orthogonal to this paper's idea. So, they don't compare their results with them.<br />
<br />
=== Scale: What values of $K$, $p$ are best? ===<br />
<br />
[[File:scale_k_p.png|thumb||center||800px|Source: this paper, section 6.1]]<br />
<br />
Recall that $K$ is the number of bootstrap heads and $p$ is the parameter for the masking distribution (Bernoulli). The authors report that performance is close to its peak by around $K=10$ heads, so that value should be sufficient.<br />
<br />
<br />
$p$ also represents the amount of data sharing: a smaller $p$ means there is a smaller chance (under the Bernoulli distribution) that a given datapoint enters a head's bootstrapped dataset $D_i$, so the heads share fewer datapoints; at $p=1$, every head sees every datapoint. However, the value of $p$ doesn't seem to affect the rewards achieved over time. The authors give the following reasons for it:<br />
<br />
* The heads start with random weights for $Q^*$, so the targets (which use $Q^*$) turn out to be different. So the update rules are different.<br />
* Atari is deterministic.<br />
* Because of the initial diversity, the heads will learn differently even if they predict the same action for the given state.<br />
<br />
$p=1$ is the value they use finally: since performance is insensitive to $p$, sharing all datapoints across the heads avoids discarding experience and reduces training time.<br />
<br />
=== Performance on Atari ===<br />
<br />
In general, the results tell us that bootstrapped DQN achieves better results.<br />
<br />
[[File:atari_results_bootstrapped_dqn.png|thumb||center||800px|Source: this paper, section 6.2]]<br />
<br />
The authors plot the improvement they achieved with bootstrapped DQN across the games. They define the '''improvement''' to be $x$ if bootstrapped DQN achieves a better result than DQN using $\frac{1}{x}$ of the frames.<br />
<br />
[[File:bdqn_improvement.png|thumb||center||1000px|Source: this paper, section 6.2]]<br />
<br />
<br />
The authors note that bootstrapped DQN does not perform well on all Atari games. There are some challenging games where exploration is key and bootstrapped DQN does not do well enough (though it still does better than DQN), such as Frostbite and Montezuma’s Revenge. They say that even better exploration may help, but also point out that there may be other problems, such as network instability, reward clipping and temporally extended rewards.<br />
<br />
=== Improvement: Highest Score Reached & how fast is this high score reached? ===<br />
<br />
The authors plot the improvement graphs after 20m and 200m frames.<br />
<br />
[[File:cumulative_rewards_bdqn.png|thumb||center||700px|Source: this paper, section 6.3]]<br />
<br />
=== Visualisation of Results ===<br />
<br />
One of the authors' [https://www.youtube.com/playlist?list=PLdy8eRAW78uLDPNo1jRv8jdTx7aup1ujM youtube playlist] can be found online.<br />
<br />
<br />
The authors also point out that just purely using bootstrapped DQN as an exploitative strategy is pretty good by itself, better than vanilla DQN. This is because of the deep exploration capabilities of bootstrapped DQN, since it can use the best states it knows and also plan to try out states it doesn't have any information about. Even in the videos, it can be seen that the heads agree at all the crucial decisions, but stay diverse at other less important steps.<br />
<br />
== Critique ==<br />
<br />
It would be very interesting, and a great addition to the experimental section of the paper, if the authors had compared with the asynchronous methods for exploring the state space first introduced in [[#References|[15]]]. Unfortunately, the authors only compared their DQN with the original DQN and not with the other variations in the literature, justifying this by saying that their idea is "orthogonal" to those improvements.<br />
<br />
=== Different way to do exploration-exploitation? ===<br />
<br />
Instead of choosing the next action $a_t$ that maximises $Q^*_s$, they could have chosen different actions $a_i$ with probabilities<br />
<br />
$$<br />
\mathbb{P}(s_t,a_i) = \frac{Q^*_s(s_t,a_i)}{\displaystyle \sum_{j=1}^{|\mathbb{A}|} Q^*_s(s_t,a_j)}<br />
$$<br />
<br />
In my view, this is closer to Thompson Sampling.<br />
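A sketch of this suggested rule (note that it implicitly assumes non-negative Q-values, otherwise the ratios are not probabilities):<br />

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_action(q_row):
    """Pick action a_i with probability Q(s_t, a_i) / sum_j Q(s_t, a_j)."""
    q_row = np.asarray(q_row, dtype=float)
    assert (q_row >= 0).all(), "the rule needs non-negative Q-values"
    probs = q_row / q_row.sum()
    return rng.choice(len(q_row), p=probs)

counts = np.bincount([sample_action([1.0, 3.0, 6.0]) for _ in range(3000)],
                     minlength=3)
# Action 2 should be drawn roughly 60% of the time.
```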
<br />
=== Why use Bernoulli? ===<br />
<br />
The choice of a Bernoulli masking distribution eventually doesn't help them at all, since the algorithm performs well because of the initial diversity. Maybe some other masking distribution could be used? However, since the bootstrapping procedure is distribution-independent, the choice of masking distribution should not affect the long-term performance of Bootstrapped DQN.<br />
<br />
=== Unanswered Questions & Miscellaneous ===<br />
* Thompson DQN is not preferred because other randomized value functions can implement settings similar to Thompson sampling without the need for an intractable exact posterior update, while also working around Thompson Sampling's computational issue of resampling at every time step. Perhaps the authors could also have explored Temporal Difference learning, which combines Dynamic Programming and Monte Carlo methods.<br />
* The actual algorithm is hidden in the appendix. It could have been helpful if it were in the main paper.<br />
<br />
== References ==<br />
<br />
# [https://bandits.wikischolars.columbia.edu/file/view/Lecture+4.pdf Learning and optimization for sequential decision making, Columbia University, Lec 4]<br />
# [https://www.thoughtco.com/what-is-bootstrapping-in-statistics-3126172 Thoughtco, What is bootstrapping in statistics?]<br />
# [https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/MIT18_05S14_Reading24.pdf Bootstrap confidence intervals, Class 24, 18.05, MIT Open Courseware]<br />
# [https://arxiv.org/abs/1506.02142 Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. arXiv preprint arXiv:1506.02142, 2015.]<br />
# [https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf Mnih et al., Playing Atari with Deep Reinforcement Learning, 2015]<br />
# Long-Ji Lin. Reinforcement learning for robots using neural networks. Technical report, DTIC Document, 1993.<br />
# John N Tsitsiklis and Benjamin Van Roy. An analysis of temporal-difference learning with function approximation. Automatic Control, IEEE Transactions on, 42(5):674–690, 1997.<br />
# S. Thrun and A. Schwartz. Issues in using function approximation for reinforcement learning, 1993.<br />
# [https://arxiv.org/pdf/1509.06461.pdf Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning, 2015.]<br />
# [https://pdfs.semanticscholar.org/d623/c2cbf100d6963ba7dafe55158890d43c78b6.pdf Dean Eckles and Maurits Kaptein, Thompson Sampling with the Online Bootstrap, 2014, Pg 3]<br />
# [https://arxiv.org/abs/1507.00814 Bradly C. Stadie, Sergey Levine, Pieter Abbeel, Incentivizing Exploration In Reinforcement Learning With Deep Predictive Models, 2015.]<br />
# Ian Osband, Daniel Russo, and Benjamin Van Roy. (More) efficient reinforcement learning via posterior sampling, NIPS 2013.<br />
# Ian Osband and Benjamin Van Roy. Model-based reinforcement learning and the eluder dimension, NIPS 2014.<br />
# [https://arxiv.org/abs/1402.0635 Ian Osband, Benjamin Van Roy, Zheng Wen, Generalization and Exploration via Randomized Value Functions, 2014.]<br />
# Mnih, Volodymyr, et al. "Asynchronous methods for deep reinforcement learning." International Conference on Machine Learning. 2016.<br />
# George Konidaris, Sarah Osentoski, and Philip Thomas. 2011. Value function approximation in reinforcement learning using the fourier basis. In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence (AAAI'11). AAAI Press 380-385.<br />
<br />
<br />
Other helpful links (unsorted):<br />
* [http://pemami4911.github.io/paper-summaries/deep-rl/2016/08/16/Deep-exploration.html pemami4911.github.io]<br />
* [http://www.stat.yale.edu/~pollard/Courses/241.fall97/Poisson.pdf Poisson Approximations]<br />
<br />
== Appendix ==<br />
<br />
=== Algorithm for Bootstrapped DQN ===<br />
The appendix lists the following algorithm. Periodically, the replay buffer is played back to update the value-function network $Q$.<br />
<br />
[[File:alg1.PNG|thumb||left||700px|Source: this paper's appendix]]</div>
LightRNN: Memory and Computation-Efficient Recurrent Neural Networks (2017-11-16)
<hr />
<div>= Introduction =<br />
<br />
The study of natural language processing has been around for more than fifty years. It began in the 1950s, when natural language processing (NLP) was still embedded in the field of linguistics (Hirschberg & Manning, 2015). After the emergence of strong computational power, computational linguistics began to evolve and gradually branched out into various applications of NLP, such as text classification, speech recognition and question answering (Brownlee, 2017). Computational linguistics, or natural language processing, is usually defined as a “subfield of computer science concerned with using computational techniques to learn, understand, and produce human language content” (Hirschberg & Manning, 2015, p. 261). <br />
<br />
With the development of deep neural networks, one type of neural network, namely the recurrent neural network (RNN), has performed significantly well in many natural language processing tasks. One of the first examples applies RNNs to speech recognition (Mikolov et al., 2010). The reason is that an RNN takes past inputs into account as well as the current input; gated variants such as the LSTM further mitigate vanishing and exploding gradients. More detail on how RNNs work in the context of NLP is given in the section on NLP using RNNs. However, one limitation of RNNs used in NLP is the enormous size of the input vocabulary (i.e. the vocabulary is too large to compute over). This results in a very complex RNN model with too many parameters to train, making the training process both time- and memory-consuming. This serves as the major motivation for the authors to develop a new technique for RNNs that is particularly efficient at processing a large vocabulary in many NLP tasks, namely LightRNN. In this work, the authors propose LightRNN, which uses a two-component (2C) shared word embedding for word representations.<br />
<br />
= Motivations =<br />
<br />
In language modelling, researchers used to represent words by arbitrary codes, such as “Id143” for “dog” (“Vector Representations of Words,” 2017). Such coding is completely arbitrary: it loses the meaning of the word and, more importantly, its connection with other words. Nowadays, one-hot representations are commonly used, in which a word is represented by a vector of numbers whose dimension is related to the size of the vocabulary. In RNNs, all words in the vocabulary are coded using a one-hot representation and then mapped to an embedding vector (Li, Qin, Yang, Hu, & Liu, 2016). Such an embedding vector lives in “a continuous vector space where semantically similar words are mapped to nearby points” (“Vector Representations of Words,” 2017, para. 6). A popular RNN structure used in NLP tasks is the long short-term memory (LSTM) network. In order to predict the probability of the next word, the last hidden layer of the network needs to calculate the probability distribution over all other words in the vocabulary. Note that the most time-consuming operation in RNNs is calculating this probability distribution over all the words in the vocabulary, which requires multiplying the output-embedding matrix with the hidden state at each position of a sequence. Lastly, an activation function (commonly the softmax function) is used to select the next word with the highest probability. <br />
This method has 3 major limitations:<br />
<br />
* Memory Constraint <br />
:: When the input vocabulary contains an enormous number of unique words, which is very common in NLP tasks, the model becomes very large. The number of trainable parameters is then so big that it is difficult to fit such a model on a regular GPU device.<br />
<br />
* Computationally Heavy to Train<br />
:: As previously mentioned, the probability distribution over all words in the vocabulary must be computed to determine the predicted word. When the vocabulary is large, this calculation is computationally heavy. <br />
<br />
* Low Compressibility<br />
:: Because of the memory- and computation-consuming nature of RNNs applied to NLP tasks, mobile devices usually cannot run such an algorithm, which makes it undesirable and limits its usage.<br />
<br />
Previously, some works focused on reducing the computational complexity of the softmax layer. By building a hierarchical binary tree where each node stands for a word, the time complexity is reduced to $\log(|V|)$; however, the space complexity remains the same. In addition, some techniques, such as character-level convolutional filters, try to reduce the model size by shrinking the input-embedding matrix, but bring no improvement in terms of speed.<br />
<br />
An alternative approach to handling the overhead is to leverage weak lower-level learners via boosting. A drawback is that this technique has only been implemented for a few specific tasks in the past, such as time series prediction [Boné et al.].<br />
<br />
= LightRNN Structure =<br />
<br />
The authors of the paper propose a new structure that effectively reduces the size of the model by arranging all words in the vocabulary into a word table, referred to as “2-Component (2C) shared embedding for word representation”. This is done by factorizing a vocabulary's embedding into two shared components (row and column). Thus, a word is indexed by its location in the table, which in turn is characterized by the corresponding row and column components. Each row component and each column component is a unique row vector or column vector, respectively. By organizing every word in the vocabulary in this manner, multiple words can share the same row component or column component, which reduces the number of trainable parameters significantly. <br />
The next question is how to construct such a word table; more specifically, how to allocate each word in the vocabulary to a position so that semantically similar words end up in the same row or column. The authors propose a bootstrap method to solve this problem. Essentially, we first randomly distribute words into the table. Then, we let the model “learn” a better position for each word by minimizing the training error. By repeating this process, each word is allocated to a particular position within the table so that similar words share common row or column components. More details of these two parts of the LightRNN structure are discussed in the following sections.<br />
<br />
There are 2 major benefits of the proposed technique:<br />
<br />
* Computationally efficient<br />
<br />
:: The name “LightRNN” reflects the small model size and fast training speed. Because of these features of the new RNN architecture, it is possible to run such a model on a regular GPU and even on mobile devices. <br />
<br />
* Higher scalability <br />
<br />
:: The authors briefly explain that the algorithm is scalable because, if parallel computing is needed to train such a model, the difficulty of combining the smaller sub-models is low. <br />
<br />
<br />
== Part I: 2-Component Shared Embedding ==<br />
<br />
The key aspect of the LightRNN structure is its innovative method of word representation, namely 2-Component shared embedding. All words in the vocabulary are organized into a table with row components and column components. Each pair of one element from a row component and one from a column component corresponds to a unique word in the vocabulary. For instance, the <math>i^{th}</math> row and <math>j^{th}</math> column are the row and column indexes for <math>X_{ij}</math>. As shown in the following figure, <math>x_{1}</math> corresponds to the word “January”. In the 2C shared embedding table, it is indexed by 2 elements: <math>x^{r}_{1}</math> and <math>x^{c}_{1}</math>, where the subscript indicates which row component and column component this word belongs to. Ideally, words that share similar semantic features should be assigned to the same row or column. The shared embedding word table in Figure 1 serves as a good example: the words “one” and “January” are assigned to the same column, while the words “one” and “two” are allocated to the same row. <br />
<br />
[[File:2C shared embedding.png|700px|thumb|centre|Fig 1. 2-Component Shared Embedding for Word Representation]]<br />
<br />
The main advantage of using such a word representation is that it reduces the number of vectors needed for the input word embedding. For instance, if there are 25 unique words in the vocabulary, the number of vectors needed to represent all the words is 10, namely 5 row vectors and 5 column vectors; the shared embedding word table is then a 5 by 5 matrix. In general, the number of vectors needed to represent <math>|V|</math> words is <math>2\sqrt{|V|}</math>.<br />
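The word-to-cell bookkeeping can be sketched in a few lines (helper names are mine):<br />

```python
import math

def table_side(vocab_size):
    """Side length of the smallest square table holding the whole vocabulary."""
    return math.ceil(math.sqrt(vocab_size))

def word_to_cell(word_id, side):
    """Row and column indices of a word in the shared embedding table."""
    return divmod(word_id, side)

side = table_side(25)             # 25 words -> 5x5 table
vectors_needed = 2 * side         # 5 row + 5 column vectors = 10, instead of 25
row, col = word_to_cell(7, side)  # word 7 sits at row 1, column 2
```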
<br />
== Part II: How 2C Shared Embedding is Used in LightRNN ==<br />
<br />
After constructing such a word representation table, the 2-component shared embedding matrices are fed into the recurrent neural network. Figure 2 below shows a portion of the LightRNN structure (left) in comparison with a regular RNN (right). Compared to a regular RNN, where a single input <math>x_{t-1}</math> is fed into the network each time, the 2 elements of a single input <math>x_{t-1}</math>, namely <math>x^{r}_{t-1}</math> and <math>x^{c}_{t-1}</math>, are fed into LightRNN. With the 2-Component shared embedding, we can construct the LightRNN model by doubling the basic units of a vanilla RNN model. Let <math>n, m</math> denote the dimension of a row/column input vector and that of a hidden state vector, respectively. To compute the probability distribution of $w_t$, we need the column vector <math>x_{t-1}^c \in \mathbb{R}^n</math>, the row vector <math>x_t^r \in \mathbb{R}^n</math>, and the hidden<br />
state vector <math>h_{t-1}^r \in \mathbb{R}^m</math>.<br />
<br />
[[File:LightRNN.PNG |700px|thumb|centre|Fig 2. LightRNN Structure & Regular RNN]]<br />
<br />
As mentioned before, the last hidden layer will produce the probabilities of <math>word_{t}</math>. Based on the diagram below, the following formulas are used:<br />
Let $n$ be the dimension of a row/column input vector, and let <math>X^{c}, X^{r} \in \mathbb{R}^{n \times \sqrt{|V|}}</math> denote the input-embedding matrices: <br />
<center><br />
: row vector <math>x^{r}_{t-1} \in \mathbb{R}^n</math><br />
: column vector <math>x^{c}_{t-1} \in \mathbb{R}^n</math><br />
</center><br />
<br />
Let <math>h^{c}_{t-1}, h^{r}_{t-1} \in \mathbb{R}^m</math> denote the two hidden states, where m is the dimension of the hidden layer:<br />
<center><br />
: <math>h^{c}_{t-1} = f(W x_{t-1}^{c} + U h_{t-1}^{r} + b) </math><br />
: <math>h^{r}_{t} = f(W x_{t}^{r} + U h_{t-1}^{c} + b) </math><br />
</center><br />
where <math>W \in \mathbb{R}^{m \times n}</math>, <math>U \in \mathbb{R}^{m \times m}</math>, and <math>b \in \mathbb{R}^m</math> and <math>f</math> is a nonlinear activation function<br />
<br />
The final step in LightRNN is to calculate <math>P_{r}(w_{t})</math> and <math>P_{c}(w_{t})</math> , which means the probability of a word w at time t, using the following formulas:<br />
<center><br />
: <math>P_{r}(w_t) = \frac{exp(h_{t-1}^{c} y_{r(w)}^{r})}{\sum\nolimits_{i \in S_r} exp(h_{t-1}^{c} y_{i}^{r}) }</math><br />
: <math>P_{c}(w_t) = \frac{exp(h_{t}^{r} y_{c(w)}^{c})}{\sum\nolimits_{i \in S_c} exp(h_{t}^{r} y_{i}^{c}) }</math><br />
: <math> P(w_t) = P_{r}(w_t) P_{c}(w_t) </math> <br />
</center><br />
where <br />
<center><br />
:<math> r(w) </math> = row index of word w <br />
:<math> c(w) </math> = column index of word w<br />
:<math> y_{i}^{r} \in \mathbb{R}^m </math> = i-th vector of <math> Y^r \in \mathbb{R}^{m \times \sqrt{|V|}}</math> <br />
:<math> y_{i}^{c} \in \mathbb{R}^m </math> = i-th vector of <math> Y^c \in \mathbb{R}^{m \times \sqrt{|V|}}</math><br />
:<math> S_r </math> = the set of rows of the word table<br />
:<math> S_c </math> = the set of columns of the word table<br />
</center><br />
<br />
We can see that by using the above equations, we effectively reduce the computation of the probability of the next word from one $|V|$-way normalization (in standard RNN models) to two $\sqrt {|V|}$-way normalizations. Note that we don't see the t-th word before predicting it. So in the above diagram, given the input column vector <math>x^c_{t-1} </math> of the (t-1)-th word, we first infer the row probability <math>P_r(w_t)</math> of the t-th word, and then choose the row index with the largest probability in <math>P_r(w_t)</math> to look up the next input row vector <math>x^r_{t} </math>. Similarly, we can infer the column probability <math>P_c(w_t)</math> of the t-th word. <br />
<br />
Essentially, in LightRNN, the prediction of the word at time t (<math> w_t </math>) based on word at time t-1 (<math> w_{t-1} </math>) is achieved by selecting the index <math> r </math> and <math> c </math> with the highest probabilities <math> P_{r}(w_t) </math>, <math> P_{c}(w_t) </math>. Then, the probability of each word is computed based on the multiplication of <math> P_{r}(w_t) </math> and <math> P_{c}(w_t) </math>.<br />
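The two-softmax factorization can be sketched with made-up dimensions (matrix names follow the notation above; this is an illustrative sketch, not the authors' code):<br />

```python
import numpy as np

rng = np.random.default_rng(5)
side, m = 5, 16    # sqrt(|V|) rows/columns and hidden-state size (made-up numbers)

Y_r = rng.normal(size=(m, side))   # output row-embedding matrix Y^r
Y_c = rng.normal(size=(m, side))   # output column-embedding matrix Y^c

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

h_c_prev = rng.normal(size=m)      # h^c_{t-1}
h_r_t = rng.normal(size=m)         # h^r_t

p_row = softmax(h_c_prev @ Y_r)    # P_r(w_t): one sqrt(|V|)-way normalization
p_col = softmax(h_r_t @ Y_c)       # P_c(w_t): another sqrt(|V|)-way normalization

# P(w_t) for the word at row i, column j is the product of the two factors,
# so the full 5x5 table of word probabilities still sums to one.
p_word = np.outer(p_row, p_col)
```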
<br />
== Part III: Bootstrap for Word Allocation ==<br />
<br />
As mentioned before, the major innovative aspect of LightRNN is the development of the 2-component shared embedding. Such a structure can be used to build a recurrent neural network called LightRNN; however, the way the word table is constructed is the key to a successful LightRNN model. In this section, the procedure for constructing the 2C shared embedding structure is explained. <br />
The fundamental idea is a bootstrap procedure that minimizes a loss function (namely, the negative log-likelihood). The detailed procedure is as follows:<br />
<br />
Step 1: First, all words in a vocabulary are randomly assigned to individual position within the word table<br />
<br />
Step 2: Train LightRNN model based on word table produced in step 1 until certain criteria are met<br />
<br />
Step 3: Fixing the input and output embedding matrices trained in step 2, adjust the position of each word by minimizing the loss function over all the words. Then, repeat from step 2<br />
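The three steps above can be sketched as a training loop. `train_lightrnn` and `reallocate` are hypothetical callbacks standing in for the actual model training and the word-position assignment step:<br />

```python
def bootstrap_allocation(table, train_lightrnn, reallocate, n_rounds=3):
    """Sketch of the bootstrap procedure (hypothetical callbacks).

    table: initial random word table (Step 1, supplied by the caller)
    train_lightrnn(table): trains the model with the table fixed and
        returns the per-position losses l(w, i, j)      (Step 2)
    reallocate(losses): re-solves the word-position assignment and
        returns the refined table                        (Step 3)
    """
    for _ in range(n_rounds):
        losses = train_lightrnn(table)
        table = reallocate(losses)
    return table
```
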
<br />
Given a corpus with $T$ words, the authors write the overall training loss as a negative log-likelihood (NLL), which can be regrouped into per-word terms:<br />
<center><br />
<math> NLL = \sum\limits_{t=1}^T -logP(w_t) = \sum\limits_{t=1}^T -log[P_{r}(w_t) P_{c}(w_t)] = \sum\limits_{t=1}^T \big(-log[P_{r}(w_t)] - log[P_{c}(w_t)]\big) = \sum\limits_{w=1}^{|V|} NLL_w </math><br />
</center><br />
where <math> NLL_w </math> is the negative log-likelihood of the word w. <br />
<br />
Since in the 2-component shared embedding structure a word w is represented by one row vector and one column vector, <math> NLL_w </math> can be rewritten as <math> l(w, r(w), c(w)) </math>, where <math> r(w) </math> and <math> c(w) </math> are the position indices of word w in the word table. Next, the authors defined two more terms to decompose <math> NLL_w </math>: <math> l_r(w,r(w)) </math> and <math> l_c(w,c(w)) </math>, namely the row component and column component of <math> l(w, r(w), c(w)) </math>. The above can be summarised by the following formulas: <br />
<center><br />
<math> NLL_w = \sum\limits_{t \in S_w} -logP(w_t) = l(w, r(w), c(w)) </math> <br><br />
<math> = \sum\limits_{t \in S_w} -logP_r(w_t) + \sum\limits_{t \in S_w} -logP_c(w_t) = l_r(w,r(w)) + l_c(w,c(w))</math> <br><br />
<math> = \sum\limits_{t \in S_w} -log (\frac{exp(h_{t-1}^{c} y_{i}^{r})}{\sum\nolimits_{k} exp(h_{t-1}^{c} y_{k}^{r})}) + \sum\limits_{t \in S_w} -log (\frac{exp(h_{t}^{r} y_{j}^{c})}{\sum\nolimits_{k} exp(h_{t}^{r} y_{k}^{c}) }) </math> with <math> i = r(w), j = c(w) </math> <br> <br />
where <math> S_w </math> is the set of positions (time steps) at which the word w occurs in the corpus </center><br />
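As a toy numeric check of this decomposition (invented probabilities, not values from the paper), <math> NLL_w </math> is just the sum of the row-side and column-side negative log-probabilities over the occurrences of w:<br />

```python
import math

def word_nll(row_probs, col_probs):
    # row_probs[k], col_probs[k]: P_r(w_t) and P_c(w_t) at the k-th
    # occurrence t (in S_w) of the word w.
    l_r = sum(-math.log(p) for p in row_probs)  # l_r(w, r(w))
    l_c = sum(-math.log(p) for p in col_probs)  # l_c(w, c(w))
    return l_r + l_c  # NLL_w
```
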
In summary, the overall loss of moving word w to position [i, j] is the sum of its row loss and column loss at [i, j]: <math> l(w, i, j) = l_r(w, i) + l_c(w, j)</math>. Thus, to update the table by reallocating each word, we look for a position [i, j] for each word w that minimizes the total loss, mathematically written as the following:<br />
<center><br />
<math> \min\limits_{a} \sum\limits_{w,i,j} l(w,i,j)a(w,i,j) </math> such that <br><br />
<math> \sum\limits_{(i,j)} a(w,i,j) = 1 \space \forall w \in V, \sum\limits_{w} a(w,i,j) = 1 \space \forall i \in S_r, j \in S_c</math> <br><br />
<math> a(w,i,j) \in \{0,1\}, \forall w \in V, i \in S_r, j \in S_c</math> <br><br />
where <math> a(w,i,j) =1 </math> indicates moving word w to position [i, j]<br />
</center><br />
<br />
After calculating $l(w, i, j)$ for all possible $w, i, j$, the above optimization sets $a(w, i, j)$ to 1 at the position $[i, j]$ where $l(w, i, j)$ is smallest, subject to each position holding exactly one word, and to 0 elsewhere (i.e. finding the best place in the table for each word $w$). This minimization is a classical assignment problem, which can be solved in polynomial time, $O(|V|^3)$ (e.g. by the Hungarian algorithm), so word reallocation does not occupy much computational time.<br />
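A minimal sketch of the reallocation step; for brevity it uses a greedy approximation rather than the exact assignment solver, and assumes the loss tensor `loss[w][i][j]` has already been computed:<br />

```python
def reallocate_words(loss, n_rows, n_cols):
    # loss[w][i][j] = l(w, i, j); returns {word: (row, col)} with each
    # table cell used exactly once (greedy approximation, not the exact
    # O(|V|^3) assignment solution).
    V = len(loss)
    assert V == n_rows * n_cols, "one table cell per word"
    pairs = sorted(
        (loss[w][i][j], w, i, j)
        for w in range(V) for i in range(n_rows) for j in range(n_cols)
    )
    pos, used = {}, set()
    for _, w, i, j in pairs:
        if w not in pos and (i, j) not in used:
            pos[w] = (i, j)
            used.add((i, j))
    return pos
```
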
<br />
= LightRNN Example =<br />
<br />
After describing the theoretical background of the LightRNN algorithm, the authors applied this method to 2 datasets (the 2013 ACL Workshop Morphological Language Dataset (ACLW) & the One-Billion-Word Benchmark Dataset (BillionW)) and compared its performance with several other state-of-the-art RNN algorithms. The following table shows some summary statistics of those 2 datasets:<br />
<br />
[[File:Table1YH.PNG|700px|thumb|centre|Table 1. Summary Statistics of Datasets]]<br />
<br />
The goal of a probabilistic language model is either to compute the probability distribution of a sequence of given words (e.g. <math> P(W) = P(w_1, w_2, … , w_n)</math>) or to compute the probability of the next word given some previous words (e.g. <math> P(w_5 | w_1, w_2, w_3, w_4)</math>) (Jurafsky, 2017). In this paper, the evaluation metric for the performance of the LightRNN algorithm is perplexity <math> PPL </math>, which is defined as the following: <br />
<center><br />
<math> PPL = exp(\frac{NLL}{T})</math> <br><br />
where T = number of tokens in the test set<br />
</center><br />
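As a quick sanity check of this definition (toy numbers, not the paper's pipeline), a model that assigns every test token probability <math>1/|V|</math> has perplexity exactly <math>|V|</math>:<br />

```python
import math

def perplexity(token_neg_log_likelihoods):
    # PPL = exp(NLL / T), where NLL sums -log P(w_t) over the T test tokens.
    T = len(token_neg_log_likelihoods)
    return math.exp(sum(token_neg_log_likelihoods) / T)
```
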
<br />
Based on the mathematical definition of PPL, a well-performing model will have a lower perplexity. <br />
The authors then trained an “LSTM-based LightRNN using stochastic gradient descent with truncated backpropagation through time” (Li, Qin, Yang, Hu, & Liu, 2016). To begin with, the authors used the ACLW French dataset to determine the size of the embedding matrix. From the results shown in Table 2, a larger embedding size corresponds to better performance (lower perplexity). Therefore, they adopted an embedding size of 1000 for LightRNN on the ACLW datasets. <br />
<br />
[[File:Table2YH.PNG|700px|thumb|centre|Table 2. Testing PPL of LightRNN on ACLW-French dataset w.r.t. embedding size]]<br />
<br />
* Figure 3, from the official implementation GitHub repo, shows the training process of LightRNN on the ACLW-French dataset.<br />
[[File:ACLWFR.png|700px|thumb|centre|Figure 3. Training process on ACLW-French]]<br />
<br />
'''Advantage 1: small model size'''<br />
<br />
One of the major advantages of using LightRNN on NLP tasks is its significantly reduced model size, i.e. far fewer parameters to estimate. The authors compared LightRNN with two other RNN algorithms and a baseline language model with Kneser-Ney smoothing. The two RNN algorithms are: HSM, an LSTM RNN with hierarchical softmax for word prediction; and C-HSM, which uses both hierarchical softmax and character-level convolutional filters for input embedding. From the results table shown below, we can see that LightRNN achieves the lowest perplexity while keeping the model size significantly smaller than the other three algorithms. <br />
<br />
[[File:Table5YH.PNG|700px|thumb|centre|Table 3. PPL Results in test set on ACLW datasets]]<br />
Italic results are the previous state-of-the-art. #P denotes the number of parameters. <br />
<br />
'''Advantage 2: high training efficiency'''<br />
<br />
Another advantage of the LightRNN model is its shorter training time while maintaining the same level of perplexity as other RNN algorithms. Compared to both C-HSM and HSM (shown below in Table 4), LightRNN takes only half the runtime to achieve the same level of perplexity on both the ACLW and BillionW datasets. In the last column of Table 4, the amount of time used for word table reconstruction is presented as a percentage of the total runtime. As we can see, word reallocation takes up only a very small proportion of the total runtime. Moreover, the reconstructed word table is itself a valuable output, as explained in the next section. <br />
<br />
[[File:Table3YH.PNG|700px|thumb|centre|Table 4. Runtime comparisons in order to achieve the HSMs’ baseline PPL]]<br />
<br />
<br />
'''Advantage 3: semantically valid word allocation table'''<br />
<br />
As explained in the previous section, LightRNN uses a word allocation table that gets updated in every iteration of the algorithm. The optimal structure of the table should assign semantically similar words to the same row or column in order to reduce the number of parameters to estimate. Below is a snapshot of the reconstructed word table used in the LightRNN algorithm. Evidently, in row 887 all URL addresses are grouped together, and in row 872 all past-tense verbs are grouped together. As the authors explain, LightRNN does not embed each word independently but instead uses a shared embedding table. In this way, it reduces the model size by sharing embedding elements across the rows and columns of the table, and the learned allocation also improves the efficiency of the algorithm.<br />
<br />
[[File:Table6YH.PNG|700px|thumb|centre|Table 6. Sample Word Allocation Table]]<br />
<br />
= Remarks =<br />
<br />
In summary, the method proposed in this paper is mainly a new way of using word embeddings: a natural extension of a 1-layer word-embedding look-up table to a 2-layer look-up table. Words with similar semantic meanings are embedded using similar vectors. Those vectors are divided into row and column components, where similar words are grouped together by sharing row and column components in the word representation table. The bootstrap step is promising since it learns a good word allocation (similar to word clustering). There could be a large impact on various natural language applications. From a computational and application perspective, the paper makes two key contributions. <br />
<br />
# Reduction in size of word embedding matrix. <br />
# Reduction in computations of word probabilities. <br />
<br />
The proposed model makes no assumptions about the structure of the words, which makes it potentially useful outside of NLP. In contrast, character-based word embedding models also reduce the model size, but need access to the internal structure of words (i.e. their characters). These two points ensure that one does not need hierarchical softmax or Monte Carlo estimation of the model's training cost.<br />
This is essentially a dimensionality reduction: the row and column "semantic vectors" approximate the coded word. Because of this structural change of the input word embedding, the RNN must adapt by having both row and column components fed into the network; however, the fundamental structure of the RNN model does not change. Therefore, personally, I would say it is a new word-embedding technique rather than a new development in model construction. One major confusion I have when reading this paper is how the row and column components in the word allocation table are determined; the paper itself does not explain how they are constructed. <br />
<br />
<br />
Such shared word embedding technique is prevalently used in NLP. For instance, in language translation, similar words from different languages are grouped together so that the machine can translate sentences from one language to another. In Socher et al. (2013a), English and Chinese words are embedded in the same space so that we can find similar English (Chinese) words for Chinese (English) words. (Zou, Socher, Cer, & Manning, 2013). Word2vec is also a commonly used technique for word embedding, which uses a two-layer neural network to transform text into numeric vectors where similar words will have similar numeric values. The key feature of word2vec is that semantically similar words (which is now represented by numeric vectors) can be grouped together (“Word2vec,” n.d.; Bengio, Ducharme, & Vincent, 2001; Bengio, Ducharme, Vincent, & Jauvin, 2003).<br />
<br />
An interesting area of further exploration proposed by the authors is an extension of this method to k-component shared embeddings where k>2. Words probably share similar semantic meanings in more than two dimensions, and this extension could reduce network size even further. However, it could also further complicate the bootstrapping phase of training.<br />
<br />
Since no assumptions were made about the structure of the words, one could seek uses of this algorithm outside the context of natural language processing. <br />
<br />
Overall, the two-component embedding approach is interesting. However, the reported numbers on the one-billion-word benchmark are worse than the best results reported in Chelba et al. (2013). In addition, the authors do not report absolute run times, so we cannot tell how much training time is added by the table allocation optimizer.<br />
<br />
Code for LightRNN can be found on Github : <br />
<br />
Official Implementation(CNTK): https://github.com/Microsoft/CNTK/tree/master/Examples/Text/LightRNN<br />
<br />
Tensorflow : https://github.com/YisenWang/LightRNN-NIPS2016-Tensorflow_code<br />
<br />
= Reference =<br />
Bengio, Y., Ducharme, R., & Vincent, P. (2001). A Neural Probabilistic Language Model. In Journal of Machine Learning Research (Vol. 3, pp. 932–938). https://doi.org/10.1162/153244303322533223<br />
<br />
Bengio, Yoshua, Ducharme, R., Vincent, P., & Jauvin, C. (2003). A Neural Probabilistic Language Model. Journal of Machine Learning Research, 3(Feb), 1137–1155.<br />
<br />
Brownlee, J. (2017, September 20). 7 Applications of Deep Learning for Natural Language Processing. Retrieved October 27, 2017, from https://machinelearningmastery.com/applications-of-deep-learning-for-natural-language-processing/<br />
<br />
Hirschberg, J., & Manning, C. D. (2015). Advances in natural language processing. Science, 349(6245), 261–266. https://doi.org/10.1126/science.aaa8685<br />
<br />
Jurafsky, D. (2017, January). Language Modeling Introduction to N grams. Presented at the CS 124: From Languages to Information, Stanford University. Retrieved from https://web.stanford.edu/class/cs124/lec/languagemodeling.pdf<br />
<br />
Li, X., Qin, T., Yang, J., Hu, X., & Liu, T. (2016). LightRNN: Memory and Computation-Efficient Recurrent Neural Networks. Advances in Neural Information Processing Systems 29, 4385–4393.<br />
<br />
Recurrent Neural Networks. (n.d.). Retrieved October 8, 2017, from https://www.tensorflow.org/tutorials/recurrent<br />
<br />
Vector Representations of Words. (2017, August 17). Retrieved October 8, 2017, from https://www.tensorflow.org/tutorials/word2vec<br />
<br />
Word2vec. (n.d.). Retrieved October 26, 2017, from https://deeplearning4j.org/word2vec.html<br />
<br />
Zou, W. Y., Socher, R., Cer, D., & Manning, C. D. (2013). Bilingual word embeddings for phrase-based machine translation, 1393–1398.<br />
<br />
Kneser Ney Smoothing - : https://en.wikipedia.org/wiki/Kneser%E2%80%93Ney_smoothing & http://www.foldl.me/2014/kneser-ney-smoothing/<br />
<br />
Boné R., Assaad M., Crucianu M. (2003) Boosting Recurrent Neural Networks for Time Series Prediction. In: Pearson D.W., Steele N.C., Albrecht R.F. (eds) Artificial Neural Nets and Genetic Algorithms. Springer, Vienna<br />
<br />
Mikolov T., Karafiat M., Burget L., Cernocky J. H., Khudanpur S. Recurrent neural network based language model. Interspeech 2010.<br />
<br />
Chelba et al 2013, One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling.</div>Jdenghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=LightRNN:_Memory_and_Computation-Efficient_Recurrent_Neural_Networks&diff=30619LightRNN: Memory and Computation-Efficient Recurrent Neural Networks2017-11-16T18:46:27Z<p>Jdeng: /* Part III: Bootstrap for Word Allocation */</p>
<hr />
<div>= Introduction =<br />
<br />
The study of natural language processing has been around for more than fifty years. It began in the 1950s, when the field of natural language processing (NLP) was still embedded in the subject of linguistics (Hirschberg & Manning, 2015). After the emergence of strong computational power, computational linguistics began to evolve and gradually branched out to various applications in NLP, such as text classification, speech recognition and question answering (Brownlee, 2017). Computational linguistics, or natural language processing, is usually defined as a “subfield of computer science concerned with using computational techniques to learn, understand, and produce human language content” (Hirschberg & Manning, 2015, p. 261). <br />
<br />
With the development of deep neural networks, one type of neural network, namely the recurrent neural network (RNN), has performed significantly well in many natural language processing tasks. One of the first examples applies RNNs to speech recognition (Mikolov et al. 2010). The reason is that an RNN takes into account the past inputs as well as the current input; gated variants such as the LSTM further mitigate vanishing and exploding gradients. More detail on how RNNs work in the context of NLP is discussed in the section on NLP using RNNs. However, one limitation of RNNs used in NLP is the enormous size of the input vocabulary (i.e. the vocabulary is too large to compute over). This results in a very complex RNN model with too many parameters to train and makes the training process both time- and memory-consuming. This serves as the major motivation for the authors to develop a new technique for RNNs that is particularly efficient at processing a large vocabulary in many NLP tasks. In this work, the authors propose "LightRNN", which uses a two-component (2C) shared word embedding for word representation.<br />
<br />
= Motivations =<br />
<br />
In language modelling, researchers used to represent words by arbitrary codes, such as “Id143” for “dog” (“Vector Representations of Words,” 2017). Such coding of words is completely arbitrary: it loses the meaning of the words and, more importantly, their connections with other words. Nowadays, a one-hot representation of words is commonly used, in which a word is represented by a vector whose dimension equals the size of the vocabulary. In an RNN, all words in the vocabulary are coded using the one-hot representation and then mapped to an embedding vector (Li, Qin, Yang, Hu, & Liu, 2016). Such an embedding lives in “a continuous vector space where semantically similar words are mapped to nearby points” (“Vector Representations of Words” 2017, para. 6). A popular RNN structure used in NLP tasks is the long short-term memory (LSTM) network. In order to predict the probability of the next word, the last hidden layer of the network needs to calculate the probability distribution over all words in the vocabulary. Note that the most time-consuming operation in RNNs is calculating this probability distribution, which requires multiplying the output-embedding matrix with the hidden state at each position of a sequence. Lastly, an activation function (commonly the softmax function) is used to select the next word with the highest probability. <br />
This method has 3 major limitations:<br />
<br />
* Memory Constraint <br />
:: When the input vocabulary contains an enormous number of unique words, which is very common in various NLP tasks, the size of the model becomes very large. The number of trainable parameters is then so large that it is difficult to fit such a model on a regular GPU device.<br />
<br />
* Computationally Heavy to Train<br />
:: As previously mentioned, the probability distribution over all words in the vocabulary needs to be computed to determine the predicted word. When the size of the vocabulary is large, this calculation is computationally heavy. <br />
<br />
* Low Compressibility<br />
:: Due to the memory- and computation-consuming nature of RNNs applied to NLP tasks, mobile devices usually cannot handle such algorithms, which limits their usage.<br />
<br />
Previous work has focused on reducing the computational complexity of the softmax layer. By building a hierarchical binary tree where each leaf stands for a word, the time complexity is reduced to $\log(|V|)$; however, the space complexity remains the same. In addition, some techniques, such as character-level convolutional filters, reduce the model size by shrinking the input-embedding matrix, but bring no improvement in speed.<br />
<br />
An alternative approach to handling the overhead is to leverage weak lower-level learners via boosting. A drawback is that this technique has only been implemented for a few specific tasks in the past, such as time-series prediction (Boné et al., 2003).<br />
<br />
= LightRNN Structure =<br />
<br />
The authors of the paper propose a new structure that effectively reduces the size of the model by arranging all words in the vocabulary into a word table, referred to as “2-Component (2C) shared embedding for word representation”. This is done by factorizing the vocabulary's embedding into two shared components (row and column). Thus, a word is indexed by its location in the table, which in turn is characterized by the corresponding row and column components. Each row component and each column component is a unique vector. By organizing the vocabulary in this manner, multiple words can share the same row or column component, which reduces the number of trainable parameters significantly. <br />
The next question is how to construct such a word table: more specifically, how to allocate each word in the vocabulary to a position so that semantically similar words share a row or column. The authors propose a bootstrap method to solve this problem. Essentially, words are first randomly distributed in the table. Then, the model “learns” a better position for each word by minimizing the training error. By repeating this process, each word is allocated to a position in the table such that similar words share common row or column components. More details of these two parts of the LightRNN structure are discussed in the following sections.<br />
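The random initialization of the word table (the starting point of the bootstrap) can be sketched as follows, assuming for simplicity that <math>|V|</math> is a perfect square:<br />

```python
import math
import random

def random_word_table(vocab, seed=0):
    # Shuffle the vocabulary into a sqrt|V| x sqrt|V| table and return both
    # the table and each word's (row, col) position.
    n = math.isqrt(len(vocab))
    assert n * n == len(vocab), "assume |V| is a perfect square"
    words = list(vocab)
    random.Random(seed).shuffle(words)
    table = [words[i * n:(i + 1) * n] for i in range(n)]
    pos = {w: (i, j) for i, row in enumerate(table) for j, w in enumerate(row)}
    return table, pos
```
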
<br />
There are 2 major benefits of the proposed technique:<br />
<br />
* Computationally efficient<br />
<br />
:: The name “LightRNN” illustrates the small model size and fast training speed. Because of these features of the new RNN architecture, it is possible to run such a model on a regular GPU and even on mobile devices. <br />
<br />
* Higher scalability <br />
<br />
:: The authors briefly explain that the algorithm is scalable: if parallel computing is needed to train the model, the difficulty of combining the smaller sub-models is low. <br />
<br />
<br />
== Part I: 2-Component Shared Embedding ==<br />
<br />
The key aspect of the LightRNN structure is its innovative method of word representation, namely 2-component shared embedding. All words in the vocabulary are organized into a table with row components and column components. Each pair of a row element and a column element corresponds to a unique word in the vocabulary; for instance, the <math>i^{th}</math> row and <math>j^{th}</math> column are the row and column indexes for <math>X_{ij}</math>. As shown in the following graph, the word “January” is indexed by 2 elements: <math>x^{r}_{1}</math> and <math>x^{c}_{1}</math>, where the subscripts indicate which row component and column component the word belongs to. Ideally, words that share similar semantic features should be assigned to the same row or column. The shared embedding word table in Figure 1 serves as a good example: the words “one” and “January” are assigned to the same column, while the words “one” and “two” are allocated to the same row. <br />
<br />
[[File:2C shared embedding.png|700px|thumb|centre|Fig 1. 2-Component Shared Embedding for Word Representation]]<br />
<br />
The main advantage of this word representation is that it reduces the number of embedding vectors needed. For instance, if there are 25 unique words in the vocabulary, only 10 vectors are needed to represent all of them: 5 row vectors and 5 column vectors, so the shared embedding word table is a 5 by 5 matrix. In general, the number of vectors needed to represent <math>|V|</math> words is <math>2\sqrt{|V|}</math>.<br />
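The savings implied by this formula are easy to quantify with a back-of-the-envelope helper (illustrative sizes, not the paper's exact configurations):<br />

```python
import math

def embedding_vector_counts(vocab_size):
    # A standard embedding needs one vector per word; the 2C shared
    # embedding needs only sqrt|V| row vectors plus sqrt|V| column vectors.
    standard = vocab_size
    shared = 2 * math.isqrt(vocab_size)
    return standard, shared
```

For the 25-word toy vocabulary above this gives 25 vs. 10 vectors; for a ten-million-word vocabulary, 10,000,000 vs. roughly 6,300.<br />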
<br />
== Part II: How 2C Shared Embedding is Used in LightRNN ==<br />
<br />
After constructing the word representation table, the 2-component shared embeddings are fed into the recurrent neural network. Figure 2 shows a portion of the LightRNN structure (left) in comparison with a regular RNN (right). Compared to a regular RNN, where a single input <math>x_{t-1}</math> is fed into the network at each step, two elements of a single input <math>x_{t-1}</math>, namely <math>x^{r}_{t-1}</math> and <math>x^{c}_{t-1}</math>, are fed into LightRNN. With the 2-component shared embedding, the LightRNN model is constructed by doubling the basic units of a vanilla RNN model. Let <math>n, m</math> denote the dimension of a row/column input vector and of a hidden-state vector, respectively. To compute the probability distribution of $w_t$, we need the column vector <math> x_{t-1}^c \in \mathbb{R}^n</math>, the row vector <math>x_t^r \in \mathbb{R}^n</math>, and the hidden-state vector <math>h_{t-1}^r \in \mathbb{R}^m</math>.<br />
<br />
[[File:LightRNN.PNG |700px|thumb|centre|Fig 2. LightRNN Structure & Regular RNN]]<br />
<br />
As mentioned before, the last hidden layer produces the probabilities of <math>w_{t}</math>. Based on the diagram below, the following formulas are used. Let $n$ be the dimension of a row or column input vector, and let <math>X^{c}, X^{r} \in \mathbb{R}^{n \times \sqrt{|V|}}</math> denote the input-embedding matrices: <br />
<center><br />
: row vector <math>x^{r}_{t-1} \in \mathbb{R}^n</math><br />
: column vector <math>x^{c}_{t-1} \in \mathbb{R}^n</math><br />
</center><br />
<br />
Let <math>h^{c}_{t-1}, h^{r}_{t} \in \mathbb{R}^m</math> denote the two hidden states, where m is the dimension of the hidden layer:<br />
<center><br />
: <math>h^{c}_{t-1} = f(W x_{t-1}^{c} + U h_{t-1}^{r} + b) </math><br />
: <math>h^{r}_{t} = f(W x_{t}^{r} + U h_{t-1}^{c} + b) </math><br />
</center><br />
where <math>W \in \mathbb{R}^{m \times n}</math>, <math>U \in \mathbb{R}^{m \times m}</math>, <math>b \in \mathbb{R}^m</math>, and <math>f</math> is a nonlinear activation function<br />
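A minimal NumPy sketch of these two coupled hidden-state updates (shapes follow the definitions above; the weights would come from training and are placeholders here):<br />

```python
import numpy as np

def light_rnn_step(x_col_prev, x_row_t, h_row_prev, W, U, b, f=np.tanh):
    # h^c_{t-1} = f(W x^c_{t-1} + U h^r_{t-1} + b)
    # h^r_t     = f(W x^r_t     + U h^c_{t-1} + b)
    h_col_prev = f(W @ x_col_prev + U @ h_row_prev + b)
    h_row_t = f(W @ x_row_t + U @ h_col_prev + b)
    return h_col_prev, h_row_t
```
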
<br />
The final step in LightRNN is to calculate <math>P_{r}(w_{t})</math> and <math>P_{c}(w_{t})</math>, the row and column probabilities of the word at time t, using the following formulas:<br />
<center><br />
: <math>P_{r}(w_t) = \frac{exp(h_{t-1}^{c} y_{r(w)}^{r})}{\sum\nolimits_{i \in S_r} exp(h_{t-1}^{c} y_{i}^{r}) }</math><br />
: <math>P_{c}(w_t) = \frac{exp(h_{t}^{r} y_{c(w)}^{c})}{\sum\nolimits_{i \in S_c} exp(h_{t}^{r} y_{i}^{c}) }</math><br />
: <math> P(w_t) = P_{r}(w_t) P_{c}(w_t) </math> <br />
</center><br />
where <br />
<center><br />
:<math> r(w) </math> = row index of word w <br />
:<math> c(w) </math> = column index of word w<br />
:<math> y_{i}^{r} \in \mathbb{R}^m </math> = i-th vector of <math> Y^r \in \mathbb{R}^{m \times \sqrt{|V|}}</math> <br />
:<math> y_{i}^{c} \in \mathbb{R}^m </math> = i-th vector of <math> Y^c \in \mathbb{R}^{m \times \sqrt{|V|}}</math><br />
:<math> S_r </math> = the set of rows of the word table<br />
:<math> S_c </math> = the set of columns of the word table<br />
</center><br />
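Putting the two softmaxes together, a toy NumPy sketch of the factorized distribution (random placeholder weights; the full <math>|V|</math>-way distribution is the outer product of the two <math>\sqrt{|V|}</math>-way distributions):<br />

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def next_word_table(h_col_prev, h_row_t, Yr, Yc):
    # P_r over rows from h^c_{t-1} and Y^r; P_c over columns from h^r_t
    # and Y^c; P(w) = P_r(r(w)) * P_c(c(w)) for the word at cell (r, c).
    p_row = softmax(h_col_prev @ Yr)  # sqrt|V|-way normalization
    p_col = softmax(h_row_t @ Yc)     # sqrt|V|-way normalization
    return np.outer(p_row, p_col)     # (sqrt|V|, sqrt|V|) grid of P(w)
```
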
<br />
We can see that by using above equation, we effectively reduce the computation of the probability of the next word from a $|V|$-way normalization (in standard RNN models) to two $\sqrt {|V|}$-way normalizations. Note that we don't see the t-th word before predicting it. So in the above diagram, given the input column vector <math>x^c_{t-1} </math> of the (t-1)-th word, we first infer the row probability <math>P_r(w_t)</math> of the t-th word, and then choose the index of the row the largest probability in <math>P_r(w_t)</math> to look up the next input row vector <math>x^r_{t} </math>. Similarly, we can infer the column probability <math>P_c(w_t)</math> of the t-th word. <br />
<br />
Essentially, in LightRNN, the prediction of the word at time t (<math> w_t </math>) based on word at time t-1 (<math> w_{t-1} </math>) is achieved by selecting the index <math> r </math> and <math> c </math> with the highest probabilities <math> P_{r}(w_t) </math>, <math> P_{c}(w_t) </math>. Then, the probability of each word is computed based on the multiplication of <math> P_{r}(w_t) </math> and <math> P_{c}(w_t) </math>.<br />
<br />
== Part III: Bootstrap for Word Allocation ==<br />
<br />
As mentioned before, the major innovative aspect of LightRNN is the development of 2-component shared embedding. Such structure can be used in building a recurrent neural network called LightRNN. However, how such word table representation is constructed is the key part of building a successful LightRNN model. In this section, the procedures of constructing 2C shared embedding structure is explained. <br />
The fundamental idea is using bootstrap method by minimizing a loss function (namely, negative log-likelihood function). The detailed procedures are described as the following:<br />
<br />
Step 1: First, all words in a vocabulary are randomly assigned to individual position within the word table<br />
<br />
Step 2: Train LightRNN model based on word table produced in step 1 until certain criteria are met<br />
<br />
Step 3: By fixing the training results of input and output embedding matrices (W & U) from step 2, adjust the position of words by minimizing the loss function over all the words. Then, repeat from step 2<br />
<br />
Given a context with $T$ words, the authors presented the overall loss function for word w moving to position [i, j] using a negative log-likelihood function (NLL) as the following:<br />
<center><br />
<math> NLL = \sum\limits_{t=1}^T -logP(w_t) = \sum\limits_{t=1}^T -log[P_{r}(w_t) P_{c}(w_t)] = \sum\limits_{t=1}^T -log[P_{r}(w_t)] – log[P_{c}(w_t)] = \sum\limits_{w=1}^{|V|} NLL_w </math><br />
</center><br />
where <math> NLL_w </math> is the negative log-likelihood of a word w. <br />
<br />
Since in 2-component shared embedding structure, a word (w) is represented by one row vector and one column vector, <math> NLL_w </math> can be rewritten as <math> l(w, r(w), c(w)) </math> where <math> r(w) </math> and <math> c(w) </math> are the position index of word w in the word table. Next, the authors defined 2 more terms to explain the meaning of <math> NLL_w </math>: <math> l_r(w,r(w)) </math> and <math> l_c(w,c(w)) </math>, namely the row component and column component of <math> l(w, r(w), c(w)) </math>. The above can be summarised by the following formulas: <br />
<center><br />
<math> NLL_w = \sum\limits_{t \in S_w} -logP(w_t) = l(w, r(w), c(w)) </math> <br><br />
<math> = \sum\limits_{t \in S_w} -logP_r(w_t) + \sum\limits_{t \in S_w} -logP_c(w_t) = l_r(w,r(w)) + l_c(w,c(w))</math> <br><br />
<math> = \sum\limits_{t \in S_w} -log (\frac{exp(h_{t-1}^{c} y_{i}^{r})}{\sum\nolimits_{k} exp(h_{t-1}^{c} y_{i}^{k})}) + \sum\limits_{t \in S_w} -log (\frac{exp(h_{t}^{r} y_{j}^{c})}{\sum\nolimits_{k} exp(h_{t}^{r} y_{k}^{c}) }) </math> <br> <br />
where <math> S_w </math> is the set of all possible positions for the word w in the corpus </center><br />
In summary, the overall loss function for word w to move to position [i, j] is the sum of its row loss and column loss of moving to position [i, j]. Therefore, total loss of moving to position [i, j] <math> l(w, i, j) = l_r(w, i) + l_c(w, j)</math>. Thus, to update the table by reallocating each word, we are looking for position [i, j] for each word w that minimize the total loss function, mathematically written as for the following:<br />
<center><br />
<math> \min\limits_{a} \sum\limits_{w,i,j} l(w,i,j)a(w,i,j) </math> such that <br><br />
<math> \sum\limits_{(i,j)} a(w,i,j) = 1 \space \forall w \in V, \sum\limits_{(w)} a(w,i,j) = 1 \space \forall i \in S_r, j \in S_j</math> <br><br />
<math> a(w,i,j) \in {0,1}, \forall w \in V, i \in S_r, j \in S_j</math> <br><br />
where <math> a(w,i,j) =1 </math> indicates moving word w to position [i, j]<br />
</center><br />
<br />
After calculating $l(w, i, j)$ for all possible $w, i, j$, the above optimization leads forcing $a(w, i, j)$ to be equal to 1 for $i, j$ in which $l(w, i, j)$ is minimum and 0 elsewhere (i.e. finding the best place for the word $w$ in the table). This minimization problem is an classical assignment problem, which can be solved in polynomial time $O(|V|^3)$.<br />
<br />
= LightRNN Example =<br />
<br />
After describing the theoretical background of the LightRNN algorithm, the authors applied the method to 2 datasets (the 2013 ACL Workshop Morphological Language Dataset (ACLW) and the One-Billion-Word Benchmark Dataset (BillionW)) and compared its performance with several other state-of-the-art RNN algorithms. The following table shows some summary statistics of the 2 datasets:<br />
<br />
[[File:Table1YH.PNG|700px|thumb|centre|Table 1. Summary Statistics of Datasets]]<br />
<br />
The goal of a probabilistic language model is either to compute the probability distribution of a sequence of given words (e.g. <math> P(W) = P(w_1, w_2, … , w_n)</math>) or to compute the probability of the next word given some previous words (e.g. <math> P(w_5 | w_1, w_2, w_3, w_4)</math>) (Jurafsky, 2017). In this paper, the evaluation metric for the performance of the LightRNN algorithm is perplexity <math> PPL </math>, which is defined as the following: <br />
<center><br />
<math> PPL = exp(\frac{NLL}{T})</math> <br><br />
where T = number of tokens in the test set<br />
</center><br />
<br />
Based on the mathematical definition of PPL, a well-performing model will have a lower perplexity. <br />
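As a sanity check on this definition: a model assigning uniform probability 1/N to every token has per-token NLL of log N, so its perplexity is exactly N. A minimal sketch:<br />

```python
import math

def perplexity(total_nll, num_tokens):
    """PPL = exp(NLL / T): exponentiated average per-token negative log-likelihood."""
    return math.exp(total_nll / num_tokens)

# a uniform model over a 10-word vocabulary: NLL per token is log(10)
T = 5
nll = T * math.log(10)
print(perplexity(nll, T))  # 10.0 (up to floating-point rounding)
```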
The authors then trained an “LSTM-based LightRNN using stochastic gradient descent with truncated backpropagation through time” (Li, Qin, Yang, Hu, & Liu, 2016). To begin with, the authors used the ACLW French dataset to determine the size of the embedding matrix. From the results shown in Table 2, a larger embedding size corresponds to a lower (better) perplexity. Therefore, they adopted an embedding size of 1000 for LightRNN on the ACLW datasets. <br />
<br />
[[File:Table2YH.PNG|700px|thumb|centre|Table 2. Testing PPL of LightRNN on ACLW-French dataset w.r.t. embedding size]]<br />
<br />
* In the official implementation GitHub repo, Figure 3 shows the training process of LightRNN on the ACLW-French dataset.<br />
[[File:ACLWFR.png|700px|thumb|centre|Figure 3. Training process on ACLW-French]]<br />
<br />
'''Advantage 1: small model size'''<br />
<br />
One of the major advantages of using LightRNN on NLP tasks is its significantly reduced model size, i.e. a smaller number of parameters to estimate. The authors compared LightRNN with two other RNN algorithms and a baseline language model with Kneser-Ney smoothing. The two RNN algorithms are: HSM, which uses an LSTM RNN with hierarchical softmax for word prediction; and C-HSM, which uses both hierarchical softmax and character-level convolutional filters for input embedding. From the results table shown below, we can see that LightRNN has the lowest perplexity while keeping the model size significantly smaller than the other three algorithms. <br />
<br />
[[File:Table5YH.PNG|700px|thumb|centre|Table 3. PPL Results in test set on ACLW datasets]]<br />
Italic results are the previous state-of-the-art. #P denotes the number of parameters. <br />
<br />
'''Advantage 2: high training efficiency'''<br />
<br />
Another advantage of the LightRNN model is its shorter training time while maintaining the same level of perplexity as other RNN algorithms. Compared to both C-HSM and HSM (shown below in Table 4), LightRNN takes only half the runtime to achieve the same level of perplexity on both the ACLW and BillionW datasets. In the last column of Table 4, the amount of time used for word table reconstruction is presented as a percentage of the total runtime. As we can see, word reallocation takes up only a very small proportion of the total runtime. However, the resulting reconstructed word table is a valuable output in its own right, which is further explained in the next section. <br />
<br />
[[File:Table3YH.PNG|700px|thumb|centre|Table 4. Runtime comparisons in order to achieve the HSMs’ baseline PPL]]<br />
<br />
<br />
'''Advantage 3: semantically valid word allocation table'''<br />
<br />
As explained in the previous section, LightRNN uses a word allocation table that gets updated in every iteration of the algorithm. The optimal structure of the table should assign semantically similar words to the same row or column in order to reduce the number of parameters to estimate. Below is a snapshot of the reconstructed word table used in the LightRNN algorithm. Evidently, in row 887 all URL addresses are grouped together, and in row 872 all verbs in past tense are grouped together. As the authors explain in the paper, LightRNN doesn’t assume independence of each word but instead uses a shared embedding table. In this way, it reduces the model size by sharing embedding elements across the table/matrix, and also uses this structure to improve the efficiency of the algorithm.<br />
<br />
[[File:Table6YH.PNG|700px|thumb|centre|Table 6. Sample Word Allocation Table]]<br />
<br />
= Remarks =<br />
<br />
In summary, the method proposed in this paper is mainly a new way of using word embeddings: a natural extension of a 1-layer word embedding look-up table to a 2-layer look-up table. Words with similar semantic meanings are embedded using similar vectors. Those vectors are then divided into row and column components, where similar words are grouped together by having shared row and column components in the word representation table. The bootstrap step is promising since it learns a good word allocation (similar to word clustering). There could be a large impact on various natural language applications. From a computational and application perspective, the paper makes two key contributions. <br />
<br />
# Reduction in size of word embedding matrix. <br />
# Reduction in computations of word probabilities. <br />
<br />
The proposed model makes no assumptions about the structure of the words, which makes it potentially useful outside of NLP. In contrast, character-based word embedding models also reduce the model size, but do need access to the internal structure of words (i.e. their characters). These two points ensure that one does not need hierarchical softmax or Monte Carlo estimation of the model's training cost.<br />
This is indeed a dimensionality reduction, i.e. using the row and column "semantic vectors" to approximate the coded word. Because of this structural change of the input word embedding, the RNN model needs to adapt by having both row and column components fed into the network. However, the fundamental structure of the RNN model does not change. Therefore, personally, I would say it’s a new word embedding technique rather than a new development in model construction. One major confusion I had when reading this paper is how the row and column components in the word allocation table are determined; the authors didn’t explain how they are constructed. <br />
<br />
<br />
Such shared word embedding techniques are prevalent in NLP. For instance, in language translation, similar words from different languages are grouped together so that the machine can translate sentences from one language to another. In Zou, Socher, Cer, and Manning (2013), English and Chinese words are embedded in the same space so that one can find similar English (Chinese) words for Chinese (English) words. Word2vec is also a commonly used technique for word embedding, which uses a two-layer neural network to transform text into numeric vectors where similar words have similar numeric values. The key feature of word2vec is that semantically similar words (now represented by numeric vectors) are grouped together (“Word2vec,” n.d.; Bengio, Ducharme, & Vincent, 2001; Bengio, Ducharme, Vincent, & Jauvin, 2003).<br />
<br />
An interesting area of further exploration proposed by the authors is an extension of this method to k-component shared embeddings where k>2. Words probably share similar semantic meanings in more than two dimensions, and this extension could reduce network size even further. However, it could also further complicate the bootstrapping phase of training.<br />
<br />
Since no assumptions were made about the structure of the words, one could seek uses of this algorithm outside the context of natural language processing. <br />
<br />
Overall, the two-component embedding approach is interesting. However, the reported numbers on the one-billion-word benchmark are worse than the best results reported in Chelba et al. (2013). In addition, the authors don't report full run times, so we can't determine how much additional training time is added by the table allocation optimizer.<br />
<br />
Code for LightRNN can be found on Github : <br />
<br />
Official Implementation(CNTK): https://github.com/Microsoft/CNTK/tree/master/Examples/Text/LightRNN<br />
<br />
Tensorflow : https://github.com/YisenWang/LightRNN-NIPS2016-Tensorflow_code<br />
<br />
= References =<br />
Bengio, Y, Ducharme, R., & Vincent, P. (2001). A Neural Probabilistic Language Model. In Journal of Machine Learning Research (Vol. 3, pp. 932–938). https://doi.org/10.1162/153244303322533223<br />
<br />
Bengio, Yoshua, Ducharme, R., Vincent, P., & Jauvin, C. (2003). A Neural Probabilistic Language Model. Journal of Machine Learning Research, 3(Feb), 1137–1155.<br />
<br />
Brownlee, J. (2017, September 20). 7 Applications of Deep Learning for Natural Language Processing. Retrieved October 27, 2017, from https://machinelearningmastery.com/applications-of-deep-learning-for-natural-language-processing/<br />
<br />
Hirschberg, J., & Manning, C. D. (2015). Advances in natural language processing. Science, 349(6245), 261–266. https://doi.org/10.1126/science.aaa8685<br />
<br />
Jurafsky, D. (2017, January). Language Modeling Introduction to N grams. Presented at the CS 124: From Languages to Information, Stanford University. Retrieved from https://web.stanford.edu/class/cs124/lec/languagemodeling.pdf<br />
<br />
Li, X., Qin, T., Yang, J., Hu, X., & Liu, T. (2016). LightRNN: Memory and Computation-Efficient Recurrent Neural Networks. Advances in Neural Information Processing Systems 29, 4385–4393.<br />
<br />
Recurrent Neural Networks. (n.d.). Retrieved October 8, 2017, from https://www.tensorflow.org/tutorials/recurrent<br />
<br />
Vector Representations of Words. (2017, August 17). Retrieved October 8, 2017, from https://www.tensorflow.org/tutorials/word2vec<br />
<br />
Word2vec. (n.d.). Retrieved October 26, 2017, from https://deeplearning4j.org/word2vec.html<br />
<br />
Zou, W. Y., Socher, R., Cer, D., & Manning, C. D. (2013). Bilingual word embeddings for phrase-based machine translation, 1393–1398.<br />
<br />
Kneser Ney Smoothing - : https://en.wikipedia.org/wiki/Kneser%E2%80%93Ney_smoothing & http://www.foldl.me/2014/kneser-ney-smoothing/<br />
<br />
Boné R., Assaad M., Crucianu M. (2003) Boosting Recurrent Neural Networks for Time Series Prediction. In: Pearson D.W., Steele N.C., Albrecht R.F. (eds) Artificial Neural Nets and Genetic Algorithms. Springer, Vienna<br />
<br />
Mikolov T., Karafiat M., Burget L., Cernocky J. H., Khudanpur S. Recurrent neural network based language model. Interspeech 2010.<br />
<br />
Chelba et al 2013, One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling.</div>
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=Dialog-based_Language_Learning&diff=30320 Dialog-based Language Learning 2017-11-14T19:02:41Z <p>Jdeng: /* Future work */</p>
<hr />
<div>This page is a summary for NIPS 2016 paper <i>Dialog-based Language Learning</i> [1].<br />
==Introduction==<br />
One of the ways humans learn language, especially a second language or language learned by students, is through communication and the feedback it provides. However, most existing research in Natural Language Understanding has focused on supervised learning from fixed training sets of labeled data. This kind of supervision is not realistic of how humans learn, where language is both learned by, and used for, communication. When humans act in dialogs (i.e., make speech utterances), the feedback from the other human’s responses contains very rich information. This is perhaps most pronounced in a student/teacher scenario where the teacher provides positive feedback for successful communication and corrections for unsuccessful ones. <br />
<br />
This paper is about dialog-based language learning, where supervision is given naturally and implicitly in the response of the dialog partner during the conversation. This paper is a step towards the ultimate goal of being able to develop an intelligent dialog agent that can learn while conducting conversations. Specifically, this paper explores whether we can train machine learning models to learn from dialog.<br />
<br />
===Contributions of this paper===<br />
*Introduce a set of tasks that model natural feedback from a teacher and hence assess the feasibility of dialog-based language learning. <br />
*Evaluated some baseline models on this data and compared them to standard supervised learning. <br />
*Introduced a novel forward prediction model, whereby the learner tries to predict the teacher’s replies to its actions, which yields promising results, even with no reward signal at all<br />
<br />
Code for this paper can be found on Github:https://github.com/facebook/MemNN/tree/master/DBLL<br />
<br />
==Background on Memory Networks==<br />
<br/><br />
[[File:ershad_dialognetwork.png|center|700px]] <br />
<br/><br />
A memory network combines learning strategies from the machine learning literature with a memory component that can be read and written to.<br />
<br />
The high-level view of a memory network is as follows:<br />
*There is a memory, $m$, an indexed array of objects (e.g. vectors or arrays of strings).<br />
*An input feature map $I$, which converts the incoming input to the internal feature representation<br />
*A generalization component $G$ which updates old memories given the new input. <br />
*An output feature map $O$, which produces a new output in the feature representation space given the new input and the current memory state.<br />
*A response component $R$ which converts the output into the response format desired – for example, a textual response or an action.<br />
<br />
$I$, $G$, $O$ and $R$ can all potentially be learned components and make use of any ideas from the existing machine learning literature.<br />
<br />
In question answering systems, for example, the components may be instantiated as follows:<br />
*$I$ can make use of standard pre-processing such as parsing, coreference, and entity resolution. It could also encode the input into an internal feature representation by converting from text to a sparse or dense feature vector.<br />
*The simplest form of $G$ is to introduce a function $H$ which maps the internal feature representation produced by I to an individual memory slot, and just updates the memory at $H(I(x))$.<br />
*$O$ reads from memory and performs inference to deduce the set of relevant memories needed to perform a good response.<br />
*$R$ would produce the actual wording of the question-answer based on the memories found by $O$. For example, $R$ could be an RNN conditioned on the output of $O$<br />
<br />
When the components $I$,$G$,$O$, & $R$ are neural networks, the authors describe the resulting system as a <b>Memory Neural Network (MemNN)</b>. They build a MemNN for QA (question answering) problems and compare it to RNNs (Recurrent Neural Network) and LSTMs (Long Short Term Memory RNNs) and find that it gives superior performance.<br />
<br />
[[File:DB_F2.png|center|800px]]<br />
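A structural sketch of the five components described above (toy word-overlap stubs stand in for the learned neural components; only the I/G/O/R names come from the text):<br />

```python
from dataclasses import dataclass, field

@dataclass
class MemoryNetwork:
    """Structural sketch of the I/G/O/R decomposition; components are toy stubs."""
    memory: list = field(default_factory=list)

    def I(self, x):           # input map: raw text -> internal features
        return x.lower().split()

    def G(self, features):    # generalization: write features to a memory slot
        self.memory.append(features)

    def O(self, features):    # output map: retrieve memories relevant to the input
        return [m for m in self.memory if set(m) & set(features)]

    def R(self, relevant):    # response: render the retrieved memories as text
        return " | ".join(" ".join(m) for m in relevant) or "unknown"

    def store(self, x):
        self.G(self.I(x))

    def answer(self, x):
        return self.R(self.O(self.I(x)))

net = MemoryNetwork()
net.store("Mary moved to the kitchen")
print(net.answer("where is Mary"))  # mary moved to the kitchen
```

In a learned MemNN, each of these stubs would be replaced by a trainable module, as in the MemN2N architecture below.<br />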
<br />
==Related Work==<br />
<br />
'''Usefulness of feedback in language learning:''' Social interaction and natural infant-directed conversations have been shown to be useful for language learning[2]. Several studies[3][4][5][6] have shown that feedback is especially useful in second language learning and learning by students.<br />
<br />
'''Supervised learning from dialogs using neural models:''' Neural networks have been used for response generation and can be trained end to end on large quantities of unstructured Twitter conversations[7]. However, this does not incorporate feedback from the dialog partner during real-time conversation.<br />
<br />
'''Reinforcement learning:''' Reinforcement learning works on dialogs[8][9] often consider reward as the feedback model rather than exploiting the dialog feedback per se. To be more specific, reinforcement learning utilizes a system of rewards, or what the authors of paper [8] call “trial-and-error”. The learning agent (in this case the language-learning agent) interacts with the dynamic environment (in this case through active dialog) and receives feedback in the form of positive or negative rewards. By setting the objective function as maximizing the rewards, the model can be trained without explicitly labeled responses. The reason why such an algorithm is not particularly efficient for training a dialog-based language learning model is that there is no explicit/fixed threshold for a positive or negative reward. One possible way around this is to define what a successful completion of a dialog should be and use that as the objective function. <br />
<br />
'''Forward prediction models:''' Forward models describe the causal relationship between actions and their consequences, and the fundamental goal of an action is to predict the consequences of it. Although forward prediction models have been used in other applications like learning eye-tracking[10], controlling robot arms[11] and vehicles[12], it has not been used for dialog.<br />
<br />
==Dialog-based Supervision tasks==<br />
For testing their models, the authors chose two datasets (i) the single supporting fact problem from the [http://fb.ai/babi bAbI] datasets [13] which consists of short stories from a simulated world followed by questions; and (ii) the MovieQA dataset [14] which is a large-scale dataset (∼ 100k questions over ∼ 75k entities) based on questions with answers in the open movie database (OMDb)<br />
<br />
However, since these datasets were not designed to model the supervision from dialogs, the authors modified them to create 10 supervision task types on these datasets(Fig 3).<br />
<br />
[[File:DB F3.png|center|700px]]<br />
<br/><br />
<br />
*'''Task 1: Imitating an Expert Student''': The dialogs take place between a teacher and an expert student who gives semantically coherent answers. Hence, the task is for the learner to imitate that expert student, and become an expert themselves <br />
<br />
*'''Task 2: Positive and Negative Feedback:''' When the learner answers a question, the teacher replies with either positive or negative feedback. In the experiments, the subsequent responses are variants of “No, that’s incorrect” or “Yes, that’s right”. In the datasets, there are 6 templates for positive feedback and 6 templates for negative feedback, e.g. ”Sorry, that’s not it.”, ”Wrong”, etc. To distinguish the notion of positive from negative, an additional external reward signal that is not part of the text is used.<br />
<br />
*'''Task 3: Answers Supplied by Teacher:''' The teacher gives positive and negative feedback as in Task 2; however, when the learner’s answer is incorrect, the teacher also responds with the correction. For example, if “where is Mary?” is answered with the incorrect answer “bedroom”, the teacher responds “No, the answer is kitchen”.<br />
<br />
*'''Task 4: Hints Supplied by Teacher:''' The corrections provided by the teacher do not provide the exact answer as in Task 3, but only a useful hint. This setting is meant to mimic the real-life occurrence of being provided only partial information about what you did wrong.<br />
<br />
*'''Task 5: Supporting Facts Supplied by Teacher:''' Another way of providing partial supervision for an incorrect answer is explored. Here, the teacher gives a reason (explanation) why the answer is wrong by referring to a known fact that supports the true answer that the incorrect answer may contradict. <br />
<br />
*'''Task 6: Partial Feedback:''' External rewards are only given some of (50% of) the time for correct answers, the setting is otherwise identical to Task 3. This attempts to mimic the realistic situation of some learning being more closely supervised (a teacher rewarding you for getting some answers right) whereas other dialogs have less supervision (no external rewards). The task attempts to assess the impact of such partial supervision.<br />
<br />
*'''Task 7: No Feedback:''' External rewards are not given at all, only text, but is otherwise identical to Tasks 3 and 6. This task explores whether it is actually possible to learn how to answer at all in such a setting.<br />
<br />
*'''Task 8: Imitation and Feedback Mixture:''' Combines Tasks 1 and 2. The goal is to see if a learner can learn successfully from both forms of supervision at once. This mimics a child both observing pairs of experts talking (Task 1) while also trying to talk (Task 2).<br />
<br />
*'''Task 9: Asking For Corrections:''' The learner will ask questions to the teacher about what it has done wrong. Task 9 tests one of the simplest instances, where asking “Can you help me?” when wrong obtains the correct answer from the teacher.<br />
<br />
*'''Task 10: Asking for Supporting Facts:''' A less direct form of supervision for the learner after asking for help is to receive a hint rather than the correct answer, such as “A relevant fact is John moved to the bathroom” when asking “Can you help me?”. This is thus related to the supervision in Task 5 except the learner must request help<br />
<br />
[[File:F4.png|center|700px]]<br />
<br/><br />
<br />
The authors constructed the ten supervision tasks for both datasets. They were built in the following way: for each task, a fixed policy is considered for answering questions which gets questions correct with probability $π_{acc}$ (i.e. the chance of getting the red text correct in Figs. 3 and 4). We thus can compare different learning algorithms for each task over different values of $π_{acc}$ (0.5, 0.1 and 0.01). In all cases, a training, validation and test set is provided. Note that because the policies are fixed the experiments in this paper are not in a reinforcement learning setting.<br />
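The fixed answering policy used to build the tasks can be sketched as follows (the candidate answers here are illustrative, not the datasets' actual answer sets):<br />

```python
import random

def policy_answer(correct, candidates, pi_acc, rng):
    """Answer correctly with probability pi_acc, else pick a random wrong candidate."""
    if rng.random() < pi_acc:
        return correct
    return rng.choice([c for c in candidates if c != correct])

rng = random.Random(0)
answers = [policy_answer("kitchen", ["kitchen", "garden", "bedroom"], 0.5, rng)
           for _ in range(10000)]
print(sum(a == "kitchen" for a in answers) / 10000)  # roughly pi_acc = 0.5
```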
<br />
==Learning models==<br />
This work evaluates four possible learning strategies for each of the 10 tasks: imitation learning, reward-based imitation, forward prediction, and a combination of reward-based imitation and forward prediction<br />
<br />
All of these approaches are evaluated with the same model architecture: an end-to-end memory network (MemN2N) [15], which has been used as a baseline model for exploring different modes of learning.<br />
<br />
[[File:F5.png|center|700px]]<br />
<br/><br />
The input is the last utterance of the dialog, $x$, as well as a set of memories (context) (<math> c_1</math>, . . . , <math> c_n</math> ) which can encode both short-term memory, e.g. recent previous utterances and replies, and long-term memories, e.g. facts that could be useful for answering questions. The context inputs <math> c_i</math> are converted into vectors <math> m_i</math> via embeddings and are stored in the memory. The goal is to produce an output $\hat{a}$ by processing the input $x$ and using that to address and read from the memory, $m$, possibly multiple times, in order to form a coherent reply. In the figure, the memory is read twice, which is termed multiple “hops” of attention. <br />
<br />
In the first step, the input $x$ is embedded using a matrix $A$ of size $d$ × $V$, where $d$ is the embedding dimension and $V$ is the size of the vocabulary, giving $q = Ax$, where the input $x$ is represented as a bag-of-words vector. Each memory <math> c_i</math> is embedded using the same matrix, giving $m_i = Ac_i$. The output of addressing and then reading from memory in the first hop is: <br />
<br />
[[File:eq1.png|center|400px]]<br />
<br />
Here, $p^{1}$ is a probability vector over the memories, and is a measure of how much the input and the memories match. The goal is to select memories relevant to the last utterance $x$, i.e. the most relevant have large values of $p^{1}_i$. The output memory representation $o_1$ is then constructed as the weighted sum of memories, i.e. weighted by $p^{1}$. The memory output is then added to the original input, <math> u_1</math> = <math> R_1</math>(<math> o_1</math> + $q$), to form the new state of the controller, where <math> R_1</math> is a $d$ × $d$ rotation matrix. The attention over the memory can then be repeated using <math> u_1</math> instead of $q$ as the addressing vector, yielding: <br />
<br />
[[File:eq2.png|center|400px]]<br />
<br />
The controller state is updated again with <math> u_2</math> = <math> R_2</math>(<math> o_2</math> + <math> u_1</math>), where <math> R_2</math> is another $d$ × $d$ matrix to be learnt. In a two-hop model the final output is then defined as: <br />
<br />
[[File:eq3.png|center|400px]]<br />
<br />
where there are $C$ candidate answers in $y$. In the experiments, $C$ is the set of actions that occur in the training set for the bAbI tasks, and for MovieQA it is the set of words retrieved from the KB.<br />
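Putting the hops together, here is a minimal NumPy sketch of the two-hop forward pass on random toy data (shapes and symbols follow the description above; this is an illustration, not the paper's implementation):<br />

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def memn2n_forward(x_bow, mem_bow, cand_bow, A, R1, R2):
    """Two-hop MemN2N forward pass following the equations above.

    x_bow: (V,) bag-of-words input; mem_bow: (n, V) memories;
    cand_bow: (C, V) candidate answers; A: (d, V); R1, R2: (d, d).
    """
    q = A @ x_bow                 # q = Ax, embedded last utterance
    m = mem_bow @ A.T             # m_i = A c_i, embedded memories, shape (n, d)
    p1 = softmax(m @ q)           # hop-1 attention over memories
    u1 = R1 @ (p1 @ m + q)        # u_1 = R_1(o_1 + q)
    p2 = softmax(m @ u1)          # hop-2 attention with the updated controller state
    u2 = R2 @ (p2 @ m + u1)       # u_2 = R_2(o_2 + u_1)
    y = cand_bow @ A.T            # embedded candidate answers, shape (C, d)
    return softmax(y @ u2)        # distribution over the C candidates

rng = np.random.default_rng(0)
V, d, n, C = 20, 8, 5, 4
probs = memn2n_forward(rng.random(V), rng.random((n, V)), rng.random((C, V)),
                       0.1 * rng.random((d, V)), np.eye(d), np.eye(d))
print(probs.shape, round(float(probs.sum()), 6))  # (4,) 1.0
```

Training fits $A$, $R_1$, $R_2$ (here frozen at random/identity values) by backpropagating a cross-entropy loss through this pass.<br />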
<br />
==Training strategies==<br />
1. '''Imitation Learning'''<br />
This approach involves simply imitating one of the speakers in observed dialogs. Examples arrive as $(x, c, a)$ triples, where $a$ is (assumed to be) a good response to the last utterance $x$ given context $c$. Here, the whole memory network model defined above is trained using stochastic gradient descent by minimizing a standard cross-entropy loss between $\hat{a}$ and the label $a$<br />
<br />
2. '''Reward-based Imitation''' <br />
If some actions are poor choices, then one does not want to repeat them; that is, we shouldn’t treat them as a supervised objective. Here, a positive reward is obtained immediately after (some of) the correct actions, and is zero otherwise. Imitation learning is applied only to the rewarded actions; the rest of the actions are simply discarded from the training set. For more complex cases like actions leading to long-term changes and delayed rewards, applying reinforcement learning algorithms would be necessary, e.g. one could still use policy gradient to train the MemN2N but applied to the model’s own policy.<br />
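The training-set filtering in reward-based imitation is simple to sketch (the field names below are illustrative, not from the paper's code):<br />

```python
def rbi_training_set(episodes):
    """Keep only (x, c, a) triples whose action received a positive reward;
    unrewarded actions are dropped from the training set entirely."""
    return [(e["x"], e["c"], e["a"]) for e in episodes if e["reward"] > 0]

episodes = [
    {"x": "where is Mary?", "c": ["Mary went to the kitchen"],
     "a": "kitchen", "reward": 1},
    {"x": "where is John?", "c": ["John went to the garden"],
     "a": "bedroom", "reward": 0},
]
print(rbi_training_set(episodes))  # only the rewarded triple survives
```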
<br />
3. '''Forward Prediction''' <br />
The aim is, given an utterance $x$ from speaker 1 and an answer a by speaker 2 (i.e., the learner), to predict $x^{¯}$, the response to the answer from speaker 1. That is, in general, to predict the changed state of the world after action $a$, which in this case involves the new utterance $x^{¯}$.<br />
<br />
[[File:F6.png|center|700px]]<br />
<br/><br />
As shown in Figure (b), this is achieved by chopping off the final output from the original network of Fig (a) and replacing it with some additional layers that compute the forward prediction. The first part of the network remains exactly the same and only has access to input x and context c, just as before. The computation up to $u_2$ = $R_2$($o_2$ + $u_1$) is thus exactly the same as before. <br />
<br />
Then another “hop” of attention is performed, but over the candidate answers rather than the memories. The information of which action (candidate) was actually selected in the dialog (i.e. which one is a) is also incorporated, which is crucial.<br />
<br />
Concretely, we compute: <br />
<br />
[[File:eq4.png|center|550px]]<br />
<br />
where $β^{*}$ is a $d$-dimensional vector, also learned, that marks in the output $o_3$ the action that was actually selected. The mechanism above gives the model a way to compare the most likely answers to $x$ with the given answer $a$. For example, if the given answer $a$ is incorrect and the model can assign high $p_i$ to the correct answer, then the output $o_3$ will contain a small amount of $\beta^*$; conversely, $o_3$ contains a large<br />
amount of $\beta^*$ if $a$ is correct. Thus, $o_3$ informs the model of the likely response $\bar{x}$ from the teacher. After obtaining $o_3$, the forward prediction is then computed as: <br />
<br />
[[File:eq5.png|center|500px]]<br />
<br />
where $u_3$ = $R_3$($o_3$ + $u_2$). That is, it computes the scores of the possible responses to the answer a over $\bar{C}$ possible candidates.<br />
<br />
Training can then be performed using the cross-entropy loss between $\hat{x}$ and the label $\bar{x}$, similar to before. In the event of a large number of candidates $\bar{C}$, the negatives are subsampled, always keeping $\bar{x}$ in the set. The set of answers $y$ can also be similarly sampled, making the method highly scalable. Note that after training with the forward prediction criterion, at test time one can “chop off” the top of the model again to<br />
retrieve the original memory network model. One can thus use it to predict answers $\hat{a}$ given only $x$ and $c$.<br />
<br />
<br />
4. '''Reward-based Imitation + Forward Prediction'''<br />
As reward-based imitation learning uses the architecture of Fig (a), and forward prediction uses the same architecture but with the additional layers of Fig (b), the two strategies can be learned jointly. This is a powerful combination, as it makes use of the reward signal when available and of the dialog feedback when the reward signal is not available. In this approach, the authors share the weights across the two networks and perform gradient steps for both criteria, taking advantage of both the forward prediction and reward-based approaches.<br />
<br />
==Experiments==<br />
<br />
Experiments were conducted on the two test datasets - bAbI and MovieQA. For each task, a fixed policy is considered for performing actions (answering questions) which gets questions correct with probability $π_{acc}$. This helps to compare the different training strategies described earlier over each task for different values of $π_{acc}$. Hyperparameters for all methods are optimized on the validation sets.<br />
<br/><br/><br />
[[File:DB F7.png|center|800px]]<br />
<br/><br />
The following results are observed by the authors:<br />
*Imitation learning, ignoring rewards, is a poor learning strategy when imitating inaccurate answers, e.g. for $π_{acc}$ < 0.5. For imitating an expert, however (Task 1) it is hard to beat. <br />
*Reward-based imitation (RBI) performs better when rewards are available, particularly in Table 1, but also degrades when they are too sparse, e.g. for $π_{acc}$ = 0.01.<br />
*Forward prediction (FP) is more robust and has stable performance at different levels of $π_{acc}$. However, as it only predicts answers implicitly and does not make use of rewards, it is outperformed by RBI on several tasks, notably Tasks 1 and 8 (because it cannot do supervised learning) and Task 2 (because it does not take advantage of positive rewards).<br />
*FP makes use of dialog feedback in Tasks 3-5 whereas RBI does not. This explains why FP does better with useful feedback (Tasks 3-5) than without (Task 2), whereas RBI cannot.<br />
*Supplying full answers (Task 3) is more useful than hints (Task 4) but hints still help FP more than just yes/no answers without extra information (Task 2).<br />
*When positive feedback is sometimes missing (Task 6) RBI suffers especially in Table 1. FP does not as it does not use this feedback.<br />
*One of the most surprising results of the experiments is that FP performs well overall, given that it does not use feedback, which the authors attempt to explain subsequently. This is particularly evident on Task 7 (no feedback), where RBI has no hope of succeeding as it has no positive examples. FP, on the other hand, learns adequately.<br />
*Tasks 9 and 10 are harder for FP as the question is not immediately before the feedback.<br />
*Combining RBI and FP ameliorates the failings of each, yielding the best overall results<br />
<br />
One of the most interesting aspects of the results in this paper is that FP works at all without any rewards.<br />
<br />
==Future work==<br />
* Any reply in a dialog can be seen as feedback and should be useful for learning. Evaluate if forward prediction, and the other approaches in this paper, work there too. <br />
* Develop further evaluation methodologies to test how the models presented here work in more complex settings where actions that are made lead to long-term changes in the environment and delayed rewards, i.e. extending to the reinforcement learning setting, and to full language generation. <br />
* How dialog-based feedback could also be used as a medium to learn non-dialog based skills, e.g. natural language dialog for completing visual or physical tasks. In the environment that actions can lead to long-term changes in the environment and delayed rewards, i.e. extending to the reinforcement learning setting.<br />
* A paper under review for ICLR 2017, also authored in-part by this paper's author, further extends the forward prediction method [17]. They assign a probability that the student will provide a random answer. The claim is that this allows the method to potentially discover correct answers. They also add data balancing, where they balance training across the all the teacher responses. This is supposed to ensure that one part of the distribution doesn't dominate during model learning.<br />
<br />
==Critique==<br />
<br />
The paper in its abstract says, there is no need for a reward, but a feedback by the partner saying "yes" or a "no" is a sort of reward. Yes, there is just a fixed policy used in learning, which makes this type of learning a subset of reinforcement learning which goes against the claim that this is not a reinforcement learning.<br />
<br />
Also, there are certain things that are not clearly explained, particularly the details of the forward-prediction model. It is not clear how the response to the answer is related to the learner's first input or to the answer. Hence it is not clear as to what the learner actually learns encodes into memory network. This also makes it impossible to say why this type of learning performs better than other approaches.<br />
<br />
It seems difficult to enhance this model to generate real complex queries to the user (not predefined ones), how would this method handle multiple dialogues turns with complex language? Also, the forward prediction architecture seems interesting but it can hardly be extended to multiple dialogue turns. On the other hand, reinforcement learning is interesting in dialogue: evaluating a value function implicitly learns a transition model and predicts future outcomes. Here, the authors say that this architecture allows avoiding the definition of a reward due to its ability to predict words such as "right" or "correct" but it is not always the case that the user gives feedback like (especially in real-world dialogues). It is worth comparing this method with reinforcement and imitation learning for dialogue management, which could eventually lead to more novel models.<br />
<br />
==References==<br />
# Jason Weston. Dialog-based Language Learning. NIPS, 2016.<br />
# P. K. Kuhl. Early language acquisition: cracking the speech code. Nature reviews neuroscience, 5(11): 831–843, 2004.<br />
# M. A. Bassiri. Interactional feedback and the impact of attitude and motivation on noticing l2 form. English Language and Literature Studies, 1(2):61, 2011.<br />
# R. Higgins, P. Hartley, and A. Skelton. The conscientious consumer: Reconsidering the role of assessment feedback in student learning. Studies in higher education, 27(1):53–64, 2002.<br />
# A. S. Latham. Learning through feedback. Educational Leadership, 54(8):86–87, 1997.<br />
# M. G. Werts, M. Wolery, A. Holcombe, and D. L. Gast. Instructive feedback: Review of parameters and effects. Journal of Behavioral Education, 5(1):55–75, 1995.<br />
# A. Sordoni, M. Galley, M. Auli, C. Brockett, Y. Ji, M. Mitchell, J.-Y. Nie, J. Gao, and B. Dolan. A neural network approach to context-sensitive generation of conversational responses. Proceedings of NAACL, 2015.<br />
# V. Rieser and O. Lemon. Reinforcement learning for adaptive dialogue systems: a data-driven methodology for dialogue management and natural language generation. Springer Science & Business Media, 2011.<br />
# J. Schatzmann, K. Weilhammer, M. Stuttle, and S. Young. A survey of statistical user simulation techniques for reinforcement-learning of dialogue management strategies. The knowledge engineering review, 21(02):97–126, 2006.<br />
# J. Schmidhuber and R. Huber. Learning to generate artificial fovea trajectories for target detection. International Journal of Neural Systems, 2(01n02):125–134, 1991.<br />
# I. Lenz, R. Knepper, and A. Saxena. Deepmpc: Learning deep latent features for model predictive control. In Robotics Science and Systems (RSS), 2015.<br />
# G. Wayne and L. Abbott. Hierarchical control using networks trained with higher-level forward models. Neural computation, 2014.<br />
# B. C. Stadie, S. Levine, and P. Abbeel. Incentivizing exploration in reinforcement learning with deep predictive models. arXiv preprint arXiv:1507.00814, 2015.<br />
# J. Clarke, D. Goldwasser, M.-W. Chang, and D. Roth. Driving semantic parsing from the world’s response. In Proceedings of the fourteenth conference on computational natural language learning, pages 18–27. Association for Computational Linguistics, 2010.<br />
# J. Schatzmann, K. Weilhammer, M. Stuttle, and S. Young. A survey of statistical user simulation techniques for reinforcement-learning of dialogue management strategies. The knowledge engineering review, 21(02):97–126, 2006.<br />
# 9 memory networks for language understanding: https://www.youtube.com/watch?v=5ekMog_nhaQ<br />
# Li, Jiwei; Miller, Alexander H.; Chopra, Sumit; Ranzato, Marc'Aurelio; Weston, Jason. "Dialogue Learning With Human-In-The-Loop". Review for ICLR 2017.</div>Jdenghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Dialog-based_Language_Learning&diff=30310Dialog-based Language Learning2017-11-14T18:48:34Z<p>Jdeng: /* Experiments */</p>
<hr />
<div>This page is a summary for NIPS 2016 paper <i>Dialog-based Language Learning</i> [1].<br />
==Introduction==<br />
One of the ways humans learn language, especially a second language or language learned by students, is through communication and the feedback it elicits. However, most existing research in Natural Language Understanding has focused on supervised learning from fixed training sets of labeled data. This kind of supervision does not reflect how humans learn, where language is both learned by, and used for, communication. When humans act in dialogs (i.e., make speech utterances), the feedback from the other human's responses contains very rich information. This is perhaps most pronounced in a student/teacher scenario, where the teacher provides positive feedback for successful communication and corrections for unsuccessful ones. <br />
<br />
This paper is about dialog-based language learning, where supervision is given naturally and implicitly in the response of the dialog partner during the conversation. This paper is a step towards the ultimate goal of being able to develop an intelligent dialog agent that can learn while conducting conversations. Specifically, this paper explores whether we can train machine learning models to learn from dialog.<br />
<br />
===Contributions of this paper===<br />
*Introduced a set of tasks that model natural feedback from a teacher, and hence assess the feasibility of dialog-based language learning. <br />
*Evaluated some baseline models on this data and compared them to standard supervised learning. <br />
*Introduced a novel forward prediction model, whereby the learner tries to predict the teacher's replies to its actions, which yields promising results, even with no reward signal at all.<br />
<br />
Code for this paper can be found on GitHub: https://github.com/facebook/MemNN/tree/master/DBLL<br />
<br />
==Background on Memory Networks==<br />
<br/><br />
[[File:ershad_dialognetwork.png|center|700px]] <br />
<br/><br />
A memory network combines learning strategies from the machine learning literature with a memory component that can be read and written to.<br />
<br />
The high-level view of a memory network is as follows:<br />
*There is a memory, $m$, an indexed array of objects (e.g. vectors or arrays of strings).<br />
*An input feature map $I$, which converts the incoming input to the internal feature representation<br />
*A generalization component $G$ which updates old memories given the new input. <br />
*An output feature map $O$, which produces a new output in the feature representation space given the new input and the current memory state.<br />
*A response component $R$ which converts the output into the response format desired – for example, a textual response or an action.<br />
<br />
$I$, $G$, $O$ and $R$ can all potentially be learned components and make use of any ideas from the existing machine learning literature.<br />
<br />
In question answering systems, for example, the components may be instantiated as follows:<br />
*$I$ can make use of standard pre-processing such as parsing, coreference, and entity resolution. It could also encode the input into an internal feature representation by converting from text to a sparse or dense feature vector.<br />
*The simplest form of $G$ is to introduce a function $H$ which maps the internal feature representation produced by $I$ to an individual memory slot, and just updates the memory at $H(I(x))$.<br />
*$O$ reads from memory and performs inference to deduce the set of relevant memories needed to produce a good response.<br />
*$R$ would produce the actual wording of the answer based on the memories found by $O$. For example, $R$ could be an RNN conditioned on the output of $O$.<br />
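As a toy illustration of the $I$/$G$/$O$/$R$ decomposition above, the following minimal sketch wires the four components into a read/write loop. This is purely hypothetical: the real components are learned neural modules, whereas here $I$ is a tokenizer, $O$ a word-overlap heuristic, and $G$ a simple append.<br />

```python
class MemoryNetwork:
    def __init__(self):
        self.memory = []  # m: an indexed array of stored representations

    def I(self, x):
        # Input feature map: convert raw text to an internal representation
        return x.lower().split()

    def G(self, features):
        # Generalization: here, simply write to the next free memory slot
        self.memory.append(features)

    def O(self, features):
        # Output feature map: score memories against the input and return
        # the most relevant one (a stand-in for real inference)
        overlap = lambda m: len(set(m) & set(features))
        return max(self.memory, key=overlap) if self.memory else []

    def R(self, output):
        # Response: convert the selected memory back into text
        return " ".join(output)

    def respond(self, x):
        features = self.I(x)
        reply = self.R(self.O(features))
        self.G(features)  # store the new utterance after responding
        return reply
```

For example, after storing "John moved to the kitchen", asking "where is John" retrieves the stored fact via the overlap heuristic.<br />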
<br />
When the components $I$,$G$,$O$, & $R$ are neural networks, the authors describe the resulting system as a <b>Memory Neural Network (MemNN)</b>. They build a MemNN for QA (question answering) problems and compare it to RNNs (Recurrent Neural Network) and LSTMs (Long Short Term Memory RNNs) and find that it gives superior performance.<br />
<br />
[[File:DB_F2.png|center|800px]]<br />
<br />
==Related Work==<br />
<br />
'''Usefulness of feedback in language learning:''' Social interaction and natural infant-directed conversations are shown to be useful for language learning [2]. Several studies [3][4][5][6] have shown that feedback is especially useful in second language learning and in learning by students.<br />
<br />
'''Supervised learning from dialogs using neural models:''' Neural networks have been used for response generation that can be trained end-to-end on large quantities of unstructured Twitter conversations [7]. However, this does not incorporate feedback from the dialog partner during real-time conversation.<br />
<br />
'''Reinforcement learning:''' Reinforcement learning works on dialogs [8][9] often consider a reward signal as the feedback model rather than exploiting the dialog feedback per se. More specifically, reinforcement learning relies on a system of rewards, or what the authors of [8] call "trial-and-error": the learning agent (here, the language-learning agent) interacts with a dynamic environment (here, through active dialog) and receives feedback in the form of positive or negative rewards. By setting the objective to maximizing the rewards, the model can be trained without explicit target responses. One reason such algorithms are not particularly efficient for training a dialog-based language learner is that there is no explicit, fixed threshold for what counts as a positive or negative reward. One possible remedy is to define what a successful completion of a dialog should be and use that as the objective function. <br />
<br />
'''Forward prediction models:''' Forward models describe the causal relationship between actions and their consequences; their fundamental goal is to predict the consequences of an action. Although forward prediction models have been used in other applications, such as learning eye movements [10], controlling robot arms [11] and vehicles [12], they had not been used for dialog.<br />
<br />
==Dialog-based Supervision tasks==<br />
For testing their models, the authors chose two datasets: (i) the single supporting fact problem from the [http://fb.ai/babi bAbI] datasets [13], which consists of short stories from a simulated world followed by questions; and (ii) the MovieQA dataset [14], a large-scale dataset (∼100k questions over ∼75k entities) based on questions with answers in the Open Movie Database (OMDb).<br />
<br />
However, since these datasets were not designed to model supervision from dialogs, the authors modified them to create 10 supervision task types on these datasets (Fig. 3).<br />
<br />
[[File:DB F3.png|center|700px]]<br />
<br/><br />
<br />
*'''Task 1: Imitating an Expert Student''': The dialogs take place between a teacher and an expert student who gives semantically coherent answers. Hence, the task is for the learner to imitate that expert student and become an expert itself. <br />
<br />
*'''Task 2: Positive and Negative Feedback:''' When the learner answers a question, the teacher replies with either positive or negative feedback. In the experiments, the subsequent responses are variants of "No, that's incorrect" or "Yes, that's right". In the datasets, there are 6 templates for positive feedback and 6 templates for negative feedback, e.g. "Sorry, that's not it.", "Wrong", etc. To distinguish the notion of positive from negative feedback, an additional external reward signal that is not part of the text is assumed.<br />
<br />
*'''Task 3: Answers Supplied by Teacher:''' The teacher gives positive and negative feedback as in Task 2, however when the learner’s answer is incorrect, the teacher also responds with the correction. For example if “where is Mary?” is answered with the incorrect answer “bedroom” the teacher responds “No, the answer is kitchen”’<br />
<br />
*'''Task 4: Hints Supplied by Teacher:''' The corrections provided by the teacher do not provide the exact answer as in Task 3, but only a useful hint. This setting is meant to mimic the real-life occurrence of being provided only partial information about what you did wrong.<br />
<br />
*'''Task 5: Supporting Facts Supplied by Teacher:''' Another way of providing partial supervision for an incorrect answer is explored. Here, the teacher gives a reason (explanation) why the answer is wrong by referring to a known fact that supports the true answer that the incorrect answer may contradict. <br />
<br />
*'''Task 6: Partial Feedback:''' External rewards are only given some of (50% of) the time for correct answers, the setting is otherwise identical to Task 3. This attempts to mimic the realistic situation of some learning being more closely supervised (a teacher rewarding you for getting some answers right) whereas other dialogs have less supervision (no external rewards). The task attempts to assess the impact of such partial supervision.<br />
<br />
*'''Task 7: No Feedback:''' External rewards are not given at all, only text, but is otherwise identical to Tasks 3 and 6. This task explores whether it is actually possible to learn how to answer at all in such a setting.<br />
<br />
*'''Task 8: Imitation and Feedback Mixture:''' Combines Tasks 1 and 2. The goal is to see if a learner can learn successfully from both forms of supervision at once. This mimics a child both observing pairs of experts talking (Task 1) while also trying to talk (Task 2).<br />
<br />
*'''Task 9: Asking For Corrections:''' The learner will ask questions to the teacher about what it has done wrong. Task 9 tests one of the most simple instances, where asking “Can you help me?” when wrong obtains from the teacher the correct answer.<br />
<br />
*'''Task 10: Asking for Supporting Facts:''' A less direct form of supervision for the learner after asking for help is to receive a hint rather than the correct answer, such as "A relevant fact is John moved to the bathroom" when asking "Can you help me?". This is thus related to the supervision in Task 5, except that the learner must request help.<br />
<br />
[[File:F4.png|center|700px]]<br />
<br/><br />
<br />
The authors constructed the ten supervision tasks for both datasets. They were built in the following way: for each task, a fixed policy is considered for answering questions which gets questions correct with probability $π_{acc}$ (i.e. the chance of getting the red text correct in Figs. 3 and 4). We thus can compare different learning algorithms for each task over different values of $π_{acc}$ (0.5, 0.1 and 0.01). In all cases, a training, validation and test set is provided. Note that because the policies are fixed the experiments in this paper are not in a reinforcement learning setting.<br />
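The fixed answering policy used to build these tasks can be simulated in a few lines. This is a hypothetical sketch: the `candidates` argument and the uniform model over incorrect answers are illustrative assumptions, not details from the paper.<br />

```python
import random

def fixed_policy(correct_answer, candidates, pi_acc, rng=random):
    """Answer correctly with probability pi_acc; otherwise return a
    random incorrect candidate (the uniform error model is an
    assumption made for illustration)."""
    if rng.random() < pi_acc:
        return correct_answer
    wrong = [c for c in candidates if c != correct_answer]
    return rng.choice(wrong)
```

Running this policy over a question set with $π_{acc}$ set to 0.5, 0.1, or 0.01 produces the differently-accurate "red text" answers of Figs. 3 and 4.<br />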
<br />
==Learning models==<br />
This work evaluates four possible learning strategies for each of the 10 tasks: imitation learning, reward-based imitation, forward prediction, and a combination of reward-based imitation and forward prediction<br />
<br />
All of these approaches are evaluated with the same model architecture: an end-to-end memory network (MemN2N) [15], which has been used as a baseline model for exploring different modes of learning.<br />
<br />
[[File:F5.png|center|700px]]<br />
<br/><br />
The input is the last utterance of the dialog, $x$, as well as a set of memories (context) (<math> c_1</math>, . . . , <math> c_n</math> ) which can encode both short-term memory, e.g. recent previous utterances and replies, and long-term memories, e.g. facts that could be useful for answering questions. The context inputs <math> c_i</math> are converted into vectors <math> m_i</math> via embeddings and are stored in the memory. The goal is to produce an output $\hat{a}$ by processing the input $x$ and using that to address and read from the memory, $m$, possibly multiple times, in order to form a coherent reply. In the figure, the memory is read twice, which is termed multiple “hops” of attention. <br />
<br />
In the first step, the input $x$ is embedded using a matrix $A$ of size $d$ × $V$, where $d$ is the embedding dimension and $V$ is the size of the vocabulary, giving $q = Ax$, where the input $x$ is represented as a bag-of-words vector. Each memory <math> c_i</math> is embedded using the same matrix, giving $m_i$ = $A$$c_i$. The output of addressing and then reading from memory in the first hop is: <br />
<br />
[[File:eq1.png|center|400px]]<br />
<br />
Here, $p^{1}$ is a probability vector over the memories, and is a measure of how much the input and the memories match. The goal is to select memories relevant to the last utterance $x$, i.e. the most relevant have large values of $p^{1}_i$. The output memory representation $o_1$ is then constructed as the weighted sum of memories, i.e. weighted by $p^{1}$. The memory output is then added to the original input, <math> u_1</math> = <math> R_1</math>(<math> o_1</math> + $q$), to form the new state of the controller, where <math> R_1</math> is a $d$ × $d$ rotation matrix. The attention over the memory can then be repeated using <math> u_1</math> instead of $q$ as the addressing vector, yielding: <br />
<br />
[[File:eq2.png|center|400px]]<br />
<br />
The controller state is updated again with <math> u_2</math> = <math> R_2</math>(<math> o_2</math> + <math> u_1</math>), where <math> R_2</math> is another $d$ × $d$ matrix to be learnt. In a two-hop model the final output is then defined as: <br />
<br />
[[File:eq3.png|center|400px]]<br />
<br />
where there are $C$ candidate answers in $y$. In the experiments, $C$ is the set of actions that occur in the training set for the bAbI tasks, and for MovieQA it is the set of words retrieved from the KB.<br />
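The two-hop forward pass described above can be sketched in NumPy. This is a minimal illustration assuming bag-of-words input vectors and a final scoring matrix $W$ over the candidate set (an assumed parameterization); the real model is trained end-to-end.<br />

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def memn2n_two_hops(x, contexts, A, R1, R2, W):
    """Sketch of the two-hop MemN2N forward pass. x and each context are
    bag-of-words vectors of size V; A (d x V) embeds them; R1 and R2 are
    the learned d x d matrices; W (C x d) scores the candidate answers."""
    q = A @ x                                # q = Ax: embed the last utterance
    m = np.stack([A @ c for c in contexts])  # m_i = A c_i: embed the memories
    p1 = softmax(m @ q)                      # hop 1: attention over memories
    o1 = p1 @ m                              # weighted sum of memories
    u1 = R1 @ (o1 + q)                       # new controller state
    p2 = softmax(m @ u1)                     # hop 2: address with u1 instead of q
    o2 = p2 @ m
    u2 = R2 @ (o2 + u1)
    return softmax(W @ u2)                   # distribution over C candidate answers
```

Note how the second hop reuses the same embedded memories but a different addressing vector, matching the equations above.<br />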
<br />
==Training strategies==<br />
1. '''Imitation Learning'''<br />
This approach simply imitates one of the speakers in observed dialogs. Examples arrive as $(x, c, a)$ triples, where $a$ is (assumed to be) a good response to the last utterance $x$ given context $c$. The whole memory network model defined above is trained using stochastic gradient descent by minimizing a standard cross-entropy loss between $\hat{a}$ and the label $a$.<br />
<br />
2. '''Reward-based Imitation''' <br />
If some actions are poor choices, one does not want to repeat them; that is, they shouldn't be treated as supervised targets. Here, a positive reward is obtained immediately after (some of) the correct actions, and is zero otherwise. Imitation learning is applied only to the rewarded actions; the rest are simply discarded from the training set. For more complex cases, such as actions leading to long-term changes and delayed rewards, applying reinforcement learning algorithms would be necessary, e.g. one could still use policy gradient to train the MemN2N, but applied to the model's own policy.<br />
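The filtering step described above can be sketched as follows, assuming (hypothetically) that each training example carries its observed reward:<br />

```python
def reward_based_imitation_batches(examples):
    """Keep only the (x, c, a) triples whose action received a positive
    reward; unrewarded actions are discarded from the training set.
    The (x, c, a, reward) tuple format is an illustrative assumption."""
    return [(x, c, a) for (x, c, a, r) in examples if r > 0]
```

The surviving triples are then trained on exactly as in plain imitation learning.<br />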
<br />
3. '''Forward Prediction''' <br />
The aim is, given an utterance $x$ from speaker 1 and an answer $a$ by speaker 2 (i.e., the learner), to predict $\bar{x}$, the response to the answer from speaker 1. That is, in general, to predict the changed state of the world after action $a$, which in this case involves the new utterance $\bar{x}$.<br />
<br />
[[File:F6.png|center|700px]]<br />
<br/><br />
As shown in Figure (b), this is achieved by chopping off the final output from the original network of Fig. (a) and replacing it with some additional layers that compute the forward prediction. The first part of the network remains exactly the same and only has access to input $x$ and context $c$, just as before. The computation up to $u_2$ = $R_2$($o_2$ + $u_1$) is thus exactly the same as before. <br />
<br />
Then another "hop" of attention is performed, but over the candidate answers rather than the memories. Crucially, the information of which action (candidate) was actually selected in the dialog (i.e. which one is $a$) is also incorporated. After this "hop", the resulting state of the controller is used to make the forward prediction.<br />
<br />
Concretely, we compute: <br />
<br />
[[File:eq4.png|center|550px]]<br />
<br />
where $\beta^{*}$ is a $d$-dimensional vector, also learned, that marks in the output $o_3$ the action that was actually selected. The mechanism above gives the model a way to compare the most likely answers to $x$ with the given answer $a$. For example, if the given answer $a$ is incorrect and the model can assign high $p_i$ to the correct answer, then the output $o_3$ will contain a small amount of $\beta^*$; conversely, $o_3$ contains a large amount of $\beta^*$ if $a$ is correct. Thus, $o_3$ informs the model of the likely response $\bar{x}$ from the teacher. After obtaining $o_3$, the forward prediction is then computed as: <br />
<br />
[[File:eq5.png|center|500px]]<br />
<br />
where $u_3$ = $R_3$($o_3$ + $u_2$). That is, it computes the scores of the possible responses to the answer a over $\bar{C}$ possible candidates.<br />
<br />
Training can then be performed using the cross-entropy loss between $\hat{x}$ and the label $\bar{x}$, similar to before. In the event of a large number of candidates $\bar{C}$, the negatives are subsampled, always keeping $\bar{x}$ in the set. The set of answers $y$ can also be similarly sampled, making the method highly scalable. Note that after training with the forward prediction criterion, at test time one can "chop off" the top of the model again to retrieve the original memory network model. One can thus use it to predict answers $\hat{a}$ given only $x$ and $c$.<br />
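Putting the pieces together, the forward-prediction head can be sketched as below. This is a hedged illustration: the candidate and response embedding matrices, and their shapes, are assumptions consistent with the text, not the paper's exact parameterization.<br />

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward_prediction_hop(u2, cand, a_idx, beta, R3, resp):
    """Sketch of the forward-prediction head: a third attention hop over
    the candidate answers, where the learned vector beta marks the answer
    actually given (index a_idx), followed by scoring possible teacher
    responses with resp (Cbar x d)."""
    p = softmax(cand @ u2)                      # attention over candidate answers
    chosen = (np.arange(len(cand)) == a_idx)    # indicator of the selected answer
    o3 = (cand + np.outer(chosen, beta)).T @ p  # o3 carries p[a_idx] worth of beta
    u3 = R3 @ (o3 + u2)                         # updated controller state
    return softmax(resp @ u3)                   # scores over responses x_bar
```

As described above, the amount of $\beta$ mixed into $o_3$ equals the attention weight on the selected answer, which is how the model senses whether $a$ was likely correct.<br />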
<br />
<br />
4. '''Reward-based Imitation + Forward Prediction'''<br />
As reward-based imitation learning uses the architecture of Fig. (a), and forward prediction uses the same architecture with the additional layers of Fig. (b), the two strategies can be learned jointly. This is a powerful combination, as it makes use of the reward signal when available and of the dialog feedback when it is not. In this approach, the authors share the weights across the two networks and perform gradient steps for both criteria, thereby taking advantage of both the forward prediction and reward-based approaches.<br />
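A minimal sketch of the combined criterion is below. The exact way the two losses are mixed per example is an assumption (the paper simply performs gradient steps for both criteria over shared weights), and the `(x, c, a, reward, reply)` example format is hypothetical.<br />

```python
def joint_loss(example, rbi_loss_fn, fp_loss_fn):
    """Per-example combination of the two criteria: the forward-prediction
    loss is always computed from the teacher's reply, and the reward-based
    imitation loss is added only when a positive reward is available."""
    x, c, a, reward, teacher_reply = example
    loss = fp_loss_fn(x, c, a, teacher_reply)  # dialog feedback, always usable
    if reward > 0:
        loss += rbi_loss_fn(x, c, a)           # supervised signal when rewarded
    return loss
```

Because the two heads share the memory-network trunk, gradients from both terms update the same embedding and controller weights.<br />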
<br />
==Experiments==<br />
<br />
Experiments were conducted on the two test datasets - bAbI and MovieQA. For each task, a fixed policy is considered for performing actions (answering questions) which gets questions correct with probability $π_{acc}$. This helps to compare the different training strategies described earlier over each task for different values of $π_{acc}$. Hyperparameters for all methods are optimized on the validation sets.<br />
<br/><br/><br />
[[File:DB F7.png|center|800px]]<br />
<br/><br />
The following results are observed by the authors:<br />
*Imitation learning, ignoring rewards, is a poor learning strategy when imitating inaccurate answers, e.g. for $π_{acc}$ < 0.5. For imitating an expert, however (Task 1) it is hard to beat. <br />
*Reward-based imitation (RBI) performs better when rewards are available, particularly in Table 1, but also degrades when they are too sparse, e.g. for $π_{acc}$ = 0.01.<br />
*Forward prediction (FP) is more robust and has stable performance at different levels of $π_{acc}$. However, as it only predicts answers implicitly and does not make use of rewards, it is outperformed by RBI on several tasks, notably Tasks 1 and 8 (because it cannot do supervised learning) and Task 2 (because it does not take advantage of positive rewards).<br />
*FP makes use of dialog feedback in Tasks 3-5 whereas RBI does not. This explains why FP does better with useful feedback (Tasks 3-5) than without (Task 2), whereas RBI shows no such improvement.<br />
*Supplying full answers (Task 3) is more useful than hints (Task 4) but hints still help FP more than just yes/no answers without extra information (Task 2).<br />
*When positive feedback is sometimes missing (Task 6) RBI suffers especially in Table 1. FP does not as it does not use this feedback.<br />
*One of the most surprising results is that FP performs well overall, given that it does not use rewards. This is particularly evident on Task 7 (no feedback), where RBI has no hope of succeeding as it has no positive examples, while FP still learns adequately.<br />
*Tasks 9 and 10 are harder for FP as the question is not immediately before the feedback.<br />
*Combining RBI and FP ameliorates the failings of each, yielding the best overall results.<br />
<br />
One of the most interesting aspects of the results in this paper is that FP works at all without any rewards.<br />
<br />
==Future work==<br />
* Any reply in a dialog can be seen as feedback and should be useful for learning. Evaluate if forward prediction, and the other approaches in this paper, work there too. <br />
* Develop further evaluation methodologies to test how the models presented here work in more complex settings where actions that are made lead to long-term changes in the environment and delayed rewards, i.e. extending to the reinforcement learning setting, and to full language generation. <br />
* How dialog-based feedback could also be used as a medium to learn non-dialog based skills, e.g. natural language dialog for completing visual or physical tasks.<br />
* A paper under review for ICLR 2017, co-authored by this paper's author, further extends the forward prediction method [17]. They assign a probability that the student will provide a random answer; the claim is that this allows the method to potentially discover correct answers. They also add data balancing, where training is balanced across all the teacher responses, which is meant to ensure that no one part of the distribution dominates during learning.<br />
<br />
==Critique==<br />
<br />
In its abstract, the paper claims there is no need for a reward, but feedback from the partner saying "yes" or "no" is itself a kind of reward. Moreover, since a fixed policy is used during learning, this type of learning can be seen as a special case of reinforcement learning, which goes against the claim that it is not reinforcement learning.<br />
<br />
Also, certain things are not clearly explained, particularly the details of the forward-prediction model. It is not clear how the response to the answer is related to the learner's first input or to the answer itself, and hence what the learner actually learns and encodes into the memory network. This also makes it difficult to say why this type of learning performs better than other approaches.<br />
<br />
It seems difficult to enhance this model to generate real, complex queries to the user (not predefined ones): how would this method handle multiple dialogue turns with complex language? The forward prediction architecture seems interesting, but it can hardly be extended to multiple dialogue turns. On the other hand, reinforcement learning is interesting in dialogue: evaluating a value function implicitly learns a transition model and predicts future outcomes. Here, the authors say that this architecture allows avoiding the definition of a reward thanks to its ability to predict words such as "right" or "correct", but it is not always the case that the user gives such feedback (especially in real-world dialogues). It would be worth comparing this method with reinforcement and imitation learning for dialogue management, which could eventually lead to more novel models.<br />
<br />
==References==<br />
# Jason Weston. Dialog-based Language Learning. NIPS, 2016.<br />
# P. K. Kuhl. Early language acquisition: cracking the speech code. Nature reviews neuroscience, 5(11): 831–843, 2004.<br />
# M. A. Bassiri. Interactional feedback and the impact of attitude and motivation on noticing l2 form. English Language and Literature Studies, 1(2):61, 2011.<br />
# R. Higgins, P. Hartley, and A. Skelton. The conscientious consumer: Reconsidering the role of assessment feedback in student learning. Studies in higher education, 27(1):53–64, 2002.<br />
# A. S. Latham. Learning through feedback. Educational Leadership, 54(8):86–87, 1997.<br />
# M. G. Werts, M. Wolery, A. Holcombe, and D. L. Gast. Instructive feedback: Review of parameters and effects. Journal of Behavioral Education, 5(1):55–75, 1995.<br />
# A. Sordoni, M. Galley, M. Auli, C. Brockett, Y. Ji, M. Mitchell, J.-Y. Nie, J. Gao, and B. Dolan. A neural network approach to context-sensitive generation of conversational responses. Proceedings of NAACL, 2015.<br />
# V. Rieser and O. Lemon. Reinforcement learning for adaptive dialogue systems: a data-driven methodology for dialogue management and natural language generation. Springer Science & Business Media, 2011.<br />
# J. Schatzmann, K. Weilhammer, M. Stuttle, and S. Young. A survey of statistical user simulation techniques for reinforcement-learning of dialogue management strategies. The knowledge engineering review, 21(02):97–126, 2006.<br />
# J. Schmidhuber and R. Huber. Learning to generate artificial fovea trajectories for target detection. International Journal of Neural Systems, 2(01n02):125–134, 1991.<br />
# I. Lenz, R. Knepper, and A. Saxena. Deepmpc: Learning deep latent features for model predictive control. In Robotics Science and Systems (RSS), 2015.<br />
# G. Wayne and L. Abbott. Hierarchical control using networks trained with higher-level forward models. Neural computation, 2014.<br />
# B. C. Stadie, S. Levine, and P. Abbeel. Incentivizing exploration in reinforcement learning with deep predictive models. arXiv preprint arXiv:1507.00814, 2015.<br />
# J. Clarke, D. Goldwasser, M.-W. Chang, and D. Roth. Driving semantic parsing from the world’s response. In Proceedings of the fourteenth conference on computational natural language learning, pages 18–27. Association for Computational Linguistics, 2010.<br />
# Memory networks for language understanding (video): https://www.youtube.com/watch?v=5ekMog_nhaQ<br />
# Li, Jiwei; Miller, Alexander H.; Chopra, Sumit; Ranzato, Marc'Aurelio; Weston, Jason. "Dialogue Learning With Human-In-The-Loop". Review for ICLR 2017.</div>Jdenghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F17/_Coupled_GAN&diff=30305STAT946F17/ Coupled GAN2017-11-14T18:30:39Z<p>Jdeng: /* Experiments */</p>
<hr />
<div><br />
This is a summary of the NIPS 2016 paper [1].<br />
== Introduction==<br />
<br />
Generative models attempt to characterize and estimate the underlying probability distribution of the data (typically images) and, in doing so, generate samples from the learned distribution. Moment-matching generative networks, variational auto-encoders, and Generative Adversarial Networks (GANs) are some of the most popular (and recent) classes of techniques in this burgeoning literature on generative models. The authors of the paper we are reviewing focus on proposing an extension to the class of GANs.<br />
<br />
The novelty of the proposed Coupled GAN (CoGAN) method lies in extending the GAN procedure (described in the next section) to the multi-domain setting. That is, the CoGAN methodology attempts to learn the (underlying) joint probability distribution of multi-domain images as a natural extension from the marginal setting associated with the vanilla GAN framework. This is inspired by the idea that deep neural networks learn a hierarchical feature representation. Another GAN model that also tries to learn a joint distribution is triple-GAN [24], which is based on designing a three-player game that helps to learn the joint distribution of observations and their corresponding labels. Given the dense and active literature on generative models, generating images in multiple domains is far from groundbreaking. Related works revolve around multi-modal deep learning ([2],[3]), semi-coupled dictionary learning ([4]), joint embedding space learning ([5]), and cross-domain image generation ([6],[7]), to name a few. Thus, the novelty of the authors' contributions to this field comes from two key differentiating points. Firstly, this was (one of) the first papers to endeavor to generate multi-domain images with the GAN framework. Secondly, and perhaps more significantly, the authors proposed to learn the underlying joint distribution without requiring the presence of tuples of corresponding images in the training set. Only sets of images drawn from the (marginal) distributions of the separate domains are sufficient. As per the authors' claim, constructing tuples of corresponding images to train from is challenging and a potential bottleneck for multi-domain image generation. One way around this bottleneck is thus to use their proposed CoGAN methodology. More details of how the authors achieve joint-distribution learning will be provided in the Coupled GAN section below.<br />
<br />
== Generative Adversarial Networks==<br />
<br />
A typical GAN framework consists of a generative model and a discriminative model. The generative model, which often is a de-convolutional network, takes as input a random ''latent'' vector (typically uniform or Gaussian) and synthesizes novel images resembling the real images (training set). The discriminative model, often a convolutional network, on the other hand, tries to distinguish between the fake synthesized images and the real images. The idea then is to let the two component models of the GAN framework "compete" with each other in the form of a min-max two player game. <br />
<br />
To further clarify and fix this idea, we introduce the mathematical setup of GANs following the notation used by the authors of this paper for sake of consistency. Let us define the following in our setup:<br />
<br />
:<math> \mathbf{x}</math>: a natural image drawn from the underlying distribution <math> p_X</math>,<br />
:<math> \mathbf{z} \sim U[-1,1]^d</math>: a latent random vector,<br />
:$g$: the generative model, $f$: the discriminative model.<br />
<br />
Ideally we are aiming for the system of these two ''adversarial'' networks to behave as:<br />
:Generator: $g(\mathbf{z})$ outputs an image with same support as $\mathbf{x}$. The probability density of the images output by $g$ can be denoted by $p_G$,<br />
:Discriminator: $f(\mathbf{x})=1$ if $\mathbf{x} \sim p_X$ and $f(\mathbf{x})=0$ if $\mathbf{x} \sim p_G$.<br />
<br />
To train such a system of networks given our goal (i.e., $p_G \rightarrow p_X$) we must treat such a framework as the following minimax two player game:<br />
<br />
$\displaystyle \max_{g} \min_{f} V(g,f) = \mathop{\mathbb{E}}_{\mathbf{x} \sim p_X}[-\log(f(\mathbf{x}))] + \mathop{\mathbb{E}}_{\mathbf{z} \sim p_{Z}(\mathbf{z})}[-\log(1-f(g(\mathbf{z})))] $.<br />
<br />
See [8], the seminal paper on this topic, for more information.<br />
<br />
Some of the crucial advantages of GANs are that Markov chains are never needed; only backprop is used to obtain gradients, no inference is needed during<br />
learning, and a wide variety of functions can be incorporated into the model [16].<br />
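As a toy illustration of the value function above, the following sketch estimates $V(g,f)$ by Monte Carlo, assuming NumPy; the callables `f` and `g` are hypothetical stand-ins for trained networks, not the paper's models.

```python
import numpy as np

def gan_value(f, g, x_real, z):
    """Monte Carlo estimate of the minimax value function
    V(g, f) = E[-log f(x)] + E[-log(1 - f(g(z)))]."""
    real_term = -np.log(f(x_real)).mean()
    fake_term = -np.log(1.0 - f(g(z))).mean()
    return real_term + fake_term

# Toy stand-ins (for illustration only): a discriminator that is maximally
# unsure (outputs 0.5 everywhere) and an identity "generator".
f = lambda x: np.full(len(x), 0.5)
g = lambda z: z

x_real = np.random.uniform(-1.0, 1.0, size=(64, 4))
z = np.random.uniform(-1.0, 1.0, size=(64, 4))  # z ~ U[-1, 1]^d

v = gan_value(f, g, x_real, z)  # equals 2*log(2) for the constant-0.5 discriminator
```

The discriminator lowers this value by pushing $f(\mathbf{x})$ toward 1 and $f(g(\mathbf{z}))$ toward 0, while the generator raises it, which is exactly the max-min game above.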
<br />
== Coupled Generative Adversarial Networks==<br />
<br />
The overarching goal of this framework is to learn a joint distribution of multi-domain images from data. That is, a density value is assigned to each joint occurrence of images in different domains. Examples of such pair of images in different domains include images of a particular scene with different modalities (color and depth) or images of the same face but with different facial attributes. <br />
<br />
To this end, the CoGAN setup consists of a pair of GANs, denoted as $GAN_1$ and $GAN_2$. Each GAN is tasked with synthesizing images in one domain. A naive training of such a system will result in learning the product of the two marginal distributions i.e., independence. However, by forcing the two GANs to share weights, the authors were able to demonstrate that they could in ''some sense'' learn the joint distribution of images. We will now describe the details of the generator and discriminator components of the setup and conclude this section with a summary of CoGAN learning algorithm.<br />
<br />
===Generator Models===<br />
<br />
Suppose $\mathbf{x_1} \sim p_{X_1}$ and $\mathbf{x_2} \sim p_{X_2}$ denote the natural images being drawn from the two marginal distributions of <br />
domain 1 and domain 2. Further, let $g_1$ be the generator of $GAN_1$ and $g_2$ be the generator of $GAN_2$. Both generators take as input the latent vector $\mathbf{z}$ defined in the previous section and output images in their respective domains. For completeness, denote the distributions of $g_1(\mathbf{z})$ and $g_2(\mathbf{z})$ by $p_{G_1}$ and $p_{G_2}$ respectively. We can characterize these two generator models as multi-layer perceptrons in the following way:<br />
<br />
\begin{align*}<br />
g_1(\mathbf{z})=g_1^{(m_1)}(g_1^{(m_1 -1)}(\dots g_1^{(2)}(g_1^{(1)}(\mathbf{z})))), \quad g_2(\mathbf{z})=g_2^{(m_2)}(g_2^{(m_2-1)}(\dots g_2^{(2)}(g_2^{(1)}(\mathbf{z})))),<br />
\end{align*}<br />
where $g_1^{(i)}$ and $g_2^{(i)}$ are the $i^{th}$ layers of $g_1$ and $g_2$ which respectively have a total of $m_1$ and $m_2$ layers each. Note $m_1$ need not be the same as $m_2$.<br />
<br />
As the generator networks can be thought of as an inverse of prototypical convolutional networks, the layers of these generator networks gradually decode information from high-level abstract concepts (first few layers) to low-level details (last few layers). Taking this idea as the blueprint for the inner workings of generator networks, the authors hypothesize that corresponding images in two domains share the same high-level semantics but differ in lower-level details. To put this hypothesis into practice, they force the first $k$ layers of $g_1$ and $g_2$ to have identical structures and share the same weights. That is, $\mathbf{\theta}_{g_1^{(i)}}=\mathbf{\theta}_{g_2^{(i)}}$ for $i=1,\dots,k$, where $\mathbf{\theta}_{g_1^{(i)}}$ and $\mathbf{\theta}_{g_2^{(i)}}$ represent the parameters of the layers $g_1^{(i)}$ and $g_2^{(i)}$ respectively. Hence the two generator networks share the first $k$ layers of the deep network and have different final layers to decode the differing low-level details in each domain.<br />
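A minimal sketch of this weight-sharing constraint, assuming NumPy: the first $k$ layers are literally the same weight arrays (shared by reference), while only the final decoding layer differs per domain. All layer sizes here are illustrative, and the plain MLP stands in for the paper's deconvolutional generators.

```python
import numpy as np

rng = np.random.default_rng(0)
dense = lambda n_in, n_out: rng.normal(size=(n_in, n_out))

# Sharing by reference enforces theta_{g1^(i)} = theta_{g2^(i)} for i <= k.
k = 2
shared = [dense(10, 32), dense(32, 64)]
g1_layers = shared + [dense(64, 28 * 28)]  # domain-1 decoding head
g2_layers = shared + [dense(64, 28 * 28)]  # domain-2 decoding head

def generate(layers, z):
    # Bare-bones MLP forward pass standing in for a deconvolutional net.
    h = z
    for W in layers:
        h = np.tanh(h @ W)
    return h

z = rng.uniform(-1.0, 1.0, size=(1, 10))  # one shared latent draw
img1, img2 = generate(g1_layers, z), generate(g2_layers, z)
```

Because both generators consume the same $\mathbf{z}$ and share their early layers, any gradient update to the shared arrays affects both domains at once.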
<br />
===Discriminative Models===<br />
<br />
Suppose $f_1$ and $f_2$ are the respective discriminative models of the two GANs. These models can be characterized by <br />
\begin{align*}<br />
f_1(\mathbf{x}_1)=f_1^{(n_1)}(f_1^{(n_1 -1)}(\dots f_1^{(2)}(f_1^{(1)}(\mathbf{x}_1)))), \quad f_2(\mathbf{x}_2)=f_2^{(n_2)}(f_2^{(n_2-1)}(\dots f_2^{(2)}(f_2^{(1)}(\mathbf{x}_2)))),<br />
\end{align*}<br />
where $f_1^{(i)}$ and $f_2^{(i)}$ are the $i^{th}$ layers of $f_1$ and $f_2$, which respectively have a total of $n_1$ and $n_2$ layers each. Note $n_1$ need not be the same as $n_2$. In contrast to the generator models, the first layers of $f_1$ and $f_2$ extract the low-level details, while the last layers extract the abstract high-level features. To reflect the prior hypothesis of shared high-level semantics between corresponding images, we can force $f_1$ and $f_2$ to share the weights of the last $l$ layers. That is, $\mathbf{\theta}_{f_1^{(n_1-i)}}=\mathbf{\theta}_{f_2^{(n_2-i)}}$ for $i=0,\dots,l-1$, where $\mathbf{\theta}_{f_1^{(i)}}$ and $\mathbf{\theta}_{f_2^{(i)}}$ represent the parameters of the layers $f_1^{(i)}$ and $f_2^{(i)}$ respectively. Unlike in the generative models, weight sharing in the discriminative models is not essential to estimating the joint distribution of images; however, it is beneficial because it reduces the total number of parameters in the network.<br />
<br />
===Coupled GAN (CoGAN) Framework and Learning===<br />
The following figure taken from the paper summarizes the system of models described in the previous subsections. <br />
<center><br />
[[File:CoGAN-1.PNG]]<br />
</center><br />
The CoGAN framework can be expressed as the following constrained min-max game<br />
<br />
\begin{align*}<br />
\max\limits_{g_1,g_2} \min\limits_{f_1, f_2} V(f_1,f_2,g_1,g_2)\quad \text{subject to} \ \mathbf{\theta}_{g_1^{(i)}}=\mathbf{\theta}_{g_2^{(i)}}, i=1,\dots,k, \quad \mathbf{\theta}_{f_1^{(n_1-j)}}=\mathbf{\theta}_{f_2^{(n_2-j)}}, j=0,\dots,l-1, <br />
\end{align*}<br />
where the value function V is characterized as <br />
\begin{align*}<br />
\mathop{\mathbb{E}}_{\mathbf{x}_1 \sim p_{X_1}}[-\log(f_1(\mathbf{x_1}))] + \mathop{\mathbb{E}}_{\mathbf{z} \sim p_{Z}(\mathbf{z})}[-\log(1-f_1(g_1(\mathbf{z})))]+\mathop{\mathbb{E}}_{\mathbf{x}_2 \sim p_{X_2}}[-\log(f_2(\mathbf{\mathbf{x}_2}))] + \mathop{\mathbb{E}}_{\mathbf{z} \sim p_{Z}(\mathbf{z})}[-\log(1-f_2(g_2(\mathbf{z})))]. <br />
\end{align*}<br />
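The value function above is simply the sum of two single-GAN objectives evaluated on a shared latent draw, which can be sketched as follows (assuming NumPy; the four callables are hypothetical stand-ins for the networks, not the paper's implementation):

```python
import numpy as np

def gan_value(f, g, x_real, z):
    # Single-GAN objective: E[-log f(x)] + E[-log(1 - f(g(z)))].
    return (-np.log(f(x_real)).mean()) + (-np.log(1.0 - f(g(z))).mean())

def cogan_value(f1, f2, g1, g2, x1, x2, z):
    # Both generators consume the *same* latent draw z; together with
    # weight sharing, this is what couples the two GAN objectives.
    return gan_value(f1, g1, x1, z) + gan_value(f2, g2, x2, z)

# Toy stand-ins for the four networks.
f1 = f2 = lambda x: np.full(len(x), 0.5)
g1 = g2 = lambda z: z

x1 = np.random.uniform(-1.0, 1.0, size=(32, 4))
x2 = np.random.uniform(-1.0, 1.0, size=(32, 4))
z = np.random.uniform(-1.0, 1.0, size=(32, 4))

v = cogan_value(f1, f2, g1, g2, x1, x2, z)  # 4*log(2) for the constant-0.5 discriminators
```

Note that nothing in this objective requires corresponding image pairs: $x_1$ and $x_2$ are sampled independently from the two marginals.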
<br />
For the purposes of storytelling, we can describe this game to have two teams with two players each. The generative models are on the same team and collaborate with each other to synthesize a pair of images in two different domains with the goal of fooling the discriminative models. Then, the discriminative models, with collaboration, try to differentiate between images drawn from the training data in their respective domains and the images generated by the respective generative models. The training algorithm for the CoGAN that was used is described in the following figure.<br />
<center><br />
[[File:CoGAN-2.PNG]]<br />
</center><br />
<br />
'''Important Remarks:''' <br />
<br />
CoGAN learning requires training samples drawn from the marginal distributions, $p_{X_1}$ and $p_{X_2}$. It does not rely on samples drawn from the joint distribution, $p_{X_1,X_2}$, for which correspondence supervision would be available. The main contribution is in showing that with just samples drawn separately from the marginal distributions, CoGAN can learn a joint distribution of images in the two domains. Both the weight-sharing constraint and adversarial training are essential for enabling this capability. <br />
<br />
Unlike autoencoder learning([3]), which encourages the generated image pair to be identical to the target pair, the adversarial training only encourages the generated pair of images to be individually resembling the images in the respective domains and ignores the correlation between them. Shared parameters, on the other hand, contribute to matching the correlation: the neurons responsible for decoding high-level semantics can be shared to produce highly correlated image pairs.<br />
<br />
==Experiments==<br />
<br />
To begin with, note that the authors do not use corresponding images in the training set, in accordance with the goal of ''learning'' the joint distribution of multi-domain images without correspondence supervision. As there were no existing approaches with identical prerequisites (i.e., training with no correspondence supervision) at the time the paper was written, the authors compared CoGAN with the conditional GAN (see [10] for more details on conditional GANs). A pair-image generation performance metric was adopted for comparison.<br />
<br />
The authors varied the numbers of weight-sharing layers in the generative and discriminative models to create different CoGANs for analyzing the weight-sharing effect for both tasks. They observe that the performance was<br />
positively correlated with the number of weight-sharing layers in the generative models. With more sharing layers in the generative models, the rendered pairs of images resembled true pairs drawn from the joint distribution more. <br />
It is also noted that the performance was uncorrelated to the number of weight-sharing layers in the discriminative models. However, discriminator weight-sharing is still preferred because this reduces the total number of network parameters.<br />
<br />
===MNIST Dataset===<br />
<br />
Two tasks were considered on the MNIST training set:<br />
<br />
# Task A: Learning a joint distribution of a digit and its edge image. <br />
# Task B: Learning a joint distribution of a digit and its negative image. <br />
<br />
For the generative models, the authors used convolutional networks with 5 identical layers. They varied the number of shared layers as part of their experimental setup. The two discriminative models were a version of the LeNet ([9]). The results of the CoGAN generation scheme are displayed in the figure below.<br />
<br />
<center><br />
[[File:CoGAN-3.PNG]]<br />
</center><br />
As you can see from the figure above, the CoGAN system was able to generate pairs of corresponding images without explicitly training with correspondence supervision. This was naturally due to sharing weights in lower levels used for decoding high-level semantics. Without sharing these weights, the CoGAN would just output a pair of unrelated images in the two domains.<br />
<br />
To investigate the effects of weight sharing in the generator/discriminator models used for both tasks, the authors varied the number of shared layers. To quantify the performance of the generator, the image generated by $GAN_1$ (domain 1) was transformed into the 2nd domain using the same method used to generate the training images in the 2nd domain, and this transformed image was compared with the image generated by $GAN_2$. If the joint distribution were learned perfectly, these two images would be identical. With that goal in mind, the authors used the pixel agreement ratio over 10000 images as the evaluation metric. In particular, 5 trials with different weight initializations were run and the average pixel agreement ratio was taken. The results depicting the relationship between the average pixel agreement ratio and the number of shared layers are summarized in the figure below. <br />
<center><br />
[[File:CoGAN-4.PNG]]<br />
</center><br />
The results corroborate intuition: the greater the number of shared layers in the generator models, the higher the pixel agreement ratios. Interestingly, the number of shared layers in the discriminative models does not seem to affect the pixel agreement ratios. Note this is a fairly simple, toy example, since by the nature of the evaluation criterion there is a deterministic way of generating an image in the 2nd domain.<br />
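The pixel agreement ratio used as the evaluation metric is straightforward to compute; a sketch, assuming NumPy arrays of binarized images (the tiny example images are made up for illustration):

```python
import numpy as np

def pixel_agreement_ratio(img_a, img_b):
    """Fraction of pixel positions at which the two images agree exactly."""
    assert img_a.shape == img_b.shape
    return float((img_a == img_b).mean())

# Two tiny binary "images".
a = np.array([[1, 0, 1],
              [0, 0, 1]])
b = np.array([[1, 1, 1],
              [0, 0, 0]])

par = pixel_agreement_ratio(a, b)  # 4 of 6 pixels agree -> 2/3
```

In the paper's setup, `img_a` would be the transformed $GAN_1$ output and `img_b` the $GAN_2$ output, averaged over 10000 images and 5 weight initializations.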
<br />
Finally, for this example, the authors compared the CoGAN framework with the conditional GAN model. For the conditional GAN, the generative and discriminative models were identical to those used for the CoGAN results. The conditional GAN additionally took a binary variable (the conditioning variable) as input: when the binary variable was 0, the conditional GAN synthesized an image in domain 1, and when it was 1, an image in domain 2. Naturally, for a fair comparison, the training set did not contain corresponding pairs of images. The experiments were conducted for the two tasks described above, with the pixel agreement ratio (PAR) as the evaluation criterion. For Task A, the CoGAN achieved a PAR of 0.952 versus 0.909 for the conditional GAN. For Task B, the CoGAN achieved a PAR of 0.967 versus 0.778 for the conditional GAN. The results are not particularly eye-opening, as the CoGAN was specifically designed for the purpose of learning the joint distribution of multi-domain images, whereas these tasks are just a very niche application for the conditional GAN. Nevertheless, for Task B, the results look promising.<br />
<br />
=== CelebFaces Attributes Dataset===<br />
<br />
For this experiment, the authors trained the CoGAN, using the CelebFaces Attributes Dataset, to generate pairs of faces with an attribute (domain 1) and without the attribute (domain 2). CelebFaces Attributes Dataset (CelebA) is a large-scale face attributes dataset with more than 200K celebrity images, each with 40 attribute annotations. The images cover large pose variations and background clutter. CelebA offers large diversity, large quantity, and rich annotations: 10,177 identities, 202,599 face images, and, for each image, 5 landmark locations and 40 binary attribute annotations.<br />
Convolutional networks with 7 layers were used for both the generative and discriminative models. Attributes include, for example, blonde/non-blonde hair, smiling/not smiling, or with/without sunglasses. The resulting synthesized pairs of images are shown in the figure below, rendered as a spectrum traveling from one point in the latent space to another (resembling gradually changing faces).<br />
<br />
<center><br />
[[File:CoGAN-5.PNG]]<br />
</center><br />
<br />
=== Color and Depth Images===<br />
For this experiment, the authors used two sources: the RGBD dataset and the NYU dataset. The RGBD dataset contains registered color and depth images of 300 objects. The authors partitioned the dataset into two equal-sized non-overlapping subsets: the color images in the 1st subset were used for training $GAN_1$, while the depth images in the 2nd subset were used for training $GAN_2$. The two image domains under consideration are the same for both datasets. As usual, no corresponding images were fed into the training of the CoGAN framework. The resulting renderings of pairs of color and depth images for both datasets are depicted in the figure below. <br />
<br />
<center><br />
[[File:CoGAN-6.PNG]]<br />
</center><br />
<br />
As is evident from the images, through the sharing of layers during training, the CoGAN was able to learn the appearance-depth correspondence.<br />
<br />
== Applications==<br />
<br />
===Unsupervised Domain Adaptation (UDA)===<br />
<br />
UDA involves adapting a classifier trained in one domain to a classification task in a new domain that contains only ''unlabeled'' training data, which rules out re-training the classifier in the new domain. Some prior work in the field includes subspace learning ([11],[12]) and deep discriminative network learning ([13],[14]). The authors experimented with the MNIST and USPS datasets to showcase the applicability of the CoGAN framework to the UDA problem, using a network architecture similar to the one employed for the MNIST experiment. The MNIST and USPS datasets are denoted $D_1$ and $D_2$ respectively in the paper. In accordance with the problem specification, no label information was used from $D_2$. <br />
<br />
The CoGAN is trained by jointly solving the classification problem in the MNIST domain, using the labels provided in $D_1$, and the CoGAN learning problem, which uses images from both $D_1$ and $D_2$. This training process produces two classifiers: $c_1(\mathbf{x}_1) \equiv c(f_1^{(3)}(f_1^{(2)}(f_1^{(1)}(\mathbf{x}_1))))$ for MNIST and $c_2(\mathbf{x}_2) \equiv c(f_2^{(3)}(f_2^{(2)}(f_2^{(1)}(\mathbf{x}_2))))$ for USPS. Note $f_1^{(2)} \equiv f_2^{(2)}$ and $f_1^{(3)} \equiv f_2^{(3)}$ due to weight sharing, and $c(\cdot)$ denotes the softmax layer added on top of the other layers in the respective discriminative networks. The classifier $c_2$ is then used for digit classification on the USPS dataset. The authors reported a 91.2% average accuracy when classifying the USPS dataset. The mirror problem of classifying the MNIST dataset without labels, using the fully labeled USPS dataset, achieved an average accuracy of 89.1%. These results significantly outperform (prior top classification accuracy lies roughly around 60-65%) what the authors ''claim'' to be the state-of-the-art methods in the UDA literature; in particular, the state of the art was noted to be described in [20].<br />
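The construction above can be illustrated with plain function composition. The layer functions below are arbitrary placeholders (not the paper's networks); the point is only that $f^{(2)}$, $f^{(3)}$, and the softmax head $c$ are the ''same'' objects in both classifiers, so training them with MNIST labels also fixes the USPS classifier $c_2$ without any $D_2$ labels.

```python
# Placeholder layers; real layers would be learned network functions.
f1_1 = lambda x: [v * 2.0 for v in x]      # MNIST-specific first layer f_1^(1)
f2_1 = lambda x: [v * 3.0 for v in x]      # USPS-specific first layer f_2^(1)
f_2 = lambda h: [v + 1.0 for v in h]       # shared layer f^(2)
f_3 = lambda h: [v - 0.5 for v in h]       # shared layer f^(3)
c = lambda h: max(range(len(h)), key=lambda i: h[i])  # head: predicted class index

c1 = lambda x: c(f_3(f_2(f1_1(x))))        # classifier for D1 (MNIST, trained with labels)
c2 = lambda x: c(f_3(f_2(f2_1(x))))        # classifier for D2 (USPS, no D2 labels used)
```

Only the first, domain-specific layer differs between `c1` and `c2`; everything downstream is shared, which is what transfers the supervision across domains.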
<br />
===Cross-Domain Image Transformation===<br />
<br />
Let $\mathbf{x}_1$ be an image in the 1st domain. The goal is to find a corresponding image $\mathbf{x}_2$ in the 2nd domain such that the joint probability density $p(\mathbf{x}_1,\mathbf{x}_2)$ is maximized. Given the two generators $g_1$ and $g_2$, one can achieve the cross-domain transformation by first finding the latent random vector that generates the input image $\mathbf{x}_1$ in the 1st domain. This amounts to the optimization $\mathbf{z}^{*}=\arg\min_{\mathbf{z}}L(g_1(\mathbf{z}),\mathbf{x}_1)$, where $L$ is a loss function measuring the distance between the two images. After finding $\mathbf{z}^*$, one can apply $g_2$ to obtain the transformed image, $\mathbf{x}_2 = g_2(\mathbf{z}^*)$. Only very preliminary results are provided by the authors: the paper's Figure 6 shows several CoGAN cross-domain transformation results, computed using the Euclidean loss function and the L-BFGS optimization algorithm. The authors concluded that the transformation was successful when the input image was covered by $g_{1}$, but that it produced blurry images otherwise. Overall, there is little here to warrant extended discussion; the authors hypothesize that more training images, as well as a better objective function, are required. The figure depicting their results is provided below for the sake of completeness.<br />
[[File:CoGAN-7.PNG]]<br />
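The inversion step $\mathbf{z}^{*}=\arg\min_{\mathbf{z}}L(g_1(\mathbf{z}),\mathbf{x}_1)$ can be sketched on a toy problem, assuming a linear stand-in generator and the Euclidean loss (the paper uses L-BFGS; plain gradient descent is used here for brevity, and all shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(8, 3))          # toy linear "generator": g1(z) = A @ z
g1 = lambda z: A @ z

z_true = np.array([0.5, -0.2, 0.3])
x1 = g1(z_true)                      # an input image known to be covered by g1

# Minimize L(z) = ||g1(z) - x1||^2 by gradient descent;
# for this linear g1 the gradient is 2 A^T (A z - x1).
z = np.zeros(3)
for _ in range(5000):
    z -= 0.01 * 2.0 * A.T @ (g1(z) - x1)

x2 = g1(z)  # in the real model this would be g2(z_star), the transformed image
```

When $\mathbf{x}_1$ lies on the range of $g_1$, as here, the recovered $\mathbf{z}$ reproduces it exactly; the blurry failures reported in the paper correspond to inputs not covered by $g_1$, for which the residual of this optimization stays large.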
<br />
===Eyewitness Facial Composite Generation===<br />
Since a Coupled GAN requires only small sets of images acquired separately from the marginal distributions of the individual domains, it could find use in forensic facial-composite generation. Given the high noise and variation in facial composites derived from eyewitness statements (one marginal), CoGAN could assist in narrowing down the search space.<br />
<br />
===Live Criminal Identification===<br />
Given a sufficiently large dataset of images of a criminal (one marginal) and a sufficiently large dataset of images of ubiquitous places (e.g., local stores, grocery stores, streets) (the other marginal), it would be possible to feed live footage to the coupled GAN discriminators to identify and timestamp locations the criminal has visited.<br />
<br />
== Discussion and Summary==<br />
In summary, this paper proposes a method for learning generative models that produce pairs of corresponding images belonging to two different domains. For instance, one image can be an RGB image of a scene, and its corresponding image the scene's depth map. For this approach, the authors use two adversarial networks with partially shared weights: in order for the models to generate pairs of corresponding images, both generative models share the weights that map the noise onto an intermediate code, while each network has independent weights that map from the intermediate code to its image type. To validate the networks, the authors use several image datasets. The main contributions of the paper can be summarized as:<br />
<br />
# A CoGAN framework for learning a joint distribution of multi-domain images was proposed. <br />
# The training is achieved by a simple weight sharing scheme for the generative and discriminative networks in the absence of any correspondence supervision in the training set. This can be construed as learning the joint distribution by using samples from the marginal distribution of images. <br />
# The experiments with digits, faces, and color/depth images provided some corroboration that the CoGAN system could synthesize corresponding pairs of images. <br />
# An application of the CoGAN framework for the problem of Unsupervised Domain Adaptation (UDA) was introduced. The preliminary results appear to be extremely promising for the task of adapting digit classifiers from MNIST to USPS data and vice-versa. <br />
#An application for the task of cross-domain image transformation was hypothesized with some very basic proof of concept results provided. <br />
# The setup naturally lends to more than the two domain setting focused on in the paper for experimental purposes. <br />
<br />
While the summary provided above adopted an objective filter, the following list enumerates the major ''subjective'' critical review points for this paper:<br />
# It appears the authors took components of various well-established techniques in the literature and produced the CoGAN framework. Weight-sharing is a well-documented idea as was correspondence/ multi-modal learning along with the GAN problem formation and training. However, when the components are put together in this way, they form a modest and timely novel contribution to the literature of generative networks.<br />
# With such prominent preliminary results for the problem of UDA, the authors could have provided some additional details of their training procedure (slightly unclear) and additional experiments under the UDA umbrella to fortify what appears to be a ''groundbreaking'' result when compared with state of the art methods. <br />
# The cross-domain image transformation application example was almost an afterthought. More details could have been provided in the supplementary file if pressed for space or perhaps just merely relegated to a follow-up paper/work.<br />
# The effectiveness of CoGAN in characterizing a joint distribution is exemplified merely by experiments on MNIST and face generation. It seems that CoGAN's ability to generate a pair of images from different distributions is limited to modifying original images locally (for example, the MNIST images show only very simple distribution differences such as edges and color). It would be interesting to run experiments to see where CoGAN starts to fail.<br />
<br />
Code for Co-GANs are available on Github :<br />
* Tensorflow & PyTorch : https://github.com/wiseodd/generative-models<br />
* Tensorflow : https://github.com/andrewliao11/CoGAN-tensorflow<br />
* Caffe : https://github.com/mingyuliutw/CoGAN<br />
* PyTorch : https://github.com/mingyuliutw/CoGAN_PyTorch<br />
<br />
== Critiques ==<br />
The idea of CoGAN seems very interesting and powerful, as it does not rely on pairs of corresponding training images for domain adaptation. The authors make a good effort to demonstrate and analyze the capabilities of the approach in several ways, and visually the results look promising. However, since the evaluation is mostly qualitative, it is not clear to what extent the models overfit the training data; for the RGBD experiments, for example, it is very hard to say anything more than that the generated pairs look superficially plausible.<br />
<br />
The model is never presented with corresponding image pairs (from the joint distribution), so nothing in the training data establishes what "corresponding" (joint distribution) means. The only pressure for the network to establish a sensible correspondence between images in the two domains comes from the particular weight-sharing constraint, which allows each network only limited capacity to map from the shared intermediate layers to the two different types of images (the evaluation in the paper uses networks with only one or two non-shared layers). This may be appropriate, and work well, for domain pairs that differ mostly in low-level features (e.g., faces with blonde/non-blonde hair, or RGB and depth images, as in the paper). But it raises the question of how easy it would be to impose just the right capacity constraint for domain pairs whose correspondence is at a more abstract level and/or more stochastic (e.g., images and text). The paper has not well established at what level the weight-sharing constraint should be applied for different types of domain pairs.<br />
<br />
==Related Works==<br />
Neural generative models have recently received an increasing amount of attention. Several approaches, including generative adversarial networks[8], variational autoencoders (VAE)[17], attention models[18], have shown that a deep network can learn an image distribution from samples. <br />
<br />
This paper focused on whether a joint distribution of images in different domains can be learned from samples drawn separately from its marginal distributions of the individual domains. <br />
<br />
Note that this work is different to the Attribute2Image work[19], which is based on a conditional VAE model [20]. The conditional model can be used to generate images of different styles, but they are unsuitable for generating images in two different domains such as color and depth image domains.<br />
<br />
This work is related to prior work in multi-modal learning, including joint embedding space learning [5] and multi-modal Boltzmann machines [2]. These approaches can generate corresponding samples in different domains only when correspondence annotations are given during training. This work is also related to prior work on cross-domain image generation, which studied transforming an image in one style into the corresponding image in another style; here, however, the authors focus on learning the joint distribution in an unsupervised fashion. This paper precedes a NIPS 2017 paper that shares one author, in which unsupervised image-to-image translation is further improved using Coupled GANs, with results on street scene translation and animal image translation as well as the previously mentioned face image translation [23].<br />
<br />
== References and Supplementary Resources==<br />
:[1] Liu, Ming-Yu, and Oncel Tuzel. "Coupled generative adversarial networks." Advances in neural information processing systems. 2016.<br />
:[2] Srivastava, Nitish, and Ruslan R. Salakhutdinov. "Multimodal learning with deep boltzmann machines." Advances in neural information processing systems. 2012.<br />
:[3] Ngiam, Jiquan, et al. "Multimodal deep learning." Proceedings of the 28th international conference on machine learning (ICML-11). 2011.<br />
:[4] Wang, Shenlong, et al. "Semi-coupled dictionary learning with applications to image super-resolution and photo-sketch synthesis." Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012.<br />
:[5] Kiros, Ryan, Ruslan Salakhutdinov, and Richard S. Zemel. "Unifying visual-semantic embeddings with multimodal neural language models." arXiv preprint arXiv:1411.2539 (2014).<br />
:[6] Yim, Junho, et al. "Rotating your face using multi-task deep neural network." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.<br />
:[7] Reed, Scott E., et al. "Deep visual analogy-making." Advances in neural information processing systems. 2015. <br />
:[8] Goodfellow, Ian, et al. "Generative adversarial nets." Advances in neural information processing systems. 2014.<br />
:[9] LeCun, Yann, et al. "Gradient-based learning applied to document recognition." Proceedings of the IEEE 86.11 (1998): 2278-2324.<br />
:[10] Mirza, Mehdi, and Simon Osindero. "Conditional generative adversarial nets." arXiv preprint arXiv:1411.1784 (2014).<br />
:[11] Long, Mingsheng, et al. "Transfer feature learning with joint distribution adaptation." Proceedings of the IEEE international conference on computer vision. 2013.<br />
:[12] Fernando, Basura, Tatiana Tommasi, and Tinne Tuytelaars. "Joint cross-domain classification and subspace learning for unsupervised adaptation." Pattern Recognition Letters 65 (2015): 60-66. <br />
:[13] Tzeng, Eric, et al. "Deep domain confusion: Maximizing for domain invariance." arXiv preprint arXiv:1412.3474 (2014).<br />
:[14] Rozantsev, Artem, Mathieu Salzmann, and Pascal Fua. "Beyond sharing weights for deep domain adaptation." arXiv preprint arXiv:1603.06432 (2016).<br />
:[15] http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html<br />
:[16] https://arxiv.org/pdf/1406.2661.pdf<br />
:[17] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In ICLR, 2014.<br />
:[18] Karol Gregor, Ivo Danihelka, Alex Graves, and Daan Wierstra. Draw: A recurrent neural network for image generation. In ICML, 2015.<br />
:[19] Xinchen Yan, Jimei Yang, Kihyuk Sohn, and Honglak Lee. Attribute2image: Conditional image generation from visual attributes. arXiv:1512.00570, 2015.<br />
:[20] Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In NIPS, 2014.<br />
:[21] A short summary of CoGAN with an example is given here: https://wiseodd.github.io/techblog/2017/02/18/coupled_gan/<br />
:[22] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y Ng. Multimodal deep learning. In ICML, 2011.<br />
:[23] Ming-Yu Liu, Thomas Breuel, and Jan Kautz. "Unsupervised Image-to-Image Translation Networks". In NIPS, 2017.<br />
:[24] Chongxuan Li, Kun Xu, Jun Zhu, Bo Zhang. "Triple Generative Adversarial Nets". In NIPS, 2017.<br />
<br />
Implementation Example on [https://github.com/mingyuliutw/cogan Github]</div>Jdenghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_the_Number_of_Neurons_in_Deep_Networks&diff=29927Learning the Number of Neurons in Deep Networks2017-11-09T19:10:03Z<p>Jdeng: /* Critique */</p>
<hr />
<div>='''Introduction'''=<br />
<br />
Due to the availability of massive datasets and powerful computational infrastructure, '''Deep Learning''' has made huge breakthroughs in many areas, such as language modelling and computer vision. In essence, deep learning algorithms are a re-branding of the neural networks of the 1950s, with multiple processing layers that GPU power now makes practical to train. Each processing layer (i.e. hidden layer) learns one level of abstraction of the data; this does not mean that more layers are always better, since the goal is to find a model size at which the network generalizes without over-fitting. In deep neural networks, we need to determine the number of layers and the number of neurons in each layer, i.e., the number of parameters, or complexity, of the model. Currently, this is mostly done by manually tuning these hyper-parameters on validation data, or by simply building very deep networks. However, building a very deep model is challenging, especially for very large datasets, as it leads to high memory cost and reduced speed.<br />
<br />
In this paper, the authors propose an approach that automatically selects the number of neurons in each layer while the network is being learned. The approach introduces a '''group sparsity regularizer''' on the parameters of the network, where each group acts on the parameters of one neuron, rather than training an initial network as a pre-processing step (e.g. training shallow or thin networks to mimic the behaviour of deep ones [Hinton et al., 2014, Romero et al., 2015]) and reducing neurons later as a post-processing step. The regularizer drives the parameters of useless neurons to zero, which cancels out the effect of those neurons. The approach therefore does not need to first learn a redundant network and then reduce its parameters; instead, it learns the number of relevant neurons in each layer and the parameters of those neurons simultaneously.<br />
<br />
In experiments on several image recognition datasets, the authors showed the effectiveness of this approach, which reduces the number of parameters by up to 80% compared to the complete model with no loss in recognition accuracy. The approach even yields more effective and faster networks that occupy less memory.<br />
<br />
='''Related Work'''=<br />
<br />
Recent research tends to produce very deep networks. Building very deep networks means learning more parameters, which leads to significant memory costs as well as a reduction in training speed. Even though automatic model selection has been developed over the past years through constructive and destructive approaches, both have drawbacks. A '''constructive method''' starts from a shallow architecture and then adds parameters [Bello, 1992]; similar work that adds new layers to an initially shallow network during learning was successfully employed by [Simonyan and Zisserman, 2014]. However, shallow networks have fewer parameters and cannot model non-linearities as effectively as deep networks [Montufar et al., 2014], so they may easily get stuck in bad local optima. The drawback of this method is therefore that such networks may produce poor initializations for the later stages (a claim the authors make without ever providing evidence for it). A '''destructive method''' starts with a deep network and then removes a significant number of redundant parameters [Denil et al., 2013, Cheng et al., 2015] while keeping its behaviour unchanged. Such techniques have been shown to remove redundant parameters [LeCun et al., 1990, Hassibi et al., 1993] or neurons [Mozer and Smolensky, 1988, Ji et al., 1990, Reed, 1993] that have little influence on the output, but they require analyzing each parameter or neuron via the network Hessian, which is computationally very expensive for large architectures. The main motivation of these works was to build a more compact network. Recent destructive approaches focus on learning a shallower or thinner network that mimics the behavior of an initial deeper network.<br />
<br />
In particular, building compact networks is a research focus for '''Convolutional Neural Networks''' (CNNs). Some works have proposed to decompose the filters of a pre-trained network into low-rank filters, which reduces the number of parameters [Jaderberg et al., 2014b, Denton et al., 2014, Gong et al., 2014]. The issue with this proposal is that an initial deep network must first be trained successfully, since the decomposition acts as a post-processing step. [Weigend et al., 1991] and [Collins and Kohli, 2014] used direct training to develop regularizers that eliminate some of the parameters of the network; the problem is that the number of layers and the number of neurons in each layer are still determined manually. A very similar work using the group lasso method for CNNs was previously done in [Liu et al., 2015]. The big-picture idea appears very similar, but the methods differ in detail: [Liu et al., 2015] involves computing the network Hessian, repeated multiple times over the learning process, which is computationally expensive for large-scale datasets; as a consequence, such techniques are no longer pursued in the current large-scale era.<br />
<br />
='''Model Training and Model Selection'''=<br />
<br />
In general, a deep network has $L$ layers, each containing linear operations on its inputs intertwined with activation functions, typically '''Rectified Linear Units (ReLU)''' or sigmoids. Suppose layer $l$ has $N_{l}$ neurons. The network parameters are $\Theta=(\theta_{l})_{1\leqslant{l}\leqslant{L}}$, where $\theta_{l}=({\theta^n _{l}})_{1\leqslant{n}\leqslant{N_{l}}}$ and $\theta^n _{l}=[w_{l}^{n},b_{l}^{n}]$. Here, $w_{l}^{n}$ is a linear operator acting on the layer’s input and $b_{l}^{n}$ is a bias. Given an input $x$, the network produces the output $\hat{y}=f(x,\Theta)$, where $f(\cdot)$ encodes the succession of linear, non-linear and pooling operations.<br />
<br />
At training time, we have $N$ input-output pairs ${(x_{i},y_{i})}_{1\leqslant{i}\leqslant{N}}$ and a loss function $\ell(y_{i},f(x_{i},\Theta))$ that compares the predicted output with the ground-truth output; generally, the logistic loss is chosen for classification and the square loss for regression. Learning the parameters of the network is therefore equivalent to solving the following optimization problem:<br />
$$\displaystyle \min_{\Theta}\frac{1}{N}\sum_{i=1}^{N}\ell(y_{i},f(x_{i},\Theta))+\gamma(\Theta),$$ where $\gamma(\Theta)$ represents a regularizer on the network parameters. Choices for such a regularizer include weight decay, i.e., $\gamma(\cdot)$ is the (squared) $\ell_{2}$-norm, or sparsity-inducing norms, e.g., the $\ell_{1}$-norm. The goal in this paper is to automatically determine the number of neurons in each layer, and neither of the above techniques achieves this goal. The authors instead make use of '''group sparsity''' (GS) [Yuan and Lin, 2007], starting from an overcomplete network and cancelling the influence of some of its neurons. The regularizer can then be written as $$\gamma(\Theta)=\sum_{l=1}^{L}\beta_{l}\sqrt{P_{l}}\sum_{n=1}^{N_{l}}||\theta_{l}^{n}||_{2},$$ where $P_{l}$ is the size of the vector containing the parameters of each neuron in layer $l$, and $\beta_{l}$ balances the influence of the penalty. In practice, the authors found it most effective to choose a relatively small $\beta_{l}$ for the first few layers and a larger weight for the remaining layers. The small weight for the early layers prevents deleting too many neurons there, so that enough information is retained for learning the remaining parameters. The original premise of this paper seemed to suggest a new method different from both the constructive and destructive methods described above; however, this approach of starting with an overcomplete network and training with group sparsity appears to be no different from destructive methods. The main contribution here is then the regularization function acting on entire neurons, which is in fairness an interesting approach.<br />
<br />
Group sparsity helps effectively remove some of the neurons, while standard regularizers on the individual parameters help generalization [Bartlett, 1996, Krogh and Hertz, 1992, Theodoridis, 2015, Collins and Kohli, 2014]. Combining these ideas leads to the '''sparse group Lasso''' (SGL), a more general penalty that merges the $\ell_1$-norm of the Lasso with the group Lasso (i.e. "two-norm") penalty. This yields solutions that are sparse at both the individual-feature level and the group level [1]. The regularizer can be written as $$\gamma(\Theta)=\sum_{l=1}^{L}\left((1-\alpha)\beta_{l}\sqrt{P_{l}}\sum_{n=1}^{N_{l}}||\theta_{l}^{n}||_{2}+\alpha\beta_{l}||\theta_{l}||_{1}\right)$$ where $\alpha\in[0,1]$. Note that if $\alpha=0$, this recovers the group sparsity regularizer. In practice, both $\alpha=0$ and $\alpha=0.5$ are used in the experiments.<br />
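The two regularizers can be sketched in a few lines of NumPy (illustrative only; the function and variable names are mine, and the paper applies this to convolutional kernels as well as fully-connected weights). Each row of a layer's matrix holds one neuron's parameter vector $\theta_{l}^{n}$, and setting $\alpha=0$ recovers the plain group sparsity penalty:<br />

```python
import numpy as np

def sgl_penalty(layers, betas, alpha=0.5):
    """Sparse group Lasso penalty gamma(Theta).

    `layers`: list of weight matrices, one row per neuron (weights + bias).
    `betas`:  per-layer weights beta_l.  alpha=0 gives plain group sparsity.
    """
    penalty = 0.0
    for W, beta in zip(layers, betas):
        P_l = W.shape[1]                                  # params per neuron
        group_term = np.sqrt(P_l) * np.sum(np.linalg.norm(W, axis=1))
        l1_term = np.sum(np.abs(W))                       # elementwise l1
        penalty += (1 - alpha) * beta * group_term + alpha * beta * l1_term
    return penalty
```

For a single neuron with parameters $[3, 4]$ and $\beta_l = 1$, the $\alpha=0$ penalty is $\sqrt{2}\cdot 5$, matching the formula above term by term.<br />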
<br />
This is reminiscent of the relationships among Lasso regression, ridge regression and elastic net regression (explained in Hastie et al., [https://web.stanford.edu/~hastie/Papers/ESLII.pdf The Elements of Statistical Learning], section 3.4). In Lasso regression, the penalized residual sum of squares adds an L1 regularizer to the regular residual sum of squares; in ridge regression, it adds an L2 regularizer. The elastic net combines the two, optimizing an objective that includes both the L1 and L2 norms. <br />
<br />
To solve this optimization problem, the paper uses proximal gradient descent [Parikh and Boyd, 2014]. This approach iteratively takes a gradient step of size $t$ with respect to the loss, followed by a proximal step, as follows: <br />
<br />
We define the proximal operator of a function $f$, with step size $t$, as $$prox_{tf}(v)=\displaystyle \arg\min_{x}\left(\frac{1}{2t}||x-v||_{2}^{2}+f(x)\right)$$ <br />
<br />
<br />
Suppose we want to minimize $f(x)+g(x)$, where $f$ is smooth and $g$ need not be. The proximal gradient method iterates $$x^{(k+1)}=prox_{t^{k}g}(x^{(k)}-t^{k}\nabla{f}(x^{(k)})), \quad k=1,2,3,\dots$$ <br />
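As a toy illustration of this iteration (not from the paper), consider the one-dimensional-per-coordinate lasso problem $\min_x \tfrac{1}{2}||x-b||_2^2 + \lambda||x||_1$, whose proximal operator is elementwise soft thresholding; the iteration converges to the known closed-form solution $S(b,\lambda)$:<br />

```python
import numpy as np

def soft_threshold(v, thresh):
    # S(v, thresh) = sign(v) * (|v| - thresh)_+ , applied elementwise
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

def proximal_gradient(grad_f, prox_g, x0, step, n_iter=500):
    """Generic iteration x <- prox_{t g}(x - t * grad_f(x))."""
    x = x0
    for _ in range(n_iter):
        x = prox_g(x - step * grad_f(x), step)
    return x

# Toy lasso: f(x) = 0.5*||x - b||^2 (so grad_f(x) = x - b), g(x) = lam*||x||_1.
b = np.array([3.0, -0.5, 0.2])
lam = 1.0
x = proximal_gradient(lambda x: x - b,
                      lambda v, t: soft_threshold(v, t * lam),
                      np.zeros(3), step=0.5)
# x converges to soft_threshold(b, lam) = [2, 0, 0]
```

The same scheme, with the sparse-group-Lasso proximal operator in place of plain soft thresholding, is what the paper applies per neuron.<br />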
<br />
Therefore, the parameters can be updated by the above method as $$\tilde{\theta}_{l}^{n}=\displaystyle \arg\min_{\theta_{l}^{n}}\frac{1}{2t}||\theta_{l}^{n}-\hat{\theta}_{l}^{n}||_{2}^{2}+\gamma(\Theta),$$<br />
where $\hat{\theta}_{l}^{n}$ is the solution obtained from the gradient step on the loss. Following the derivation of [Simon et al., 2013], this problem has a closed-form solution: <br />
$$\tilde{\theta}_{l}^{n}=\left(1-\frac{t(1-\alpha)\beta_{l}\sqrt{P_{l}}}{||S(\hat{\theta}_{l}^{n},t\alpha\beta_{l})||_{2}}\right)_{+}S(\hat{\theta}_{l}^{n},t\alpha\beta_{l}),$$<br />
where $(\cdot)_{+}$ denotes taking the maximum of the argument and 0, and $S(\cdot)$ is the soft-thresholding operator $$S(a,b)=\operatorname{sign}(a)(|a|-b)_{+}$$<br />
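The closed-form update for one neuron can be sketched as follows (a NumPy translation of the formula above; the function name and calling convention are mine). Elementwise soft thresholding handles the $\ell_1$ part, and the subsequent group-level shrinkage can zero out the whole neuron at once:<br />

```python
import numpy as np

def soft_threshold(a, b):
    # S(a, b) = sign(a) * (|a| - b)_+
    return np.sign(a) * np.maximum(np.abs(a) - b, 0.0)

def sgl_prox(theta_hat, t, alpha, beta, P_l):
    """Closed-form proximal update for one neuron's parameter vector
    (following Simon et al., 2013)."""
    s = soft_threshold(theta_hat, t * alpha * beta)
    norm_s = np.linalg.norm(s)
    if norm_s == 0.0:
        return s                              # neuron already zeroed out
    scale = max(0.0, 1.0 - t * (1 - alpha) * beta * np.sqrt(P_l) / norm_s)
    return scale * s                          # scale == 0 kills the neuron
```

With $\alpha=0$, a neuron whose gradient-step parameters are small (e.g. $[0.1, -0.1]$ with $t=\beta_l=1$) is zeroed out entirely, while a neuron with large parameters is merely shrunk; this is exactly the mechanism that removes neurons.<br />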
In practice, stochastic gradient descent with mini-batches is used, and the variables of all the groups are then updated according to the closed form of $\tilde{\theta}_{l}^{n}$. When training terminates, the neurons whose parameters have gone to zero are removed. Additionally, in fully-connected layers, the weights acting on the outputs of zeroed-out neurons of the previous layer also become useless and are removed accordingly.<br />
<br />
='''Experiment'''=<br />
<br />
==='''Set Up'''===<br />
<br />
They use two large-scale image classification datasets, '''ImageNet''' [Russakovsky et al., 2015] and '''Places2-401''' [Zhou et al., 2015]. They also conducted additional experiments on the '''ICDAR''' character recognition dataset of [Jaderberg et al., 2014a]. <br />
<br />
For ImageNet, they used the subset containing 1000 categories, with 1.2 million training images and 50,000 validation images. Places2-401 has more than 10 million images across 401 unique scene categories, with 5,000 to 30,000 images per category. For both datasets, the architectures are based on the VGG-B network (BNet) [Simonyan and Zisserman, 2014] and on DecomposeMe8 ($Dec_{8}$) [Alvarez and Petersson, 2016]. BNet has 10 convolutional layers followed by 3 fully-connected layers; in the experiments, the first 2 fully-connected layers are removed, giving $BNet^{C}$. $Dec_{8}$ contains 16 convolutional layers with 1D kernels, which model 8 2D convolutional layers. Both models were trained for a total of 55 epochs with 12,000 batches per epoch and a batch size of 48 and 180 for BNet and $Dec_{8}$, respectively. The learning rate was initialized to 0.01 and then multiplied by 0.1. They set $\beta_{l}=0.102$ for the first three layers and $\beta_{l}=0.255$ for the remaining ones.<br />
<br />
The ICDAR dataset consists of 185,639 training and 5,198 test samples split into 36 categories. The architecture here starts with 6 1D convolutional layers with max-pooling (rather than 3 convolutional layers with a maxout layer [Goodfellow et al., 2013] after each convolution), followed by one fully-connected layer; the authors call this architecture $Dec_{3}$. The model was trained for a total of 45 epochs with a batch size of 256 and 1000 iterations per epoch. The learning rate was initialized to 0.1 and multiplied by 0.1 at the second, seventh and fifteenth epochs. They set $\beta_{l}=5.1$ for the first layer and $\beta_{l}=10.2$ for the remaining ones.<br />
<br />
==='''Results'''===<br />
<br />
[[File:imageNet.png]]<br />
<br />
The above table shows the accuracy comparison between the original architectures and the authors'. For $Dec_{8}$ on the ImageNet dataset, two additional models were evaluated: $Dec_{8}-640$ with 640 neurons per layer and $Dec_{8}-768$ with 768 neurons per layer. $Dec_{8}-640_{SGL}$ denotes the sparse group Lasso regularizer with $\alpha=0.5$, and $Dec_{8}-640_{GS}$ the group sparsity regularizer. Note that all of the proposed architectures yield an improvement over the original network except $Dec_{8}-768$. For instance, Ours-$BNet_{GS}^{C}$ improves performance by 1.6% compared to $BNet^{C}$. <br />
<br />
[[File:44.png]]<br />
<br />
[[File:2.png]]<br />
<br />
The above figures report the percentage of neurons/parameters reduced by the approach for $BNet^{C}$ and $Dec_{8}$. For example, in the first figure, the approach reduces the number of neurons by over 12% and the number of parameters by around 14%, while improving generalization by 1.6% (as indicated by the accuracy gap). The left image in the first figure also shows that the reduction in the number of neurons is spread across all the layers, with the largest difference in L10. For $Dec_{8}$, in the second figure, we see that as the number of neurons in each layer increases, the benefits of the approach become more significant. For instance, $Dec_{8}-640$ with the group sparsity regularizer reduces the number of neurons by 10% and the number of parameters by 12.48%. The left image in the second figure likewise shows that the reduction in neurons is spread across all the layers. <br />
<br />
[[File:ICDA.png]]<br />
<br />
Finally, the above figure shows the experimental results for the ICDAR dataset, using the $Dec_{3}$ architecture, whose last two layers initially contain 512 neurons. The accuracy for the $MaxPool_{2D}$ baseline is 83.8%, while the accuracy for $Dec_{3}$ is 89.3%, which means the 1D filters perform better than a network with 2D kernels. On this dataset, the model with the group sparsity regularizer reduces the number of neurons by 38.64% and the total number of parameters by up to 80%.<br />
<br />
All the above results show that the algorithm effectively reduces the number of parameters while maintaining or increasing model accuracy; the automatic model selection performs well on these classification tasks.<br />
<br />
='''Analysis on Testing'''=<br />
<br />
The algorithm does not remove neurons during training; instead, the zeroed-out neurons are removed after training, which yields a smaller network at test time. This not only reduces the number of parameters of the network, but also decreases the computational memory cost and increases speed. <br />
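The post-training pruning step can be sketched for a pair of consecutive fully-connected layers (illustrative NumPy code; the summary does not give implementation details, so the function name, calling convention and tolerance are assumptions). A neuron is dropped when all of its weights and its bias have gone to zero, and the corresponding input columns of the next layer are dropped with it:<br />

```python
import numpy as np

def prune_zeroed_neurons(W1, b1, W2, tol=1e-12):
    """Remove zeroed-out neurons of a dense layer after training.

    W1: (n_neurons, n_inputs) weights, one row per neuron; b1: (n_neurons,).
    W2: (n_outputs, n_neurons) weights of the next layer, which consumes
    W1's outputs, so its columns matching dead neurons are useless too.
    """
    # A neuron is alive if any of its weights or its bias is non-zero.
    alive = (np.abs(W1).max(axis=1) > tol) | (np.abs(b1) > tol)
    return W1[alive], b1[alive], W2[:, alive]
```

Because the group sparsity prox sets entire rows of $W1$ (and the matching biases) exactly to zero, this pruning changes the network's output not at all while shrinking its parameter count, which is the source of the test-time speed and memory gains reported below.<br />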
<br />
[[File:table2.png]]<br />
<br />
The above table reports the runtime, memory, and the percentage of parameters removed after discarding the zeroed-out neurons. BNet and $Dec_{8}$ were tested on ImageNet, while $Dec_{3-GS}$ was tested on ICDAR. From the table, all the models speed up the runtime; for example, $Dec_{8}-768_{GS}$ on ImageNet speeds up the runtime by nearly 16% at a batch size of 8, and $Dec_{3}$ on ICDAR by nearly 50% at a batch size of 16. In terms of parameters removed, BNet, $Dec_{8}-640_{GS}$ and $Dec_{8}-768_{GS}$ are reduced by 12.06%, 26.51%, and 46.73% respectively; more significantly, $Dec_{3-GS}$ sheds 82.35% of its parameters. All of these changes show the benefits at test time. The runtimes were obtained using a single Tesla K20m, and the memory estimates use RGB images of size 224 × 224 for Ours-BNet, Ours-Dec8-640_GS and Ours-Dec8-768_GS, and gray-level images of size 32 × 32 for Ours-Dec3-GS.<br />
<br />
='''Conclusion'''=<br />
<br />
In this paper, the authors introduced an approach that relies on a group sparsity regularizer to automatically determine the number of neurons in each layer of a deep network. The experiments showed that the approach not only reduces the number of parameters of the model, but also saves computational memory and increases speed at test time. However, a limitation of the approach is that the number of layers in the network remains fixed.<br />
<br />
='''Critique'''=<br />
The authors of the paper state that "...we assume that the parameters of each neuron in layer $l$ are grouped in a vector of size $P_{l}$ and where $\lambda_{l}$ sets the influence of the penalty. Note that, in the general case, this weight can be different for each layer $l$. In practice, however, we found most effective to have<br />
two different weights: a relatively small one for the first few layers, and a larger weight for the<br />
remaining ones. This effectively prevents killing too many neurons in the first few layers, and thus<br />
retains enough information for the remaining ones." However, the authors fail to present any guidance as to what gets counted as "the first few layers" and what the relative sizes for the two weights should be even after we have chosen the "first few layers". Indeed, such choice seems to be an unaccounted component of tuning the model but this receives scant attention in the current paper. Several numerical comparisons should be carried out to allow further discussion on this question.<br />
<br />
The parameter $\beta_l$ is important for the performance of the network, but the authors do not provide enough detail on how to tune it, whether by cross-validation or otherwise. The performance of the model under various settings of $\beta_l$ would be interesting and important for understanding the robustness of the method. <br />
<br />
The experiments could have included better baseline models to compare against. For example, how do we know the original model was not overly complex to begin with? It might have been a good idea for the authors to compare their sparse group Lasso method against the naive method of (blindly) reducing the number of neurons in each layer by 10-20%, just as a very preliminary check. On top of that, the authors could have compared against conventional L1 and L2 regularization, which can reduce the number of non-zero parameters, as well as other techniques such as setting small weight values to zero and then fine-tuning, as done in https://www.microsoft.com/en-us/research/publication/exploiting-sparseness-in-deep-neural-networks-for-large-vocabulary-speech-recognition/. Also, the authors could have applied the theory of ridge and Lasso regression to analyze the effect of the regularization mathematically.<br />
<br />
A rather reliable point of comparison for performance and accuracy has been left out: the authors do not compare this method with Dropout [Srivastava et al., 2014], which is similar in terms of its physical effect on the network. The authors state: "[...] Note that none of the standard regularizers mentioned above achieve this goal: The former favors small parameter values, and the latter tends to cancel out individual parameters, but not complete neurons." This draws a direct comparison to regularizers while ignoring that dropout methods do remove complete neurons.<br />
<br />
It would have been interesting to see the performance gain on real time applications such as YOLO or SSD object detectors that are being used in self-driving cars by incorporating the approach presented by the paper into its convolution neural nets. Meanwhile, as an interesting extension, it would be better if the authors could test this group sparse regularization in deep reinforcement learning, where a convolution neural network is used to predict the reward.<br />
<br />
As an important property of regularizer, the influence of the group sparse regularization on avoiding overfitting is yet unknown. The number of epochs increases or decreases after applying this regularization to achieve the same accuracy can be further studied.<br />
<br />
The authors' claim that their approach "automatically determines the number of neurons" seems overstated at best. In reality, the approach finds redundancy in an overspecified model, which provides the benefit of size reduction as outlined. This yields non-trivial benefits, but it has no way of addressing the (albeit less likely) issue of an underspecified model. In conjunction with the fact that the number of layers must remain fixed, the method has the feel of smart regularization rather than size learning. Coupled with the lack of a dropout comparison, this leaves doubts regarding the efficacy of the technique for model specification. If a model must be intentionally over-specified before its parameters can be learned, then it is hard to claim memory-reduction benefits relative to any technique stemming from an underspecified model. In any case, this may serve as an efficient technique for many networks used in practice today, which are designed to be extraordinarily massive, but labelling it a means of sizing a network is erroneous.<br />
<br />
='''References'''=<br />
<br />
P. L. Bartlett. For valid generalization the size of the weights is more important than the size of the network. In NIPS, 1996.<br />
<br />
M. G. Bello. Enhanced training algorithms, and integrated training/architecture selection for multilayer perceptron networks. IEEE Transactions on Neural Networks, 3(6):864–875, Nov 1992.<br />
<br />
Yu Cheng, Felix X. Yu, Rogério Schmidt Feris, Sanjiv Kumar, Alok N. Choudhary, and Shih-Fu Chang. An exploration of parameter redundancy in deep networks with circulant projections. In ICCV, 2015.<br />
<br />
I. J. Goodfellow, D. Warde-farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. In ICML, 2013.<br />
<br />
G. E. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. In arXiv, 2014.<br />
<br />
M. Jaderberg, A. Vedaldi, and A. Zisserman. Deep features for text spotting. In ECCV, 2014a.<br />
<br />
M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. In BMVC, 2014b.<br />
<br />
N. Simon, J. Friedman, T. Hastie, and R. Tibshirani. A sparse-group lasso. Journal of Computational and Graphical Statistics, 2013.<br />
<br />
H. Zhou, J. M. Alvarez, and F. Porikli. Less is more: Towards compact CNNs. In ECCV, 2016.<br />
<br />
Group LASSO - https://pdfs.semanticscholar.org/f677/a011b2a912e3c5c604f6872b9716cc0b8aa0.pdf<br />
<br />
Liu, Baoyuan, et al. "Sparse convolutional neural networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.<br />
<br />
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1 (January 2014), 1929-1958.<br />
<br />
<br />
Derivation & Motivation of the Soft Thresholding Operator (Proximal Operator):<br />
# http://www.onmyphd.com/?p=proximal.operator<br />
# https://math.stackexchange.com/questions/471339/derivation-of-soft-thresholding-operator</div>Jdenghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_the_Number_of_Neurons_in_Deep_Networks&diff=29919Learning the Number of Neurons in Deep Networks2017-11-09T18:19:51Z<p>Jdeng: /* Model Training and Model Selection */</p>
<hr />
<div>='''Introduction'''=<br />
<br />
Due to the availability of massive datasets and powerful computational infrastructure, '''Deep Learning''' has made huge breakthroughs in many areas, like Language Modelling and Computer Vision. In essence, deep learning algorithms are a re-branding of neural networks from the 1950s, wherein we add multiple processing layers that we can now compute applications due to GPU power. It is important to note that the multiple processing layers (i.e. hidden layers) learn one-level of abstraction of data - this does not mean that we need to have numerous layers, the goal is to find the perfect number of layers such that the data that we are trying to generalize does not over-fit. In deep neural networks, we need to determine the number of layers and the number of neurons in each layer, i.e, we need to determine the number of parameters, or complexity of the model. Typically, this is determined by trial and error manually. Currently, this is mostly achieved by manually tuning these hyper-parameters using validation data or building very deep networks. However, building a very deep model is still challenging, especially for very large datasets, which leads to high cost on memory and reduction in speed.<br />
<br />
In this paper, the authors used an approach to automatically select the number of neurons in each layer when we learn the network. Their approach introduces a '''group sparsity regularizer''' on the parameters of the network, and each group acts on the parameters of one neuron, rather than training an initial network as a pre-processing step(training shallow or thin networks to mimic the behaviour of deep ones [Hinton et al., 2014, Romero et al., 2015]) and reducing neurons later as a post-processing step. We set those useless parameters to zero, which cancels out the effects of a particular neuron. Therefore, the approach does not need to learn a redundant network successfully and then reduce its parameters, instead, it learns the number of relevant neurons in each layer and the parameters of those neurons simultaneously.<br />
<br />
In the experiments on several image recognition datasets, the authors showed the effectiveness of this approach, which reduces the number of parameters by up to 80% compared to the complete model, and has no recognition accuracy loss at the same time. Actually, our approach even yields more effective and faster networks, and occupies less memory.<br />
<br />
='''Related Work'''=<br />
<br />
Recent research tends to produce very deep networks. Building very deep networks means we need to learn more parameters, which leads to significant memory costs as well as a reduction in training speed. Even though automatic model selection has developed in the past years by constructive and destructive approaches, there are some drawbacks. For '''constructive method''', it starts a super shallow architecture, and then adds additional parameters [Bello, 1992]. A similar work that adds new layers to the initial shallow networks was successfully employed [Simonyan and Zisserman, 2014] at the process of learning. However, we know shallow networks have fewer parameters, so that it cannot handle the non-linearities as effectively as the deep networks [Montufar et al., 2014], so shallow networks may easily get stuck by the bad optima. Therefore, the drawback of this method is that these networks may produce poor initializations for the later processes. The authors make this claim without ever providing any evidence for it. For '''destructive method''', it starts with a deep network and then reduces a significant number of redundant parameters [Denil et al., 2013, Cheng et al., 2015] while keeping its behaviour unchanged. Even though this technique has shown removing the redundant parameters [LeCun et al., 1990, Hassibi et al., 1993] or the neurons [Mozer and Smolensky, 1988, Ji et al., 1990, Reed, 1993] that have little influence on the output, it requires the analysis of each parameter and neuron by network Hessian, which is very computationally expensive for large architectures. The main motivation of these works was to build a more compact network. Recent approaches for the destructive model focus on learning a shallower or thinner network that mimics the behavior of an initial deeper network.<br />
<br />
In particular, building compact networks is a research focus for '''Convolutional Neural Networks''' (CNNs). Some works have proposed to decompose the filters of a pre-trained network into low-rank filters, which reduces the number of parameters [Jaderberg et al., 2014b, Denton et al., 2014, Gong et al., 2014]. The issue with this proposal is that an initial deep network must first be trained successfully, since the decomposition acts as a post-processing step. [Weigend et al., 1991] and [Collins and Kohli, 2014] used direct training to develop regularizers that eliminate some of the parameters of the network; the problem is that the number of layers and neurons in each layer is still determined manually. Very similar work using the group Lasso for CNNs was previously done in [Liu et al., 2015]. The big-picture idea appears to be very similar, but the methodologies differ in detail: [Liu et al., 2015] involves computing the network Hessian, repeated multiple times over the learning process. This is computationally expensive when dealing with large-scale datasets, and as a consequence these techniques are no longer pursued in the current large-scale era.<br />
<br />
='''Model Training and Model Selection'''=<br />
<br />
In general, a deep network has $L$ layers containing linear operations on their inputs, intertwined with non-linear activation functions, typically '''Rectified Linear Units (ReLU)''' or sigmoids. Suppose each layer $l$ has $N_{l}$ neurons, and denote the network parameters by $\Theta=(\theta_{l})_{1\leqslant{l}\leqslant{L}}$, where $\theta_{l}=({\theta^n _{l}})_{1\leqslant{n}\leqslant{N_{l}}}$ and $\theta^n _{l}=[w_{l}^{n},b_{l}^{n}]$. Here, $w_{l}^{n}$ is a linear operator acting on the layer’s input and $b_{l}^{n}$ is a bias. Given an input $x$, applying the linear, non-linear and pooling operations yields the output $\hat{y}=f(x,\Theta)$, where $f(\cdot)$ encodes the succession of linear, non-linear and pooling operations.<br />
<br />
At training time, we have $N$ input-output pairs $\{(x_{i},y_{i})\}_{1\leqslant{i}\leqslant{N}}$, and the loss function $\ell(y_{i},f(x_{i},\Theta))$ compares the predicted output with the ground truth. Typically, one chooses the logistic loss for classification and the square loss for regression. Learning the parameters of the network is then equivalent to solving the following optimization problem:<br />
$$\displaystyle \min_{\Theta}\frac{1}{N}\sum_{i=1}^{N}\ell(y_{i},f(x_{i},\Theta))+\gamma(\Theta),$$ where $\gamma(\Theta)$ represents a regularizer on the network parameters. Choices for such a regularizer include weight decay, i.e., $\gamma(\cdot)$ is the (squared) $\ell_{2}$-norm, or sparsity-inducing norms, e.g., the $\ell_{1}$-norm. The goal in this paper is to automatically determine the number of neurons in each layer, but neither of the above techniques achieves this goal. Here, the authors make use of '''group sparsity''' (GS) [Yuan and Lin, 2007], starting from an overcomplete network and cancelling the influence of some neurons. The regularizer can therefore be written as $$\gamma(\Theta)=\sum_{l=1}^{L}\beta_{l}\sqrt{P_{l}}\sum_{n=1}^{N_{l}}||\theta_{l}^{n}||_{2},$$ where $P_{l}$ is the size of the vector grouping the parameters of each neuron in layer $l$, and $\beta_{l}$ balances the influence of the penalty. In practice, the most effective way to select $\beta_{l}$ was found to be a relatively small value for the first few layers and a larger one for the remaining layers. The reason for choosing a small weight early on is that it prevents deleting too many neurons in the first few layers, so that enough information remains for learning the remaining parameters. The original premise of this paper seemed to suggest a new method different from both the constructive and destructive methods described above; however, starting with an overcomplete network and training with group sparsity appears to be no different from destructive methods. The main contribution here is then the regularization function acting on entire neurons, which is, in fairness, an interesting approach.<br />
<br />
Group sparsity helps effectively remove some of the neurons, while standard regularizers on the individual parameters remain effective for generalization [Bartlett, 1996, Krogh and Hertz, 1992, Theodoridis, 2015, Collins and Kohli, 2014]. Following this idea, the authors introduce the '''sparse group Lasso''' (SGL), a more general penalty that merges the $\ell_1$-norm of the Lasso with the group Lasso ($\ell_2$) penalty. This yields solutions that are sparse both at the individual-feature and group levels [1]. The regularizer can be written as $$\gamma(\Theta)=\sum_{l=1}^{L}\left((1-\alpha)\beta_{l}\sqrt{P_{l}}\sum_{n=1}^{N_{l}}||\theta_{l}^{n}||_{2}+\alpha\beta_{l}||\theta_{l}||_{1}\right)$$ where $\alpha\in[0,1]$. Note that $\alpha=0$ recovers the group sparsity regularizer. In practice, both $\alpha=0$ and $\alpha=0.5$ are used in the experiments.<br />
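As a concrete sanity check, the SGL penalty above is straightforward to compute. The sketch below (plain NumPy) assumes a layout of one matrix per layer whose rows are the neuron parameter vectors $\theta_{l}^{n}$; it is illustrative, not the authors' implementation.<br />

```python
import numpy as np

def sparse_group_lasso_penalty(layers, betas, alpha=0.5):
    """Compute the sparse group Lasso regularizer gamma(Theta).

    layers: list of 2-D arrays, one per layer; row n holds the parameter
            vector theta_l^n of neuron n (so P_l = number of columns).
    betas:  per-layer penalty weights beta_l.
    alpha:  mixing weight; alpha = 0 recovers plain group sparsity.
    """
    penalty = 0.0
    for theta_l, beta_l in zip(layers, betas):
        P_l = theta_l.shape[1]  # parameters per neuron in this layer
        group_term = np.sqrt(P_l) * np.linalg.norm(theta_l, axis=1).sum()
        l1_term = np.abs(theta_l).sum()  # ||theta_l||_1
        penalty += (1 - alpha) * beta_l * group_term + alpha * beta_l * l1_term
    return penalty
```

Setting `alpha=0` reproduces the group sparsity regularizer of the previous section, so one function covers both experimental settings.<br />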
<br />
This is reminiscent of the relationship among Lasso regression, ridge regression and elastic net regression (explained in Hastie et al., [https://web.stanford.edu/~hastie/Papers/ESLII.pdf The Elements of Statistical Learning], section 3.4). In Lasso regression, the penalized residual sum of squares is the regular residual sum of squares plus an L1 regularizer; in ridge regression, it is the regular residual sum of squares plus an L2 regularizer. Finally, elastic net regression combines the Lasso and ridge regularizers, optimizing an objective that includes both the L1 and L2 norms. <br />
<br />
To solve this optimization problem, the paper uses proximal gradient descent [Parikh and Boyd, 2014]. This approach iteratively takes a gradient step of size $t$ with respect to the loss, followed by a proximal step for the regularizer. The algorithm is as follows: <br />
<br />
We define the proximal operator of $f$ as $$prox_{tf}(v)=\displaystyle \arg\min_{x}\left(\frac{1}{2t}||x-v||_{2}^{2}+f(x)\right)$$ <br />
<br />
<br />
Suppose we want to minimize $f(x)+g(x)$, and the proximal gradient method is given by $$x^{(k+1)}=prox_{t^{k}g}(x^{k}-t^{k}\nabla{f}(x^{k})), k=1,2,3...$$ <br />
<br />
Therefore, we can update the parameters by the above method as $$\tilde{\theta}_{l}^{n}=\displaystyle \arg\min_{\theta_{l}^{n}}\frac{1}{2t}||\theta_{l}^{n}-\hat{\theta}_{l}^{n}||_{2}^{2}+\gamma(\Theta),$$<br />
where $\hat{\theta}_{l}^{n}$ is the solution obtained from the gradient step on the loss. By the derivation of [Simon et al., 2013], this problem has the closed-form solution: <br />
$$\tilde{\theta}_{l}^{n}=\left(1-\frac{t(1-\alpha)\beta_{l}\sqrt{P_{l}}}{||S(\hat{\theta}_{l}^{n},t\alpha\beta_{l})||_{2}}\right)_{+}S(\hat{\theta}_{l}^{n},t\alpha\beta_{l}),$$<br />
where $(\cdot)_{+}$ denotes taking the maximum between the argument and 0, and $S(\cdot)$ is the soft-thresholding operator $$S(a,b)=\mathrm{sign}(a)(|a|-b)_{+}$$<br />
In practice, stochastic gradient descent with mini-batches is used, and the variables of all the groups are updated according to the closed form of $\tilde{\theta}_{l}^{n}$. When the learning steps terminate, the neurons whose parameters have gone to zero are removed. Additionally, in fully-connected layers, neurons acting on the output of zeroed-out neurons of the previous layer also become useless and are removed accordingly.<br />
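A minimal sketch of this proximal update for a single neuron, assuming the closed-form solution of [Simon et al., 2013] quoted above (function names are mine, not the authors'):<br />

```python
import numpy as np

def soft_threshold(a, b):
    """Elementwise S(a, b) = sign(a) * max(|a| - b, 0)."""
    return np.sign(a) * np.maximum(np.abs(a) - b, 0.0)

def prox_sgl(theta_hat, t, alpha, beta_l, P_l):
    """Closed-form SGL proximal update for one neuron's parameter vector
    theta_hat (the point reached by the plain loss-gradient step)."""
    s = soft_threshold(theta_hat, t * alpha * beta_l)
    norm_s = np.linalg.norm(s)
    if norm_s == 0.0:  # the L1 part alone already zeroed the whole neuron
        return np.zeros_like(theta_hat)
    # The (.)_+ factor shrinks the group; when it hits 0 the neuron dies.
    scale = max(1.0 - t * (1 - alpha) * beta_l * np.sqrt(P_l) / norm_s, 0.0)
    return scale * s
```

Neurons whose update returns the zero vector are exactly the ones pruned after training.<br />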
<br />
='''Experiment'''=<br />
<br />
==='''Set Up'''===<br />
<br />
They use two large-scale image classification datasets, '''ImageNet''' [Russakovsky et al., 2015] and '''Places2-401''' [Zhou et al., 2015]. They also conducted additional experiments on the '''ICDAR''' character recognition dataset of [Jaderberg et al., 2014a]. <br />
<br />
For ImageNet, they used the subset containing 1000 categories, with 1.2 million training images and 50,000 validation images. Places2-401 has more than 10 million images across 401 unique scene categories, with 5,000 to 30,000 images per category. The architectures for both datasets are based on the VGG-B network (BNet) [Simonyan and Zisserman, 2014] and on DecomposeMe8 ($Dec_{8}$) [Alvarez and Petersson, 2016]. BNet has 10 convolutional layers followed by 3 fully-connected layers; in the experiment, the first 2 fully-connected layers are removed, yielding what is called $BNet^{C}$. $Dec_{8}$ contains 16 convolutional layers with 1D kernels, which model 8 2D convolutional layers. Both models were trained for a total of 55 epochs with 12,000 batches per epoch and a batch size of 48 and 180 for BNet and $Dec_{8}$, respectively. The learning rate was initialized to 0.01 and later multiplied by 0.1. They set $\beta_{l}$=0.102 for the first three layers and $\beta_{l}$=0.255 for the remaining ones.<br />
<br />
The ICDAR dataset consists of 185,639 training and 5,198 test samples split into 36 categories. The architecture here starts with 6 1D convolutional layers with max-pooling, rather than 3 convolutional layers with a maxout layer [Goodfellow et al., 2013] after each convolution, followed by one fully-connected layer. They call this architecture $Dec_{3}$. The model was trained for a total of 45 epochs with a batch size of 256 and 1,000 iterations per epoch. The learning rate was initialized to 0.1 and multiplied by 0.1 in the second, seventh and fifteenth epochs. They set $\beta_{l}$=5.1 for the first layer and $\beta_{l}$=10.2 for the remaining ones.<br />
<br />
==='''Results'''===<br />
<br />
[[File:imageNet.png]]<br />
<br />
The above table shows the accuracy comparison between the original architectures and the proposed ones. For $Dec_{8}$ on the ImageNet dataset, two additional models were evaluated: $Dec_{8}-640$ with 640 neurons per layer and $Dec_{8}-768$ with 768 neurons per layer. $Dec_{8}-640_{SGL}$ denotes the sparse group Lasso regularizer with $\alpha=0.5$ and $Dec_{8}-640_{GS}$ the group sparsity regularizer. Note that all their architectures yield an improvement over the original network except $Dec_{8}-768$. For instance, Ours-$BNet_{GS}^{C}$ increases performance by 1.6% compared to $BNet^{C}$. <br />
<br />
[[File:44.png]]<br />
<br />
[[File:2.png]]<br />
<br />
The above figures report the percentage of neurons/parameters removed by the approach for $BNet^{C}$ and $Dec_{8}$. For example, in the first figure, the approach reduces the number of neurons by over 12% and the number of parameters by around 14%, while improving generalization by 1.6% (as indicated by the accuracy gap). The left image in the first figure also shows that the reduction in the number of neurons is spread across all the layers, with the largest difference in L10. For $Dec_{8}$, in the second figure, the benefits of the approach become more significant as the number of neurons in each layer increases. For instance, $Dec_{8}-640$ with the group sparsity regularizer reduces the number of neurons by 10% and the number of parameters by 12.48%. The left image in the second figure likewise shows that the reduction in neurons is spread across all the layers. <br />
<br />
[[File:ICDA.png]]<br />
<br />
Finally, the above figure shows the experimental results for the ICDAR dataset. Here, the $Dec_{3}$ architecture was used, where the last two layers initially contain 512 neurons. The accuracy for the $MaxPool_{2D}$ baseline is 83.8% and for $Dec_{3}$ 89.3%, which means the 1D filters perform better than a network with 2D kernels. On this dataset, the model removes 38.64% of the neurons and up to 80% of the parameters with the group sparsity regularizer.<br />
<br />
All the above results indicate that the algorithm effectively reduces the number of parameters while maintaining or increasing model accuracy, and that the automatic model selection performs well on classification tasks.<br />
<br />
='''Analysis on Testing'''=<br />
<br />
The algorithm does not remove neurons during training; instead, the zeroed-out neurons are removed after training, which yields a smaller network at test time. This not only reduces the number of parameters of the network, but also decreases the computational memory cost and increases the speed. <br />
<br />
[[File:table2.png]]<br />
<br />
The above table reports the runtime, memory, and percentage of parameters removed after discarding the zeroed-out neurons. BNet and $Dec_{8}$ were tested on ImageNet, while $Dec_{3-GS}$ was tested on ICDAR. From the table, all the models speed up inference: for example, $Dec_{8}-768_{GS}$ on ImageNet reduces runtime by nearly 16% at a batch size of 8, and $Dec_{3}$ on ICDAR by nearly 50% at a batch size of 16. In terms of parameters removed, BNet, $Dec_{8}-640_{GS}$ and $Dec_{8}-768_{GS}$ shed 12.06%, 26.51%, and 46.73% respectively; more significantly, $Dec_{3-GS}$ removes 82.35% of the parameters. All of these changes show the benefits at test time. The runtimes were obtained using a single Tesla K20m, and memory estimates use RGB images of size 224 × 224 for Ours-BNet, Ours-$Dec_{8}-640_{GS}$ and Ours-$Dec_{8}-768_{GS}$, and gray-level images of size 32 × 32 for Ours-$Dec_{3-GS}$.<br />
<br />
='''Conclusion'''=<br />
<br />
In this paper, the authors have introduced an approach that relies on a group sparsity regularizer to automatically determine the number of neurons in each layer of a deep network. In the experiments, the approach not only reduces the number of parameters of the model, but also saves computational memory and increases speed at test time. A limitation of the approach, however, is that the number of layers in the network remains fixed.<br />
<br />
='''Critique'''=<br />
The authors of the paper state that "...we assume that the parameters of each neuron in layer $l$ are grouped in a vector of size $P_{l}$ and where $\lambda_{l}$ sets the influence of the penalty. Note that, in the general case, this weight can be different for each layer $l$. In practice, however, we found most effective to have<br />
two different weights: a relatively small one for the first few layers, and a larger weight for the<br />
remaining ones. This effectively prevents killing too many neurons in the first few layers, and thus<br />
retains enough information for the remaining ones." However, the authors fail to present any guidance as to what gets counted as "the first few layers" and what the relative sizes for the two weights should be even after we have chosen the "first few layers". Indeed, such choice seems to be an unaccounted component of tuning the model but this receives scant attention in the current paper. Several numerical comparisons should be carried out to allow further discussion on this question.<br />
<br />
The experiments could have included better baseline models to compare against. For example, how do we know the original model was not overly complex to begin with? It might have been a good idea for the authors to compare their sparse group Lasso method against the naive method of (blindly) reducing the number of neurons in each layer by 10-20%, as a very preliminary check. On top of that, the authors could have compared to conventional L1 and L2 regularization, which can reduce the number of non-zero parameters, as well as other techniques such as setting small weight values to zero and fine-tuning, as done in https://www.microsoft.com/en-us/research/publication/exploiting-sparseness-in-deep-neural-networks-for-large-vocabulary-speech-recognition/. The authors could also have applied the theory of ridge and Lasso regression to analyze the effect of the regularization mathematically.<br />
<br />
A rather reliable experimental comparison has been left out: the authors do not compare this method with Dropout [Srivastava, 2014], which is similar in terms of its physical effect on the network. The authors state: "[...] Note that none of the standard regularizers mentioned above achieve this goal: The former favors small parameter values, and the latter tends to cancel out individual parameters, but not complete neurons." This draws a direct comparison to regularizers while ignoring that dropout methods remove exactly complete neurons.<br />
<br />
It would have been interesting to see the performance gain on real-time applications, such as the YOLO or SSD object detectors used in self-driving cars, by incorporating the paper's approach into their convolutional neural nets. As an interesting extension, the authors could also test this group sparse regularization in deep reinforcement learning, where a convolutional neural network is used to predict the reward.<br />
<br />
As an important property of a regularizer, the influence of the group sparse regularization on avoiding overfitting is as yet unknown. Whether the number of epochs needed to achieve the same accuracy increases or decreases after applying this regularization could be studied further.<br />
<br />
The authors' claim that their approach "automatically determines the number of neurons" seems overstated at best. In reality, the approach finds redundancy in an overspecified model, which provides the size-reduction benefits outlined above. These benefits are non-trivial, but the method has no way of addressing the (albeit less likely) case of an underspecified model. Combined with the fact that the number of layers must remain fixed, the method has the feel of smart regularization rather than size learning. Together with the lack of a dropout comparison, this leaves doubts about the efficacy of the technique for model specification. If a model must be intentionally over-specified to learn the parameters, then it is hard to claim memory-reduction benefits vis-à-vis any technique stemming from an underspecified model. In any case, this may serve as an efficient technique for many of today's practical networks, which are designed to be extraordinarily massive, but labelling it a means of sizing a network is erroneous.<br />
<br />
='''References'''=<br />
<br />
P. L. Bartlett. For valid generalization the size of the weights is more important than the size of the network. In NIPS, 1996.<br />
<br />
M. G. Bello. Enhanced training algorithms, and integrated training/architecture selection for multilayer perceptron networks. IEEE Transactions on Neural Networks, 3(6):864–875, Nov 1992.<br />
<br />
Yu Cheng, Felix X. Yu, Rogério Schmidt Feris, Sanjiv Kumar, Alok N. Choudhary, and Shih-Fu Chang. An exploration of parameter redundancy in deep networks with circulant projections. In ICCV, 2015.<br />
<br />
I. J. Goodfellow, D. Warde-farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. In ICML, 2013.<br />
<br />
G. E. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. In arXiv, 2014.<br />
<br />
M. Jaderberg, A. Vedaldi, and A. Zisserman. Deep features for text spotting. In ECCV, 2014a.<br />
<br />
M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. In BMVC, 2014b.<br />
<br />
N. Simon, J. Friedman, T. Hastie, and R. Tibshirani. A sparse-group lasso. Journal of Computational and Graphical Statistics, 2013.<br />
<br />
H. Zhou, J. M. Alvarez, and F. Porikli. Less is more: Towards compact CNNs. In ECCV, 2016.<br />
<br />
Group LASSO - https://pdfs.semanticscholar.org/f677/a011b2a912e3c5c604f6872b9716cc0b8aa0.pdf<br />
<br />
Liu, Baoyuan, et al. "Sparse convolutional neural networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.<br />
<br />
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1 (January 2014), 1929-1958.<br />
<br />
<br />
Derivation & Motivation of the Soft Thresholding Operator (Proximal Operator):<br />
# http://www.onmyphd.com/?p=proximal.operator<br />
# https://math.stackexchange.com/questions/471339/derivation-of-soft-thresholding-operator</div>

Understanding the Effective Receptive Field in Deep Convolutional Neural Networks (Jdeng, 2017-11-07)
<hr />
<div>= Introduction =<br />
== What is the Receptive Field (RF) of a unit? ==<br />
[[File:understanding_ERF_fig0.png|thumbnail|450px]]<br />
The receptive field of a unit is the region of the input that is seen and responded to by that unit. When dealing with high-dimensional inputs such as images, it is impractical to connect each neuron to all neurons in the previous volume. Instead, each neuron is connected to only a local region of the input volume. The spatial extent of this connectivity is a hyper-parameter called the receptive field of the neuron (equivalently, the filter size), while the extent of the connectivity along the depth axis is always equal to the depth of the input volume [4]. For instance, take an RGB CIFAR-10 image with input size 32x32x3 (height, width, channels) and a receptive field (filter size) of 5x5: each neuron in the convolutional layer will have weights to a 5x5x3 region of the input volume, giving a total of 5*5*3 = 75 weights. Note that the extent of the connectivity along the depth axis is 3 because the depth of the input (i.e., the number of channels) is 3.<br />
<br />
An effective introduction to Receptive field arithmetic, including ways to calculate the receptive field of CNNs can be found [https://syncedreview.com/2017/05/11/a-guide-to-receptive-field-arithmetic-for-convolutional-neural-networks/ here]<br />
<br />
== Why is RF important? ==<br />
The concept of receptive field is important for understanding and diagnosing how deep convolutional neural networks (CNNs) work. Unlike in fully connected networks, where the value of each unit depends on the entire input to the network, in CNNs a pixel outside the receptive field of a unit does not affect the value of that unit. Hence, it is necessary to carefully control the receptive field to ensure that it covers the entire relevant image region. The receptive field allows the response to be most sensitive to a local region in the image and to specific stimuli; similar stimuli trigger activations of similar magnitudes [2]. The initialization of each receptive field depends on the neuron's degrees of freedom [2]. One example outlined in that paper is that "the weights can be either of the same sign or centered with zero mean. This latter case favors a response to the contrast between the central and peripheral region of the receptive field." [2] In many tasks, especially dense prediction tasks like semantic image segmentation, stereo and optical flow estimation, where a prediction is made for every single pixel in the input image, it is critical for each output pixel to have a big receptive field, so that no important information is left out when making the prediction.<br />
<br />
== How to increase RF size? ==<br />
''' Make the network deeper''' by stacking more layers; in theory this increases the receptive field size linearly, since each extra layer increases the receptive field size by the kernel size minus one.<br />
<br />
'''Add sub-sampling layers''' to increase the receptive field size multiplicatively. Classical sub-sampling is simply average pooling with a learnable weight per feature map: it acts like low-pass filtering followed by downsampling.<br />
<br />
Modern deep CNN architectures like the VGG networks and Residual Networks use a combination of these techniques.<br />
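The standard receptive field arithmetic from the guide linked above can be sketched in a few lines; this is a generic illustration of the two growth regimes, not code from the paper:<br />

```python
def receptive_field(layers):
    """Theoretical receptive field of one output unit for a stack of
    (kernel_size, stride) layers, via r <- r + (k - 1) * jump, where
    jump is the accumulated stride (input pixels per output step)."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# Stacking 3x3 stride-1 convs grows the RF linearly (by k - 1 each) ...
assert receptive_field([(3, 1)] * 4) == 9
# ... while interleaving stride-2 subsampling compounds the growth.
assert receptive_field([(3, 1), (2, 2), (3, 1), (2, 2), (3, 1)]) == 18
```

The second stack reaches a larger receptive field with one fewer convolutional layer, which is exactly why modern architectures mix both techniques.<br />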
<br />
== Intuition behind Effective Receptive Fields ==<br />
The pixels at the center of an RF have a much larger impact on an output:<br />
* In the forward pass, central pixels can propagate information to the output through many different paths, while the pixels in the outer area of the receptive field have very few paths to propagate its impact. <br />
* In the backward pass, gradients from an output unit are propagated across all the paths, and therefore the central pixels receive a gradient of much larger magnitude from that output [do more paths always mean a larger gradient?]. <br />
* Not all pixels in a receptive field contribute equally to an output unit's response.<br />
<br />
The authors prove that in many cases the distribution of impact within a receptive field is Gaussian. Since Gaussian distributions decay quickly from the center, the effective receptive field occupies only a fraction of the theoretical receptive field.<br />
<br />
The authors have correlated the theory of effective receptive field with some empirical observations. One such observation is that the random initializations lead some deep CNNs to start with a small effective receptive field, which then grows on training, which indicates a bad initialization bias.<br />
<br />
= Theoretical Results =<br />
<br />
The authors wanted to mathematically characterize how much each input pixel in a receptive field can impact the output of a unit $n$ layers up the network, particularly as $n$ grows large. More specifically, assume that pixels on each layer are indexed by $(i,j)$ with their centre at $(0,0)$. Denote the pixel on the $p$th layer as $x_{i,j}^p$, with $x_{i,j}^0$ the input to the network and $y_{i,j}=x_{i,j}^n$ the output on the $n$th layer; we want to know how much each $x_{i,j}^0$ contributes to $y_{0,0}$. The effective receptive field (ERF) of this central output unit can then be defined as the region containing input pixels with a non-negligible impact on it. <br />
<br />
They used the partial derivative $\frac{\partial y_{0,0}}{\partial x_{i,j}^0}$ as the measure of such impact, which can be computed using backpropagation. Assuming $l$ as an arbitrary loss by the chain rule we can write $\frac{\partial l}{\partial x_{i,j}^0} = \sum_{i',j'}\frac{\partial l}{\partial y_{i',j'}}\frac{\partial y_{i',j'}}{\partial x_{i,j}^0}$. Now if $\frac{\partial l}{\partial y_{0,0}} =1$ and $\frac{\partial l}{\partial y_{i,j}}=0$ for all $i \neq 0$ and $j \neq 0$, then $\frac{\partial l}{\partial x_{i,j}^0} =\frac{\partial y_{0,0}}{\partial x_{i,j}^0}$.<br />
<br />
For networks without nonlinearity (i.e., linear networks), this measure is independent of the input and depends only on the weights of the network and (i, j), which clearly shows how the impact of the pixels in the receptive field distributes.<br />
<br />
===Simplest case: Stack of convolutional layers of weights equal to 1===<br />
<br />
The authors first considered the case of $n$ convolutional layers using $k \times k$ kernels of stride 1, with a single channel on each layer and no nonlinearity or bias. <br />
<br />
<br />
For this special sub-case, the kernel was a $k \times k$ matrix of 1's. Since this kernel is separable to $k \times 1$ and $1 \times k$ matrices, the $2D$ convolution could be replaced by two $1D$ convolutions. This allowed the authors to focus their analysis on the $1D$ convolutions.<br />
<br />
For this case, if we denote the gradient signal $\frac{\partial l}{\partial y_{i,j}}$ by $u(t)$ and the kernel by $v(t)$, we have<br />
<br />
\begin{equation*}<br />
u(t)=\delta(t),\\ \quad v(t) = \sum_{m=0}^{k-1} \delta(t-m), \quad \text{where} \begin{cases} \delta(t)= 1\ \text{if}\ t=0, \\ \delta(t)= 0\ \text{if}\ t\neq 0, \end{cases}<br />
\end{equation*}<br />
and $t =0,1,-1,2,-2,...$ indexes the pixels.<br />
<br />
The gradient signal $o(t)$ on the input pixels can now be computed by convolving $u(t)$ with $n$ such $v(t)$'s so that $o(t) = u *v* ...*v$. <br />
<br />
Since convolution in time domain is equivalent to multiplication in Fourier domain, we can write<br />
<br />
\begin{equation*}<br />
U(w) = \sum_{t=-\infty}^{\infty} u(t) e^{-jwt}=1,\\<br />
V(w) = \sum_{t=-\infty}^{\infty} v(t) e^{-jwt}=\sum_{m=0}^{k-1} e^{-jwm},\\<br />
O(w) = F(o(t))=F(u(t)*v(t)*...*v(t)) = U(w).V(w)^n = \Big ( \sum_{m=0}^{k-1} e^{-jwm} \Big )^n,<br />
\end{equation*}<br />
<br />
where $O(w)$, $U(w)$, and $V(w)$ are discrete Fourier transformations of $o(t)$, $u(t)$, and $v(t)$.<br />
Next, we need to apply the inverse Fourier transform:<br />
\begin{equation*}<br />
o(t) = \frac{1}{2\pi} \int_{-\pi}^{\pi} (\sum_{m=0}^{k-1}e^{-j\omega m})^n e^{j \omega t} \ d\omega<br />
\end{equation*}<br />
<br />
<br />
Now let us consider two non-trivial cases.<br />
<br />
'''Case K=2:''' In this case $( \sum_{m=0}^{k-1} e^{-jwm} )^n = (1 + e^{-jw})^n$. Because $O(w)= \sum_{t=-\infty}^{\infty} o(t) e^{-jwt}= (1 + e^{-jw})^n$, we can read off $o(t)$ as the coefficient of $e^{-jwt}$, which gives the standard binomial coefficient $o(t)= \binom{n}{t}$. As $n$ becomes large, the binomial coefficients distribute with respect to $t$ like a Gaussian distribution. More specifically, when $n \to \infty$ we can write<br />
<br />
<br />
\begin{equation*}<br />
\binom{n}{t} \sim \frac{2^n}{\sqrt{\frac{n\pi}{2}}}e^{-d^{2}/2n}, <br />
\end{equation*}<br />
<br />
where $d = n-2t$ (see [https://en.wikipedia.org/wiki/Binomial_coefficient Binomial coefficient]).<br />
<br />
'''Case K>2:''' In this case the coefficients are known as "extended binomial coefficients" or "polynomial<br />
coefficients", and they too distribute like Gaussian [5].<br />
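The Gaussian limit of the K=2 case is easy to verify numerically; the check below (standard library only, illustrative rather than from the paper) compares the binomial coefficients against the stated approximation with $d = n - 2t$:<br />

```python
import math

n = 100  # number of stacked layers with k = 2 kernels
coeffs = [math.comb(n, t) for t in range(n + 1)]

# Gaussian approximation 2^n / sqrt(n*pi/2) * exp(-d^2 / (2n)), d = n - 2t
approx = [2**n / math.sqrt(n * math.pi / 2) * math.exp(-(n - 2 * t)**2 / (2 * n))
          for t in range(n + 1)]

# Relative error at the centre of the profile is already tiny for n = 100.
rel_err = abs(coeffs[n // 2] - approx[n // 2]) / coeffs[n // 2]
assert rel_err < 0.01
```

The same decay away from the centre is what makes the effective receptive field so much smaller than the theoretical one.<br />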
<br />
=== Random Weights===<br />
Denote by $g(i, j, p) = \frac{\partial l}{\partial x_{i,j}^p}$ the gradient on the $p$th layer, with $g(i, j, n) = \frac{\partial l}{\partial y_{i,j}}$. Then $g(\cdot, \cdot, 0)$ is the desired gradient image of the input. Backpropagation convolves $g(\cdot, \cdot, p)$ with the $k \times k$ kernel to get $g(\cdot, \cdot, p-1)$ for each $p$, so we can write<br />
<br />
\begin{equation*}<br />
g(i,j,p-1) = \sum_{a=0}^{k-1} \sum_{b=0}^{k-1} w_{a,b}^p g(i+a,j+b,p),<br />
\end{equation*}<br />
<br />
where $w_{a,b}^p$ is the convolution weight at $(a, b)$ in the convolution kernel on layer $p$. In this case, the initial weights are independently drawn from a fixed distribution with zero mean and variance $C$. Assuming that the gradients $g$ are independent of the weights (valid for linear networks only) and given that $\mathbb{E}_w[w_{a,b}^p] =0$,<br />
<br />
\begin{equation*}<br />
\mathbb{E}_{w,input}[g(i,j,p-1)] = \sum_{a=0}^{k-1} \sum_{b=0}^{k-1} \mathbb{E}_w[w_{a,b}^p] \mathbb{E}_{input}[g(i+a,j+b,p)]=0,\\<br />
Var[g(i,j,p-1)] = \sum_{a=0}^{k-1} \sum_{b=0}^{k-1} Var[w_{a,b}^p] Var[g(i+a,j+b,p)]= C\sum_{a=0}^{k-1} \sum_{b=0}^{k-1} Var[g(i+a,j+b,p)].<br />
\end{equation*}<br />
<br />
Therefore, to get $Var[g(\cdot, \cdot, p-1)]$ we can convolve the gradient-variance image $Var[g(\cdot, \cdot, p)]$ with a $k \times k$ kernel of 1’s and then multiply by $C$. Comparing this to the simplest case of all weights equal to one, we see that $g(\cdot, \cdot, 0)$ again has a Gaussian shape, up to an extra constant factor $C^n$ on the variance images, which does not affect the relative distribution within the receptive field.<br />
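This recipe, convolving a delta at the output unit with a kernel of ones once per layer, can be simulated directly in 1D. The sketch below is illustrative (the normalization is arbitrary, since only the relative profile within the receptive field matters):<br />

```python
import numpy as np

def erf_profile(n_layers, k):
    """Propagate the gradient(-variance) image backward through n_layers
    convolutions with a length-k kernel of ones, ignoring the constant
    C^n factor, and return the profile normalized to a peak of 1."""
    size = n_layers * (k - 1) + 1  # theoretical receptive field width
    g = np.zeros(size)
    g[size // 2] = 1.0             # delta placed at the central output unit
    ones = np.ones(k)
    for _ in range(n_layers):
        g = np.convolve(g, ones, mode="same")
    return g / g.max()

profile = erf_profile(n_layers=20, k=3)
# The impact decays like a Gaussian away from the centre: well under 10%
# of the peak halfway to the edge of the theoretical receptive field.
assert profile[len(profile) // 4] < 0.1 * profile[len(profile) // 2]
```

Counting the pixels above some small threshold in `profile` gives a direct feel for how much smaller the ERF is than the theoretical receptive field.<br />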
<br />
=== Non-uniform Kernels ===<br />
In the case of non-uniform weighting, assume the kernel weights $w(m)$ are normalized, i.e. $\sum_{m=0}^{k-1} w(m)=1$. Write $S_n = \sum_{i=1}^n X_i$, where the $X_i$'s are i.i.d. multinomial variables distributed according to the $w(m)$'s, i.e. $p(X_i = m) = w(m)$. The Central Limit Theorem (provable here via characteristic functions) states that as $n \to \infty$, the distribution of $\sqrt{n}(\frac{1}{n}S_n - E[X])$ converges to the Gaussian $N(0,Var[X])$ in distribution. We then have:<br />
<br />
\begin{equation*}<br />
E[S_n] = n\sum_{m=0}^{k-1} mw(m),\\<br />
Var[S_n] = n \left (\sum_{m=0}^{k-1} m^2w(m) - \left (\sum_{m=0}^{k-1} mw(m) \right )^2 \right ),<br />
\end{equation*}<br />
<br />
If we take one standard deviation as the effective receptive field (ERF) size which is roughly the radius of the ERF, then this size is<br />
$\sqrt{Var[S_n]} = \sqrt{nVar[X_i]} = \mathcal{O}(\sqrt{n})$.<br />
<br />
On the other hand, stacking more convolutional layers implies that the theoretical receptive field grows linearly, therefore relative to the theoretical receptive field, the ERF actually shrinks at a rate of $\mathcal{O}(1/\sqrt{n})$.<br />
<br />
In the special case of uniform weighting $w(m) = \frac{1}{k}$, the ERF size also grows linearly as a function of the kernel size $k$:<br />
<br />
\begin{equation*}<br />
\sqrt{Var[S_n]} = \sqrt{n}\sqrt{\sum_{m=0}^{k-1}\frac{m^2}{k} - \bigg(\sum_{m=0}^{k-1}\frac{m}{k}\bigg)^2} = \sqrt{\frac{n(k^2-1)}{12}} = \mathcal{O}(k\sqrt{n})<br />
\end{equation*}<br />
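These formulas are easy to sanity-check in code. A small NumPy sketch (the function name is ours) computes the one-standard-deviation ERF radius $\sqrt{Var[S_n]}$ for an arbitrary normalized kernel and verifies both the closed form above and the $\mathcal{O}(\sqrt{n})$ growth:<br />
<br />
```python
import numpy as np

def erf_radius(n, w):
    """One-standard-deviation ERF radius sqrt(Var[S_n]) after n layers,
    for normalized 1D kernel weights w(m), m = 0..k-1."""
    w = np.asarray(w, dtype=float)
    m = np.arange(len(w))
    var_x = np.sum(m**2 * w) - np.sum(m * w) ** 2   # Var[X] for one layer
    return np.sqrt(n * var_x)

n, k = 50, 5
r_uniform = erf_radius(n, np.ones(k) / k)
# closed form for the uniform kernel: sqrt(n (k^2 - 1) / 12)
assert np.isclose(r_uniform, np.sqrt(n * (k**2 - 1) / 12))
# O(sqrt(n)): quadrupling the depth only doubles the ERF radius
assert np.isclose(erf_radius(4 * n, np.ones(k) / k), 2 * r_uniform)
```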
<br />
=== Non-linear Activation Functions===<br />
<br />
The math in this section is a bit "hand-wavy", as one of their reviewers wrote, and their conclusion (a Gaussian-shaped ERF) is not really well backed up by their experiments. The most important takeaway here is that with the introduction of a nonlinear activation function, the gradients depend on the network's input as well.<br />
<br />
=== Dropout ===<br />
Dropout is a technique that sets each unit in a neural network randomly to zero during training,<br />
which has found great success as a regularizer to prevent deep networks from<br />
over-fitting. The authors show that dropout does not change the Gaussian ERF<br />
shape.<br />
<br />
=== Subsampling and Dilated Convolutions ===<br />
Subsampling reduces the resolution of the convolutional feature maps, and makes<br />
each of the following convolutional layers operate on a larger scale. It is<br />
therefore a great way to increase the receptive field. Subsampling followed by<br />
convolutional layers can be equivalently implemented as changing all the<br />
convolutional layers after subsampling from dense convolutions to dilated<br />
convolutions. Thus we can apply the same theory we developed above to understand<br />
networks with subsampling layers. However, with exponentially growing receptive<br />
field introduced by the subsampling or exponentially dilated convolutions, many<br />
more layers are needed to see the Gaussian shape clearly.<br />
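To see the difference in growth rates concretely, here is a short sketch (the function name is ours) based on the standard receptive-field recurrence, with dilated stride-1 layers standing in for the equivalent dense-resolution view of convolution after subsampling:<br />
<br />
```python
def theoretical_rf(layers):
    """Theoretical 1D receptive field of a stack of stride-1 conv layers
    given as (kernel_size, dilation) pairs; a dilated layer here stands in
    for the dense-resolution view of convolution after subsampling."""
    r = 1
    for k, d in layers:
        r += (k - 1) * d
    return r

# four dense 3x3 layers: linear growth
assert theoretical_rf([(3, 1)] * 4) == 9                      # 1 + 4*2
# four 3x3 layers with dilations 1, 2, 4, 8: exponential growth
assert theoretical_rf([(3, d) for d in (1, 2, 4, 8)]) == 31   # 1 + 2*(1+2+4+8)
```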
<br />
=== Skip Connections ===<br />
Skip connections are another type of popular architecture designs for deep<br />
neural networks in general. Recent state-of-the-art models for image<br />
recognition, in particular the Residual Networks (ResNets) make extensive use of<br />
skip connections. The ResNet architecture is composed of residual blocks; each residual block has two pathways: one is a path of q (usually 2) convolutional layers plus nonlinearity and batch normalization, the other is a skip connection that goes directly from the input to the output. The output is simply the sum of the results of the two pathways. The authors do not derive an explicit expression for the ERF size with skip connections, but it is smaller than the largest possible receptive field, which is achieved when the pathway that goes through the convolutional layers is chosen in every residual block.<br />
<br />
=== Remarks ===<br />
The authors note three critical assumptions in the analyses above: (1) all layers in the CNN use the same set of convolution weights. This is in general not true; however, when we apply the analysis of variance, the weight variance on all layers is usually the same up to a constant factor. (2) The convergence derived is convergence “in distribution”, as implied by the central limit theorem. So it is neither convergence almost surely nor in probability; in other words, we are not able to guarantee convergence for any single model. (3) Although the CLT gives the limit distribution of $\frac{1}{\sqrt{n}} S_n$, the distribution of $S_n$ itself does not have a limit, and its "deviation" from a corresponding normal distribution can be large on some finite set, but it is still Gaussian in terms of overall shape.<br />
<br />
= Verifying Theoretical Results =<br />
In all of the following experiments, a gradient signal of 1 was placed at the center of the output plane and 0 everywhere else, and then this gradient was backpropagated through the network to get input gradients. Also, random inputs, as well as proper random initialization of the kernels, were employed.<br />
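For the linear random-weight case this setup can be sketched without any deep-learning framework: backpropagating the center delta through a 'valid' convolution is just a 'full'-mode correlation of the output gradient with the kernel. The helper names below are ours:<br />
<br />
```python
import numpy as np

rng = np.random.default_rng(0)

def backprop_step(g, w):
    """Backprop through one 'valid' k x k convolution: 'full'-mode
    correlation of the output gradient with the kernel."""
    k = w.shape[0]
    H, W = g.shape
    out = np.zeros((H + k - 1, W + k - 1))
    for a in range(k):
        for b in range(k):
            out[a:a + H, b:b + W] += w[a, b] * g
    return out

def erf_linear(n, k, C, runs):
    """Mean squared input gradient over several random-weight draws,
    starting from a unit gradient at the single center output pixel."""
    acc = 0.0
    for _ in range(runs):
        g = np.ones((1, 1))                           # delta at the output
        for _ in range(n):
            w = rng.normal(0.0, np.sqrt(C), (k, k))   # fresh random kernel
            g = backprop_step(g, w)
        acc = acc + g**2
    return acc / runs

erf = erf_linear(n=6, k=3, C=0.1, runs=50)
assert erf.shape == (13, 13)        # theoretical RF: 1 + 6*(3-1)
assert erf[6, 6] > erf[0, 0]        # impact decays sharply toward the corner
```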
<br />
<br />
'''ERFs are Gaussian distributed:''' By looking at the figure, [[File:understanding_ERF_fig1.png|thumbnail||600px]] we can observe Gaussian shapes for uniformly and randomly weighted convolution kernels without nonlinear activations, and near-Gaussian shapes for randomly weighted kernels with nonlinearity. Adding the ReLU nonlinearity makes the distribution a bit less Gaussian, as the ERF distribution then depends on the input as well. Another reason is that ReLU units output exactly zero for half of their inputs, so it is very easy to get a zero output for the center pixel on the output plane, which means no path from the receptive field can reach the output and hence the gradient is all zero. Here the ERFs are averaged over 20 runs with different random seeds. <br />
<br />
<br />
<br />
<br />
<br />
Figures below show the ERF for networks with 20 layers of random weights, with different nonlinearities. Here the results are averaged both across 100 runs with different random weights as well as different random inputs. In this setting, the receptive fields are a lot more Gaussian-like. <br />
<br />
[[File:understanding_ERF_fig2.png|thumbnail|centre|400px]]<br />
<br />
<br />
''' <math>\sqrt{n}</math> absolute growth and <math>1/\sqrt{n}</math> relative shrinkage:''' The figure [[File:understanding_ERF_fig4.png|thumbnail||600px]] shows how the ERF size and the ratio of ERF over theoretical RF change with the number of convolution layers. The fitted line for ERF size has a slope of 0.56 in the log domain, while the line for the ERF ratio has a slope of -0.43. This indicates that the ERF size grows linearly with <math>\sqrt{n}</math> and the ERF ratio shrinks linearly with <math>1/\sqrt{n}</math>.<br />
They used 2 standard deviations as the measurement for ERF size, i.e. any pixel with a value greater than $1 - 95.45\% = 4.55\%$ of the center point's value is considered part of the ERF. The ERF size is represented by the square root of the number of pixels within the ERF, while the theoretical RF size is the side length of the square in which every pixel has a non-zero impact on the output pixel, no matter how small. All experiments here are averaged over 20 runs.<br />
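The two-standard-deviation measurement described above can be written down directly; the sketch below (names ours) applies the cutoff to a synthetic 2D Gaussian gradient image:<br />
<br />
```python
import numpy as np

def erf_size(grad_img, frac=1 - 0.9545):
    """ERF size as the square root of the number of pixels whose value
    exceeds (1 - 95.45%) of the center pixel's value."""
    c = grad_img[grad_img.shape[0] // 2, grad_img.shape[1] // 2]
    return np.sqrt(np.count_nonzero(grad_img > frac * c))

# Sanity check on a synthetic isotropic Gaussian gradient image, sigma = 10:
# the cutoff keeps roughly the disk of radius sigma * sqrt(2 ln(1/0.0455)),
# about 2.49 * sigma, so the size is near sqrt(pi * 24.9^2), i.e. ~44 pixels.
x = np.arange(-50, 51)
xx, yy = np.meshgrid(x, x)
gauss = np.exp(-(xx**2 + yy**2) / (2 * 10.0**2))
assert 42 < erf_size(gauss) < 46
```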
<br />
<br />
'''Subsampling & dilated convolution increase the receptive field:''' The figure shows the effect of subsampling and dilated convolution. The reference baseline is a CNN with 15 dense convolution layers; its ERF is shown in the left-most figure. Replacing 3 of the 15 convolutional layers with stride-2 convolutions results in the ERF in the ‘Subsample’ figure. Finally, replacing those 3 convolutional layers with dilated convolutions with factors 2, 4 and 8 gives the ‘Dilation’ figure. Both are able to increase the effective receptive field significantly. Note the ‘Dilation’ figure shows a rectangular, grid-like ERF shape that is typical for dilated convolutions, a consequence of the kernel taps sampling the input on a sparse regular grid.<br />
<br />
[[File:understanding_ERF_fig3.png|thumbnail|centre|400px]]<br />
<br />
== How the ERF evolves during training ==<br />
<br />
The authors looked at how the ERF of units in the top-most convolutional layers of a classification CNN and a semantic segmentation CNN evolves during training. For both tasks, they adopted the ResNet architecture, which makes extensive use of skip connections. As expected, their analysis showed the ERF of these networks is significantly smaller than the theoretical receptive field. Also, as the networks learned, the ERF got bigger, so that at the end of training it was significantly larger than the initial ERF. <br />
<br />
The classification network was a ResNet with 17 residual blocks trained on the CIFAR-10 dataset. The figure shows the ERF on the 32x32 image space at the beginning of training (with randomly initialized weights) and at the end of training, when the network reaches its best validation accuracy. Note that the theoretical receptive field of the network is actually 74x74, bigger than the image size, but the ERF does not fill the image completely. Comparing the results before and after training<br />
demonstrates that the ERF has grown significantly.<br />
<br />
[[File:understanding_ERF_fig5.png|thumbnail|centre|500px]]<br />
<br />
The semantic segmentation network was trained on the CamVid dataset for urban scene segmentation. The 'front-end' of the model was a purely convolutional network that predicted the output at a slightly lower resolution, followed by a ResNet with 16 residual blocks interleaved with 4 subsampling operations, each with a factor of 2. Due to the subsampling operations, the output was 1/16 of the input size. For this model, the theoretical RF of the top convolutional layer units was 505x505. However, as the figure shows, the ERF only covered a fraction of that: a diameter of about 100 at the beginning of training, growing to a diameter of around 150 by the end of training.<br />
<br />
= Reduce Gaussian Damage =<br />
The Effective Receptive Field (ERF) usually decays quickly from the centre (like 2D Gaussian) and only takes a small portion of the theoretical Receptive Field (RF). This "Gaussian damage" is undesirable for tasks that require a large RF and to reduce it, the authors suggested two solutions:<br />
#'''New initialization scheme''' to make the weights at the center of the convolution kernel smaller and the weights on the outside larger, which diffuses the concentration at the center out to the periphery. One way to implement this is to initialize the network with any initialization method, and then scale the weights according to a distribution that has a lower scale at the center and a higher scale on the outside. They tested this solution for the CIFAR-10 classification task, with several random seeds. In a few cases, they got a 30% speed-up of training compared to more standard initializations, but overall the benefit of this method is not always significant. This is only a partial solution, since no matter what is done to change the initial weights, the ERF maintains a Gaussian distribution. Weight initialization matters for deep learning models in general: initial weights that are too large may result in exploding values during forward propagation or backpropagation. One popular initialization method is the Xavier initialization, proposed by Glorot and Bengio [6]. <br />
<br />
#'''Architectural changes of CNNs''' is the 'better' approach that may change the ERF in more fundamental ways. For example, instead of connecting each unit in a CNN to a local rectangular convolution window, we can sparsely connect each unit to a larger area in the lower layer using the same number of connections. Dilated convolution belongs to this category, but we may push even further and use sparse connections that are not grid-like.<br />
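A minimal sketch of the first suggestion, with entirely hypothetical names and scaling schedule (the paper does not specify the exact rescaling distribution): draw weights from a standard scheme, then scale each kernel tap by its radius so the center is suppressed:<br />
<br />
```python
import numpy as np

rng = np.random.default_rng(1)

def center_suppressed_init(k, c_in, c_out, center_scale=0.5):
    """Hypothetical rescaling: draw weights from a standard scheme, then
    scale each kernel tap by a gain that is small at the center and ~1 at
    the periphery (the paper does not fix a particular schedule)."""
    w = rng.normal(0.0, np.sqrt(2.0 / (k * k * c_in)), (c_out, c_in, k, k))
    ax = np.arange(k) - (k - 1) / 2
    r = np.sqrt(ax[None, :] ** 2 + ax[:, None] ** 2)        # tap radius
    gain = center_scale + (1 - center_scale) * r / r.max()  # 0.5 -> 1.0
    return w * gain

w = center_suppressed_init(k=5, c_in=3, c_out=8)
assert w.shape == (8, 3, 5, 5)
# the center tap ends up with a smaller spread than the corner taps
assert np.std(w[:, :, 2, 2]) < np.std(w[:, :, 0, 0])
```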
<br />
= Discussion =<br />
<br />
'''Connection to biological neural networks''': The authors established through their analysis that the ERF grows a lot more slowly than was previously thought, which indicates that a lot of local information is still preserved even after many convolutional layers. This contradicts the long-held notion that, as in deep biological networks, units deep in the hierarchy lose local information about the input. Another relevant observation from their analysis is that convolutional networks may automatically create a form of foveal representation.<br />
<br />
'''Connection to previous work on CNNs''': Though receptive fields in CNNs have not been studied extensively, some previous works explore how the variance of activations changes little as signals pass through<br />
the network; this observation was used to develop a good initialization scheme for convolution layers. Researchers have also used visualization to show the importance of using natural-image priors and to show what an activation of a convolutional layer represents. Deconvolutional nets have been used to show the relation between pixels in the image and the neurons that are firing. <br />
<br />
= Summary & Conclusion =<br />
The authors showed, theoretically and experimentally, that the distribution of impact within the receptive field (the effective receptive field) is asymptotically Gaussian, and the ERF only takes up a fraction of the full theoretical receptive field. They also studied the effects of some standard CNN approaches on the effective receptive field. They found that dropout does not change the Gaussian ERF shape. Subsampling and dilated convolutions are effective ways to increase receptive field size quickly but skip-connections make ERFs smaller.<br />
<br />
They argued that since larger ERFs are required for higher performance, new methods to achieve larger ERF will not only help the network to train faster but may also improve performance.<br />
<br />
= Critique = <br />
<br />
The authors' finding on $\sqrt{n}$ absolute growth of Effective Receptive Field (ERF) suffers from a discrepancy in ERF definition between their theoretical analysis and their experiments. Namely, in the theoretical analysis for the non-uniform-kernel case, they considered one standard deviation as the ERF size. However, they used two standard deviations as the measure for ERF size in the experiments.<br />
<br />
It would be more practical if the paper also investigated the ERF for natural images (as opposed to random) as network input at least in the two cases where they examined trained networks. <br />
<br />
The authors claim that the ERF results in the experimental section have Gaussian shapes but they never prove this claim. For example, they could fit different 2D functions, including a 2D Gaussian, to the kernels and show that the 2D Gaussian gives the best fit. Furthermore, the pictures given as evidence for the claim that the ERF has a Gaussian distribution only show the ERF of the center pixel of the output, <math> y_{0,0} </math>. Intuitively, the ERF of a node near the boundary of the output layer may have a significantly different shape. This was not addressed in the paper.<br />
<br />
Another weakness is in the discussion section, where they make a connection to the biological networks. They jumped to disprove a well-observed phenomenon in the brain. The fact that the neurons in the higher areas of the visual hierarchy gradually lose their retinotopic property has been shown in a countless number of neuroscience studies. For example, [https://en.wikipedia.org/wiki/Grandmother_cell grandmother cells] do not care about the position of grandmother's face in the visual field. In general, the similarity between deep CNNs and biological visual systems is not as strong, hence we should take any generalization from CNNs to biological networks with a grain of salt.<br />
<br />
Spectrograms are visual representations of audio where the axes represent time, frequency and the amplitude of each frequency. The ERF of a CNN, when applied to a spectrogram, doesn't necessarily have to decay as a Gaussian from the center. In fact, many receptive fields are trained to look for peaks, troughs and cliffs, which essentially implies that the ERF will have more weight towards the outside rather than the center.<br />
<br />
The paper talks about what the ERF represents and how it can be increased, but doesn't say how the ERF can be used to improve model accuracy by changing the configuration of the network, say the depth of the network or the kernel size. In addition, since the receptive field plays an important role in Region-CNNs and the ERF can provide useful information during object detection, it would be better if the authors had also analyzed how different ERF properties influence the mAP in object detection.<br />
<br />
= References =<br />
[1] Wenjie Luo, Yujia Li, Raquel Urtasun, and Richard Zemel. "Understanding the effective receptive field in deep convolutional neural networks." In Advances in Neural Information Processing Systems, pp. 4898-4906. 2016.<br />
<br />
[2] Buessler, J.-L., Smagghe, P., & Urban, J.-P. (2014). Image receptive fields for artificial neural networks. Neurocomputing, 144(Supplement C), 258–270. https://doi.org/10.1016/j.neucom.2014.04.045<br />
<br />
[3] Dilated Convolutions in Neural Networks. http://www.erogol.com/dilated-convolution/<br />
<br />
[4] http://cs231n.github.io/convolutional-networks/<br />
<br />
[5] Thorsten Neuschel. "A note on extended binomial coefficients." Journal of Integer Sequences, 17(2):3, 2014.<br />
<br />
[6] Glorot, Xavier, and Yoshua Bengio. "Understanding the difficulty of training deep feedforward neural networks." Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. 2010.</div>

Jdeng. Understanding the Effective Receptive Field in Deep Convolutional Neural Networks. 2017-11-07T19:11:21Z. http://wiki.math.uwaterloo.ca/statwiki/index.php?title=Understanding_the_Effective_Receptive_Field_in_Deep_Convolutional_Neural_Networks&diff=29670 <p>Jdeng: /* References */</p>
<hr />
<div>= Introduction =<br />
== What is the Receptive Field (RF) of a unit? ==<br />
[[File:understanding_ERF_fig0.png|thumbnail|450px]]<br />
The receptive field of a unit is the region of the input that is seen and responded to by the unit. When dealing with high-dimensional inputs such as images, it is impractical to connect each neuron to all neurons in the previous volume. Instead, we connect each neuron to only a local region of the input volume. The spatial extent of this connectivity is a hyper-parameter called the receptive field of the neuron (equivalently, this is the filter size), while the extent of the connectivity along the depth axis is always equal to the depth of the input [4]. For instance, take an RGB CIFAR-10 image, which has an input size of 32x32x3 (height, width, channels), and a receptive field (or filter size) of 5x5; then each neuron in the convolutional layer will have weights to a 5x5x3 region of the input volume, giving a total of 5*5*3 = 75 weights. Note that the extent of the connectivity along the depth axis is 3 because the depth of the input is 3 (i.e. the channels).<br />
<br />
An effective introduction to Receptive field arithmetic, including ways to calculate the receptive field of CNNs can be found [https://syncedreview.com/2017/05/11/a-guide-to-receptive-field-arithmetic-for-convolutional-neural-networks/ here]<br />
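The receptive-field arithmetic from the linked guide can be condensed into a few lines; the sketch below (function name ours) tracks the receptive field $r$ and the cumulative 'jump' $j$ (the stride measured in input pixels) through a stack of layers:<br />
<br />
```python
def receptive_field(layers):
    """Theoretical RF size r and output 'jump' j (stride in input pixels)
    for a stack of layers given as (kernel_size, stride) pairs, using the
    standard recurrence r += (k - 1) * j; j *= s."""
    r, j = 1, 1
    for k, s in layers:
        r += (k - 1) * j
        j *= s
    return r, j

# three 3x3 stride-1 convs: 7x7 receptive field
assert receptive_field([(3, 1)] * 3) == (7, 1)
# an 11x11 stride-4 conv followed by a 3x3 stride-2 pool
assert receptive_field([(11, 4), (3, 2)]) == (19, 8)   # 1 + 10*1 + 2*4
```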
<br />
== Why is RF important? ==<br />
The concept of receptive field is important for understanding and diagnosing how deep convolutional neural networks (CNNs) work. Unlike in fully connected networks, where the value of each unit depends on the<br />
entire input to the network, in CNNs any part of the input image outside the receptive field of a unit does not affect that unit's value. Hence, it is necessary to carefully control the receptive field, to ensure that it covers the entire relevant image region. The receptive field property allows the response to be most sensitive to a local region of the image and to specific stimuli; similar stimuli trigger activations of similar magnitudes [2]. The initialization of each receptive field depends on the neuron's degrees of freedom [2]. One example outlined in that paper is that "the weights can be either of the same sign or centered with zero mean. This latter case favors a response to the contrast between the central and peripheral region of the receptive field." [2]. In many tasks, especially dense prediction tasks like semantic image segmentation, stereo, and optical flow estimation, where we make a prediction for every single pixel in the input image, it is critical for each output pixel to have a big receptive field, such that no important information is left out when making the prediction.<br />
<br />
== How to increase RF size? ==<br />
''' Make the network deeper''' by stacking more layers, which in theory increases the receptive field size linearly, as<br />
each extra layer increases the receptive field size by (kernel size - 1).<br />
<br />
'''Add sub-sampling layers''' to increase the receptive field size multiplicatively. Sub-sampling is simply average pooling with learnable weights per feature map; it acts like low-pass filtering followed by downsampling.<br />
<br />
Modern deep CNN architectures like the VGG networks and Residual Networks use a combination of these techniques.<br />
<br />
== Intuition behind Effective Receptive Fields ==<br />
The pixels at the center of an RF have a much larger impact on an output:<br />
* In the forward pass, central pixels can propagate information to the output through many different paths, while the pixels in the outer area of the receptive field have very few paths to propagate its impact. <br />
* In the backward pass, gradients from an output unit are propagated across all the paths, and therefore the central pixels, which lie on many more paths, accumulate a much larger gradient magnitude from that output.<br />
* Not all pixels in a receptive field contribute equally to an output unit's response.<br />
<br />
The authors prove that in many cases the distribution of impact within a receptive field is distributed as a Gaussian. Since Gaussian distributions generally decay quickly from the center, the effective receptive field only occupies a fraction of the theoretical receptive field.<br />
<br />
The authors have correlated the theory of the effective receptive field with some empirical observations. One such observation is that random initialization leads some deep CNNs to start with a small effective receptive field, which then grows during training; this indicates a bad initialization bias.<br />
<br />
= Theoretical Results =<br />
<br />
The authors wanted to mathematically characterize how much each input pixel in a receptive field can impact<br />
the output of a unit $n$ layers up the network, in particular as $n \rightarrow \infty$. More specifically, assume that pixels on each layer are indexed by $(i,j)$ with their centre at $(0,0)$. If we denote the pixel on the $p$th layer as $x_{i,j}^p$, with $x_{i,j}^0$ as the input to the network and $y_{i,j}=x_{i,j}^n$ as the output on the $n$th layer, we want to know how much each $x_{i,j}^0$ contributes to $y_{0,0}$. The effective receptive field (ERF) of this central output unit can be defined as the region containing input pixels with a non-negligible impact on it. <br />
<br />
They used the partial derivative $\frac{\partial y_{0,0}}{\partial x_{i,j}^0}$ as the measure of such impact, which can be computed using backpropagation. For an arbitrary loss $l$, the chain rule gives $\frac{\partial l}{\partial x_{i,j}^0} = \sum_{i',j'}\frac{\partial l}{\partial y_{i',j'}}\frac{\partial y_{i',j'}}{\partial x_{i,j}^0}$. Now if $\frac{\partial l}{\partial y_{0,0}} =1$ and $\frac{\partial l}{\partial y_{i,j}}=0$ for all $(i,j) \neq (0,0)$, then $\frac{\partial l}{\partial x_{i,j}^0} =\frac{\partial y_{0,0}}{\partial x_{i,j}^0}$.<br />
<br />
For networks without nonlinearity (i.e., linear networks), this measure is independent of the input and depends only on the weights of the network and (i, j), which clearly shows how the impact of the pixels in the receptive field distributes.<br />
<br />
===Simplest case: Stack of convolutional layers of weights equal to 1===<br />
<br />
The authors first considered the case of $n$ convolutional layers using $k \times k$ kernels of stride 1, a single channel on each layer, and no nonlinearity or bias. <br />
<br />
<br />
For this special sub-case, the kernel was a $k \times k$ matrix of 1's. Since this kernel is separable to $k \times 1$ and $1 \times k$ matrices, the $2D$ convolution could be replaced by two $1D$ convolutions. This allowed the authors to focus their analysis on the $1D$ convolutions.<br />
<br />
For this case, if we denote the gradient signal $\frac{\partial l}{\partial y_{i,j}}$ by $u(t)$ and the kernel by $v(t)$, we have<br />
<br />
\begin{equation*}<br />
u(t)=\delta(t),\\ \quad v(t) = \sum_{m=0}^{k-1} \delta(t-m), \quad \text{where} \begin{cases} \delta(t)= 1\ \text{if}\ t=0, \\ \delta(t)= 0\ \text{if}\ t\neq 0, \end{cases}<br />
\end{equation*}<br />
and $t =0,1,-1,2,-2,...$ indexes the pixels.<br />
<br />
The gradient signal $o(t)$ on the input pixels can now be computed by convolving $u(t)$ with $n$ such $v(t)$'s so that $o(t) = u *v* ...*v$. <br />
<br />
Since convolution in time domain is equivalent to multiplication in Fourier domain, we can write<br />
<br />
\begin{equation*}<br />
U(w) = \sum_{t=-\infty}^{\infty} u(t) e^{-jwt}=1,\\<br />
V(w) = \sum_{t=-\infty}^{\infty} v(t) e^{-jwt}=\sum_{m=0}^{k-1} e^{-jwm},\\<br />
O(w) = F(o(t))=F(u(t)*v(t)*...*v(t)) = U(w).V(w)^n = \Big ( \sum_{m=0}^{k-1} e^{-jwm} \Big )^n,<br />
\end{equation*}<br />
<br />
where $O(w)$, $U(w)$, and $V(w)$ are discrete Fourier transformations of $o(t)$, $u(t)$, and $v(t)$.<br />
Next, we need to apply the inverse Fourier transform:<br />
\begin{equation*}<br />
o(t) = \frac{1}{2\pi} \int_{-\pi}^{\pi} (\sum_{m=0}^{k-1}e^{-j\omega m})^n e^{j \omega t} \ d\omega<br />
\end{equation*}<br />
<br />
<br />
Now let us consider two non-trivial cases.<br />
<br />
'''Case K=2:''' In this case $( \sum_{m=0}^{k-1} e^{-jwm} )^n = (1 + e^{-jw})^n$. Because $O(w)= \sum_{t=-\infty}^{\infty} o(t) e^{-jwt}= (1 + e^{-jw})^n$, we can think of $o(t)$ as the coefficients of $e^{-jwt}$. Therefore, $o(t)= <br />
\begin{pmatrix} n\\t\end{pmatrix}$ is the standard binomial coefficient. As $n$ becomes large, the binomial coefficients distribute with respect to $t$ like a Gaussian. More specifically, when $n \to \infty$ we can write<br />
<br />
<br />
\begin{equation*}<br />
\begin{pmatrix} n\\t \end{pmatrix} \sim \frac{2^n}{\sqrt{\frac{n\pi}{2}}}e^{-d^{2}/2n}, <br />
\end{equation*}<br />
<br />
where $d = n-2t$ (see [https://en.wikipedia.org/wiki/Binomial_coefficient Binomial coefficient]).<br />
<br />
'''Case K>2:''' In this case the coefficients are known as "extended binomial coefficients" or "polynomial<br />
coefficients", and they too distribute like Gaussian [5].<br />
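Both cases can be checked numerically: repeatedly convolving the delta gradient $u(t)$ with the length-$k$ kernel of ones reproduces the (extended) binomial coefficients described above:<br />
<br />
```python
import math
import numpy as np

def gradient_profile(n, k):
    """o(t) = u * v * ... * v: the delta gradient convolved n times with a
    length-k kernel of ones."""
    o = np.array([1.0])
    for _ in range(n):
        o = np.convolve(o, np.ones(k))
    return o

# k = 2: the profile is exactly the binomial coefficients C(n, t)
n = 8
o = gradient_profile(n, k=2)
assert all(o[t] == math.comb(n, t) for t in range(n + 1))

# k = 3: extended binomial ("trinomial") coefficients, still bell-shaped
o3 = gradient_profile(4, k=3)
assert o3.sum() == 3 ** 4 and o3.argmax() == len(o3) // 2
```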
<br />
=== Random Weights===<br />
Denote $g(i, j, p) = \frac{\partial l}{\partial x_{i,j}^p}$ as the gradient on the $p$th layer, and $g(i, j, n) = \frac{\partial l}{\partial y_{i,j}}$ . Then $g(, , 0)$ is the desired gradient image of the input. The backpropagation convolves $g(, , p)$ with the $k \times k$ kernel to get $g(, , p-1)$ for each p. So we can write<br />
<br />
\begin{equation*}<br />
g(i,j,p-1) = \sum_{a=0}^{k-1} \sum_{b=0}^{k-1} w_{a,b}^p g(i+a,i+b,p),<br />
\end{equation*}<br />
<br />
where $w_{a,b}^p$ is the convolution weight at $(a, b)$ in the convolution kernel on layer p. In this case, the initial weights are independently drawn from a fixed distribution with zero mean and variance $C$. By assuming that the gradients g are independent of the<br />
weights (linear networks only) and given that $\mathbb{E}_w[w_{a,b}^p] =0$<br />
<br />
\begin{equation*}<br />
\mathbb{E}_{w,input}[g(i,j,p-1)] = \sum_{a=0}^{k-1} \sum_{b=0}^{k-1} \mathbb{E}_w[w_{a,b}^p] \mathbb{E}_{input}[g(i+a,i+b,p)]=0,\\<br />
Var[g(i,j,p-1)] = \sum_{a=0}^{k-1} \sum_{b=0}^{k-1} Var[w_{a,b}^p] Var[g(i+a,i+b,p)]= C\sum_{a=0}^{k-1} \sum_{b=0}^{k-1} Var[g(i+a,i+b,p)].<br />
\end{equation*}<br />
<br />
Therefore, to get $Var[g(, , p-1)]$ we can convolve the gradient variance image $Var[g(, , p)]$ with a $k \times k$ kernel of 1’s, and then multiply it by $C$. Comparing this to the simplest case of all weights equal to one, we can see that the $g(, , 0)$ has a Gaussian shape, with only a slight<br />
change of having an extra $C^n$ constant factor multiplier on the variance gradient images, which does not affect the relative distribution within a receptive field.<br />
<br />
=== Non-uniform Kernels ===<br />
In the case of non-uniform weighting, when w(m)'s are normalized, we can simply use characteristic function to prove the Central Limit Theorem in this case. For $S_n = \sum_{i=1}^n$ $X_i$ and $X_i$’s are i.i.d. <br />
<br />
As n → ∞, the distribution of $\sqrt{n}(\frac{1}{n}S_n - E[X])$ converges to Gaussian $N(0,Var[X])$ in distribution. <br />
<br />
multinomial variables distributed according to $w(m)$’s, i.e. $p(X_i = m) = w(m)$, we have:<br />
<br />
\begin{equation*}<br />
E[S_n] = n\sum_{m=0}^{k-1} mw(m),\\<br />
Var[S_n] = n \left (\sum_{m=0}^{k-1} m^2w(m) - \left (\sum_{m=0}^{k-1} mw(m) \right )^2 \right ),<br />
\end{equation*}<br />
<br />
If we take one standard deviation as the effective receptive field (ERF) size which is roughly the radius of the ERF, then this size is<br />
$\sqrt{Var[S_n]} = \sqrt{nVar[X_i]} = \mathcal{O}(\sqrt{n})$.<br />
<br />
On the other hand, stacking more convolutional layers implies that the theoretical receptive field grows linearly, therefore relative to the theoretical receptive field, the ERF actually shrinks at a rate of $\mathcal{O}(1/\sqrt{n})$.<br />
<br />
With uniform weighting, we can see that ERF size grows linearly as a function of the kernel size $k$. Using $w(m) = \frac{1}{k}$<br />
<br />
\begin{equation*}<br />
\sqrt{Var[S_n]} = \sqrt{n}\sqrt{\sum_{m=0}^{k-1}\frac{m^2}{k} - \bigg(\sum_{m=0}^{k-1}\frac{m}{k}\bigg)^2} = \sqrt{\frac{n(k^2-1)}{12}} = \mathcal{O}(k\sqrt{n})<br />
\end{equation*}<br />
<br />
=== Non-linear Activation Functions===<br />
<br />
The math in this section is a bit "hand-wavy", as one of their reviewers wrote, and their conclusion (Gaussian-shape ERF) is not really well backed up by their experiments. The most important point take away here is that by the introduction of a nonlinear activation function, the gradients depends on the network's input as well.<br />
<br />
=== Dropout ===<br />
Dropout is a technique that sets each unit in a neural network randomly to zero during training,<br />
which has found great success as a regularizer to prevent deep networks from<br />
over-fitting. The authors show that dropout does not change the Gaussian ERF<br />
shape.<br />
<br />
=== Subsampling and Dilated Convolutions ===<br />
Subsampling reduces the resolution of the convolutional feature maps, and makes<br />
each of the following convolutional layers operate on a larger scale. It is<br />
therefore a great way to increase the receptive field. Subsampling followed by<br />
convolutional layers can be equivalently implemented as changing all the<br />
convolutional layers after subsampling from dense convolutions to dilated<br />
convolutions. Thus we can apply the same theory we developed above to understand<br />
networks with subsampling layers. However, with exponentially growing receptive<br />
field introduced by the subsampling or exponentially dilated convolutions, many<br />
more layers are needed to see the Gaussian shape clearly.<br />
<br />
=== Skip Connections ===<br />
Skip connections are another popular architectural design for deep<br />
neural networks. Recent state-of-the-art models for image<br />
recognition, in particular the Residual Networks (ResNets), make extensive use of<br />
skip connections. The ResNet architecture is composed of residual blocks; each<br />
residual block has two pathways. One is a path of q (usually 2) convolutional<br />
layers plus nonlinearity and batch normalization, and the other is a<br />
skip connection that goes directly from the input to the output. The output is<br />
simply the sum of the results of the two pathways. The authors do not derive an explicit<br />
expression for the ERF size with skip connections, but it is smaller than the<br />
largest possible receptive field, which is achieved when the pathway that goes<br />
through the convolutional layers is chosen in every residual block.<br />
<br />
=== Remarks ===<br />
The authors point out three critical assumptions in the analyses above: (1) all layers in the CNN use the same set of convolution weights. This is in general not true; however, when we apply the analysis of variance, the weight variances on all layers are usually the same up to a constant factor. (2) The convergence derived is convergence "in distribution", as implied by the central limit theorem. It is therefore neither almost-sure convergence nor convergence in probability; in other words, we cannot guarantee convergence for any single model. (3) Although the CLT gives the limit distribution of $\frac{1}{\sqrt{n}} S_n$, the distribution of $S_n$ itself does not have a limit, and its deviation from the corresponding normal distribution can be large on some finite sets, but it is still Gaussian in overall shape.<br />
<br />
= Verifying Theoretical Results =<br />
In all of the following experiments, a gradient signal of 1 was placed at the center of the output plane and 0 everywhere else, and then this gradient was backpropagated through the network to get input gradients. Also, random inputs, as well as proper random initialization of the kernels, were employed.<br />
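A minimal numpy reconstruction of this procedure for the linear case (the kernel scale and helper names are my choices, not the authors'): for a linear network, backpropagating the single central gradient amounts to repeated 'full' convolution of a delta with the flipped kernels.<br />

```python
import numpy as np

# Sketch of the experimental procedure (my reconstruction, linear case only):
# place a gradient of 1 at the single center output unit and backpropagate it.
rng = np.random.default_rng(0)

def conv2d_full(a, b):
    """'Full' 2D convolution of a with kernel b, via shifted accumulation."""
    H, W = a.shape
    h, w = b.shape
    out = np.zeros((H + h - 1, W + w - 1))
    for i in range(h):
        for j in range(w):
            out[i:i + H, j:j + W] += b[i, j] * a
    return out

def input_gradient(n_layers, k=3):
    g = np.array([[1.0]])                        # gradient at the center output
    for _ in range(n_layers):
        w = rng.normal(0, 1.0 / k, size=(k, k))  # random zero-mean kernel
        g = conv2d_full(g, w[::-1, ::-1])        # one backprop step
    return g

g = input_gradient(10)
print(g.shape)   # (21, 21): the theoretical RF of ten 3x3 layers
```

Averaging the squared gradient images over many such runs reproduces the (near-)Gaussian impact maps discussed below.<br />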
<br />
<br />
'''ERFs are Gaussian distributed:''' By looking at the figure, [[File:understanding_ERF_fig1.png|thumbnail||600px]] we can observe Gaussian shapes for uniformly and randomly weighted convolution kernels without nonlinear activations, and near-Gaussian shapes for randomly weighted kernels with nonlinearity. Adding the ReLU nonlinearity makes the distribution a bit less Gaussian, as the ERF distribution then depends on the input as well. Another reason is that ReLU units output exactly zero for half of their inputs, and it is quite easy to get a zero output for the center pixel of the output plane, in which case no path from the receptive field can reach the output and the gradient is all zero. Here the ERFs are averaged over 20 runs with different random seeds. <br />
<br />
<br />
<br />
<br />
<br />
Figures below show the ERF for networks with 20 layers of random weights, with different nonlinearities. Here the results are averaged both across 100 runs with different random weights as well as different random inputs. In this setting, the receptive fields are a lot more Gaussian-like. <br />
<br />
[[File:understanding_ERF_fig2.png|thumbnail|centre|400px]]<br />
<br />
<br />
''' <math>\sqrt{n}</math> absolute growth and <math>1/\sqrt{n}</math> relative shrinkage:''' The figure [[File:understanding_ERF_fig4.png|thumbnail||600px]] shows the change in ERF size and the relative ratio of ERF to theoretical RF with respect to the number of convolution layers. The fitted line for ERF size has a slope of 0.56 in the log domain, while the line for the ERF ratio has a slope of -0.43. This indicates that the ERF size grows linearly with <math>\sqrt{n}</math> and the ERF ratio shrinks linearly with <math>1/\sqrt{n}</math>.<br />
They used 2 standard deviations as the measurement for ERF size, i.e. any pixel with a value greater than (1 - 95.45%) = 4.55% of the center pixel's value is considered to be within the ERF. The ERF size is taken as the square root of the number of pixels within the ERF, while the theoretical RF size is the side length of the square in which every pixel has a non-zero impact on the output pixel, no matter how small. All experiments here are averaged over 20 runs.<br />
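The 2-standard-deviation measurement can be reproduced for the uniform-kernel case with a short numpy sketch (mine, not the authors' code; the layer counts are illustrative):<br />

```python
import numpy as np

# Sketch (my reconstruction): measure ERF size with the paper's 2-sigma rule,
# i.e. count pixels whose impact exceeds (1 - 95.45%) of the center value,
# and check that sqrt(#pixels) grows roughly like sqrt(n).
def erf_size(n, k=3):
    o = np.array([1.0])
    for _ in range(n):
        o = np.convolve(o, np.ones(k) / k)     # uniform kernels, 1D profile
    grid = np.outer(o, o)                      # separable -> 2D impact map
    center = grid[len(o) // 2, len(o) // 2]
    inside = grid > (1 - 0.9545) * center      # 2-sigma threshold
    return np.sqrt(inside.sum())

ns = np.array([5, 10, 20, 40, 80])
sizes = np.array([erf_size(int(n)) for n in ns])
slope = np.polyfit(np.log(ns), np.log(sizes), 1)[0]
print(round(slope, 2))        # slope close to 0.5, i.e. ERF ~ sqrt(n)
```

The fitted log-log slope lands near 0.5 for this idealized setting, consistent with the <math>\sqrt{n}</math> growth reported above.<br />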
<br />
<br />
'''Subsampling & dilated convolution increase the receptive field:''' The figure shows the effect of subsampling and dilated convolution. The reference baseline is a CNN with 15 dense convolution layers; its ERF is shown in the left-most figure. Replacing 3 of the 15 convolutional layers with stride-2 convolutions results in the ERF shown in the ‘Subsample’ figure. Finally, replacing those 3 convolutional layers with dilated convolutions with factors 2, 4, and 8 gives the ‘Dilation’ figure. Both modifications increase the effective receptive field significantly. Note that the ‘Dilation’ figure shows the rectangular ERF shape typical of dilated convolutions (the authors do not explain why).<br />
<br />
[[File:understanding_ERF_fig3.png|thumbnail|centre|400px]]<br />
<br />
== How the ERF evolves during training ==<br />
<br />
The authors looked at how the ERF of units in the top-most convolutional layers of a classification CNN and a semantic segmentation CNN evolves during training. For both tasks, they adopted the ResNet architecture, which makes extensive use of skip-connections. As expected, their analysis showed that the ERF of these networks is significantly smaller than the theoretical receptive field. Also, as the networks learn, the ERF gets bigger, so that the ERF at the end of training is significantly larger than the initial ERF. <br />
<br />
The classification network was a ResNet with 17 residual blocks trained on the CIFAR-10 dataset. The figure shows the ERF on the 32x32 image space at the beginning of training (with randomly initialized weights) and at the end of training, when the network reaches its best validation accuracy. Note that the theoretical receptive field of the network is actually 74x74, bigger than the image size, yet the ERF does not fill the image completely. Comparing the results before and after training demonstrates that the ERF has grown significantly.<br />
<br />
[[File:understanding_ERF_fig5.png|thumbnail|centre|500px]]<br />
<br />
The semantic segmentation network was trained on the CamVid dataset for urban scene segmentation. The 'front-end' of the model was a purely convolutional network that predicted the output at a slightly lower resolution. On top of it, a ResNet with 16 residual blocks interleaved with 4 subsampling operations, each with a factor of 2, was implemented. Due to the subsampling operations, the output was 1/16 of the input size. For this model, the theoretical RF of the top convolutional layer units was 505x505. However, as the figure shows, the ERF covers only a fraction of that, with a diameter of about 100 at the beginning of training, growing to a diameter of around 150 by the end of training.<br />
<br />
= Reduce Gaussian Damage =<br />
The Effective Receptive Field (ERF) usually decays quickly from the centre (like 2D Gaussian) and only takes a small portion of the theoretical Receptive Field (RF). This "Gaussian damage" is undesirable for tasks that require a large RF and to reduce it, the authors suggested two solutions:<br />
#'''New initialization scheme''' that makes the weights at the center of the convolution kernel smaller and the weights on the outside larger, which diffuses the concentration at the center out to the periphery. One way to implement this is to initialize the network with any initialization method, and then scale the weights according to a distribution that has a lower scale at the center and a higher scale on the outside. They tested this solution on the CIFAR-10 classification task with several random seeds. In a few cases they obtained a 30% speed-up in training compared to more standard initializations, but overall the benefit of this method is not always significant. It is only a partial solution because, no matter what is done to the initial weights, the ERF maintains a Gaussian distribution. Weight initialization matters for deep models in general: initial weights that are too large may produce exploding values during forward or backward propagation. One popular initialization method is Xavier initialization, proposed by Glorot and Bengio [6]. <br />
<br />
#'''Architectural changes of CNNs''' is the 'better' approach that may change the ERF in more fundamental ways. For example, instead of connecting each unit in a CNN to a local rectangular convolution window, we can sparsely connect each unit to a larger area in the lower layer using the same number of connections. Dilated convolution belongs to this category, but we may push even further and use sparse connections that are not grid-like.<br />
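A minimal numpy sketch of the initialization idea in the first point (the exact radial scaling profile below is my own invention; the paper does not specify one):<br />

```python
import numpy as np

# Illustrative sketch (details are mine; the paper gives no exact recipe):
# rescale an already-initialized kernel so the center taps are down-weighted
# and the periphery up-weighted, then restore the original overall variance.
rng = np.random.default_rng(0)

def periphery_scaled_init(k=5):
    fan_in = k * k                                    # single input channel
    target_var = 2.0 / fan_in                         # e.g. He initialization
    w = rng.normal(0.0, np.sqrt(target_var), size=(k, k))
    yy, xx = np.mgrid[:k, :k] - (k - 1) / 2.0
    scale = 0.5 + np.sqrt(xx**2 + yy**2) / (k / 2.0)  # low center, high edges
    w = w * scale
    return w * np.sqrt(target_var / w.var())          # variance unchanged overall

w = periphery_scaled_init()
print(round(float(w.var()), 4))   # 0.08 = 2/25, the He-init variance for k=5
```

The final rescaling keeps the layer's overall weight variance at its original value, so only the spatial distribution of the weights changes.<br />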
<br />
= Discussion =<br />
<br />
'''Connection to biological neural networks''': The authors established through their analysis that the ERF grows a lot slower than was previously thought, which indicates that a lot of local information is still preserved even after many convolutional layers. This also contradicts some long-held notions about receptive fields in deep biological networks. Another relevant observation from their analysis is that convolutional networks may automatically create a form of foveal representation.<br />
<br />
'''Connection to previous work on CNNs''': Though receptive fields in CNNs have not been studied extensively, some previous works explored how the variance changes little as signals pass through<br />
the network, and a good initialization scheme for convolutional layers was developed based on this observation. Researchers have also used visualization to show the importance of natural-image priors and what an activation of a convolutional layer represents. Deconvolutional nets have been used to show the relation between pixels in the image and the neurons that are firing. <br />
<br />
= Summary & Conclusion =<br />
The authors showed, theoretically and experimentally, that the distribution of impact within the receptive field (the effective receptive field) is asymptotically Gaussian, and the ERF only takes up a fraction of the full theoretical receptive field. They also studied the effects of some standard CNN approaches on the effective receptive field. They found that dropout does not change the Gaussian ERF shape. Subsampling and dilated convolutions are effective ways to increase receptive field size quickly but skip-connections make ERFs smaller.<br />
<br />
They argued that since larger ERFs are required for higher performance, new methods to achieve larger ERF will not only help the network to train faster but may also improve performance.<br />
<br />
= Critique = <br />
<br />
The authors' finding on $\sqrt{n}$ absolute growth of Effective Receptive Field (ERF) suffers from a discrepancy in ERF definition between their theoretical analysis and their experiments. Namely, in the theoretical analysis for the non-uniform-kernel case, they considered one standard deviation as the ERF size. However, they used two standard deviations as the measure for ERF size in the experiments.<br />
<br />
It would be more practical if the paper also investigated the ERF for natural images (as opposed to random) as network input at least in the two cases where they examined trained networks. <br />
<br />
The authors claim that the ERF results in the experimental section have Gaussian shapes, but they never prove this claim. For example, they could fit different 2D functions, including a 2D Gaussian, to the kernels and show that the 2D Gaussian gives the best fit. Furthermore, the pictures given as proof of the claim that the ERF has a Gaussian distribution only show the ERF of the center pixel of the output, <math> y_{0,0} </math>. Intuitively, the ERF of a node near the boundary of the output layer may have a significantly different shape. This was not addressed in the paper.<br />
<br />
Another weakness is in the discussion section, where the authors make a connection to biological networks and rush to contradict a well-observed phenomenon in the brain. The fact that neurons in the higher areas of the visual hierarchy gradually lose their retinotopic property has been shown in countless neuroscience studies. For example, [https://en.wikipedia.org/wiki/Grandmother_cell grandmother cells] do not care about the position of the grandmother's face in the visual field. In general, the similarity between deep CNNs and biological visual systems is not that strong, so we should take any generalization from CNNs to biological networks with a grain of salt.<br />
<br />
Spectrograms are visual representations of audio in which the axes represent time and frequency, with intensity encoding the amplitude at each frequency. The ERF of a CNN applied to a spectrogram does not necessarily decay from the center like a Gaussian. In fact, many receptive fields are trained to look for peaks, troughs, and cliffs, which implies that the ERF may place more weight towards the outside rather than the center.<br />
<br />
The paper discusses what the ERF represents and how it can be increased, but not how the ERF can be used to improve model accuracy by changing the configuration of the network, such as its depth or kernel size. In addition, since the ERF is an important component of region-based CNNs and can provide useful information for object detection, it would have been better if the authors had analyzed how different ERF properties influence the mAP in object detection.<br />
<br />
= References =<br />
[1] Wenjie Luo, Yujia Li, Raquel Urtasun, and Richard Zemel. "Understanding the effective receptive field in deep convolutional neural networks." In Advances in Neural Information Processing Systems, pp. 4898-4906. 2016.<br />
<br />
[2] Buessler, J.-L., Smagghe, P., & Urban, J.-P. (2014). Image receptive fields for artificial neural networks. Neurocomputing, 144(Supplement C), 258–270. https://doi.org/10.1016/j.neucom.2014.04.045<br />
<br />
[3] Dilated Convolutions in Neural Network - [http://www.erogol.com/dilated-convolution/]<br />
<br />
[4] http://cs231n.github.io/convolutional-networks/<br />
<br />
[5] Thorsten Neuschel. "A note on extended binomial coefficients." Journal of Integer Sequences, 17(2):3, 2014.<br />
<br />
[6] Glorot, Xavier, and Yoshua Bengio. "Understanding the difficulty of training deep feedforward neural networks." Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. 2010.</div>Jdenghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Understanding_the_Effective_Receptive_Field_in_Deep_Convolutional_Neural_Networks&diff=29669Understanding the Effective Receptive Field in Deep Convolutional Neural Networks2017-11-07T19:08:48Z<p>Jdeng: /* Reduce Gaussian Damage */</p>
<hr />
<div>= Introduction =<br />
== What is the Receptive Field (RF) of a unit? ==<br />
[[File:understanding_ERF_fig0.png|thumbnail|450px]]<br />
The receptive field of a unit is the region of the input that is seen and responded to by the unit. When dealing with high-dimensional inputs such as images, it is impractical to connect each neuron to all neurons in the previous volume. Instead, we connect each neuron to only a local region of the input volume. The spatial extent of this connectivity is a hyper-parameter called the receptive field of the neuron (equivalently, the filter size), while the extent of the connectivity along the depth axis always equals the depth of the input [4]. For instance, take an RGB CIFAR-10 image, which has an input size of 32x32x3 (height, width, channels), and a receptive field (filter size) of 5x5; then each neuron in the convolutional layer will have weights connecting it to a 5x5x3 region of the input image, giving a total of 5*5*3 = 75 weights. Note that the extent of the connectivity along the depth axis is 3 because the depth of the input is 3 (i.e. the channels).<br />
<br />
An effective introduction to Receptive field arithmetic, including ways to calculate the receptive field of CNNs can be found [https://syncedreview.com/2017/05/11/a-guide-to-receptive-field-arithmetic-for-convolutional-neural-networks/ here]<br />
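To make the arithmetic concrete, here is a small Python sketch (my own, following the recurrences in the linked guide; the function name is illustrative) that computes the theoretical RF of a layer stack from kernel sizes and strides:<br />

```python
# Sketch (not from the paper): theoretical receptive-field arithmetic for a
# stack of conv/pooling layers, following the linked guide's recurrences.
# r: receptive field size, j: cumulative stride ("jump") between features.
def theoretical_rf(layers):
    """layers: list of (kernel_size, stride) tuples, input-to-output order."""
    r, j = 1, 1
    for k, s in layers:
        r = r + (k - 1) * j   # each layer adds (k - 1) * current jump
        j = j * s             # strides compound multiplicatively
    return r

# Example: three 3x3 stride-1 convs -> RF of 7 (grows by k-1 = 2 per layer).
print(theoretical_rf([(3, 1)] * 3))              # 7
# A stride-2 layer makes every later layer count double.
print(theoretical_rf([(3, 1), (2, 2), (3, 1)]))  # 3 + 1 + 2*2 = 8
```

This is the "theoretical" RF that the paper contrasts with the much smaller effective RF.<br />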
<br />
== Why is RF important? ==<br />
The concept of receptive field is important for understanding and diagnosing how deep convolutional neural networks (CNNs) work. Unlike in fully connected networks, where the value of each unit depends on the<br />
entire input to the network, in CNNs anything in an input image outside the receptive field of a unit does not affect the value of that unit. Hence, it is necessary to carefully control the receptive field to ensure that it covers the entire relevant image region. The receptive field allows the response to be most sensitive to a local region in the image and to specific stimuli; similar stimuli trigger activations of similar magnitudes [2]. The initialization of each receptive field depends on the neuron's degrees of freedom [2]. One example outlined in that paper is that "the weights can be either of the same sign or centered with zero mean. This latter case favors a response to the contrast between the central and peripheral region of the receptive field." [2]. In many tasks, especially dense prediction tasks like semantic image segmentation, stereo, and optical flow estimation, where we make a prediction for every single pixel in the input image, it is critical for each output pixel to have a big receptive field, such that no important information is left out when making the prediction.<br />
<br />
== How to increase RF size? ==<br />
''' Make the network deeper''' by stacking more layers, which in theory increases the receptive field size linearly, as<br />
each extra layer increases the receptive field size by the kernel size minus one.<br />
<br />
'''Add sub-sampling layers''' to increase the receptive field size multiplicatively. In fact, sub-sampling is simply average pooling with learnable weights per feature map; it acts like low-pass filtering followed by downsampling.<br />
<br />
Modern deep CNN architectures like the VGG networks and Residual Networks use a combination of these techniques.<br />
<br />
== Intuition behind Effective Receptive Fields ==<br />
The pixels at the center of an RF have a much larger impact on an output:<br />
* In the forward pass, central pixels can propagate information to the output through many different paths, while the pixels in the outer area of the receptive field have very few paths to propagate its impact. <br />
* In the backward pass, gradients from an output unit are propagated across all the paths, and therefore the central pixels have a much larger magnitude for the gradient from that output [More paths always mean larger gradient?].<br />
* Not all pixels in a receptive field contribute equally to an output unit's response.<br />
<br />
The authors prove that in many cases the distribution of impact in a receptive field distributes as a Gaussian. Since Gaussian distributions generally decay quickly from the center, the effective receptive field, only occupies a fraction of the theoretical receptive field.<br />
<br />
The authors have correlated the theory of effective receptive fields with some empirical observations. One such observation is that random initializations lead some deep CNNs to start with a small effective receptive field, which then grows during training; this indicates an undesirable initialization bias.<br />
<br />
= Theoretical Results =<br />
<br />
The authors wanted to mathematically characterize how much each input pixel in a receptive field can impact<br />
the output of a unit $n$ layers up the network, particularly as $n \rightarrow \infty$. More specifically, assume that pixels on each layer are indexed by $(i,j)$, with the center pixel at $(0,0)$. If we denote the pixel on the $p$th layer as $x_{i,j}^p$, with $x_{i,j}^0$ as the input to the network and $y_{i,j}=x_{i,j}^n$ as the output on the $n$th layer, we want to know how much each $x_{i,j}^0$ contributes to $y_{0,0}$. The effective receptive field (ERF) of this central output unit can be defined as the region containing input pixels with a non-negligible impact on it. <br />
<br />
They used the partial derivative $\frac{\partial y_{0,0}}{\partial x_{i,j}^0}$ as the measure of such impact, which can be computed using backpropagation. Assuming $l$ as an arbitrary loss by the chain rule we can write $\frac{\partial l}{\partial x_{i,j}^0} = \sum_{i',j'}\frac{\partial l}{\partial y_{i',j'}}\frac{\partial y_{i',j'}}{\partial x_{i,j}^0}$. Now if $\frac{\partial l}{\partial y_{0,0}} =1$ and $\frac{\partial l}{\partial y_{i,j}}=0$ for all $i \neq 0$ and $j \neq 0$, then $\frac{\partial l}{\partial x_{i,j}^0} =\frac{\partial y_{0,0}}{\partial x_{i,j}^0}$.<br />
<br />
For networks without nonlinearity (i.e., linear networks), this measure is independent of the input and depends only on the weights of the network and (i, j), which clearly shows how the impact of the pixels in the receptive field distributes.<br />
<br />
===Simplest case: Stack of convolutional layers of weights equal to 1===<br />
<br />
The authors first considered the case of $n$ convolutional layers using $k \times k$ kernels of stride 1, a single channel on each layer, and no nonlinearity or bias. <br />
<br />
<br />
For this special sub-case, the kernel was a $k \times k$ matrix of 1's. Since this kernel is separable to $k \times 1$ and $1 \times k$ matrices, the $2D$ convolution could be replaced by two $1D$ convolutions. This allowed the authors to focus their analysis on the $1D$ convolutions.<br />
<br />
For this case, if we denote the gradient signal $\frac{\partial l}{\partial y_{i,j}}$ by $u(t)$ and the kernel by $v(t)$, we have<br />
<br />
\begin{equation*}<br />
u(t)=\delta(t),\\ \quad v(t) = \sum_{m=0}^{k-1} \delta(t-m), \quad \text{where} \begin{cases} \delta(t)= 1\ \text{if}\ t=0, \\ \delta(t)= 0\ \text{if}\ t\neq 0, \end{cases}<br />
\end{equation*}<br />
and $t =0,1,-1,2,-2,...$ indexes the pixels.<br />
<br />
The gradient signal $o(t)$ on the input pixels can now be computed by convolving $u(t)$ with $n$ such $v(t)$'s so that $o(t) = u *v* ...*v$. <br />
<br />
Since convolution in time domain is equivalent to multiplication in Fourier domain, we can write<br />
<br />
\begin{equation*}<br />
U(w) = \sum_{t=-\infty}^{\infty} u(t) e^{-jwt}=1,\\<br />
V(w) = \sum_{t=-\infty}^{\infty} v(t) e^{-jwt}=\sum_{m=0}^{k-1} e^{-jwm},\\<br />
O(w) = F(o(t))=F(u(t)*v(t)*...*v(t)) = U(w).V(w)^n = \Big ( \sum_{m=0}^{k-1} e^{-jwm} \Big )^n,<br />
\end{equation*}<br />
<br />
where $O(w)$, $U(w)$, and $V(w)$ are discrete Fourier transformations of $o(t)$, $u(t)$, and $v(t)$.<br />
Next, we need to apply the inverse Fourier transform:<br />
\begin{equation*}<br />
o(t) = \frac{1}{2\pi} \int_{-\pi}^{\pi} (\sum_{m=0}^{k-1}e^{-j\omega m})^n e^{j \omega t} \ d\omega<br />
\end{equation*}<br />
<br />
<br />
Now let us consider two non-trivial cases.<br />
<br />
'''Case K=2:''' In this case $( \sum_{m=0}^{k-1} e^{-jwm} )^n = (1 + e^{-jw})^n$. Because $O(w)= \sum_{t=-\infty}^{\infty} o(t) e^{-jwt}= (1 + e^{-jw})^n$, we can read off $o(t)$ as the coefficient of $e^{-jwt}$. Therefore, $o(t)= <br />
\begin{pmatrix} n\\t\end{pmatrix}$ is the standard binomial coefficient. As $n$ becomes large, the binomial coefficients distribute with respect to $t$ like a Gaussian. More specifically, as $n \to \infty$ we can write<br />
<br />
<br />
\begin{equation*}<br />
\begin{pmatrix} n\\t \end{pmatrix} \sim \frac{2^n}{\sqrt{\frac{n\pi}{2}}}e^{-d^{2}/2n}, <br />
\end{equation*}<br />
<br />
where $d = n-2t$ (see [https://en.wikipedia.org/wiki/Binomial_coefficient Binomial coefficient]).<br />
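This asymptotic is easy to check numerically; the following short script (mine) compares the exact binomial coefficient against the Gaussian approximation above:<br />

```python
import math

# Quick numerical check (mine) of the asymptotic above: compare C(n, t) with
# 2^n / sqrt(n*pi/2) * exp(-d^2 / (2n)), where d = n - 2t.
def gaussian_approx(n, t):
    d = n - 2 * t
    return 2.0**n / math.sqrt(n * math.pi / 2.0) * math.exp(-d * d / (2.0 * n))

n, t = 100, 55
exact = math.comb(n, t)       # exact binomial coefficient (Python 3.8+)
approx = gaussian_approx(n, t)
print(approx / exact)         # ratio close to 1 for large n
```

The ratio approaches 1 as $n$ grows, which is all the derivation above requires.<br />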
<br />
'''Case K>2:''' In this case the coefficients are known as "extended binomial coefficients" or "polynomial<br />
coefficients", and they too distribute like a Gaussian [5].<br />
<br />
=== Random Weights===<br />
Denote $g(i, j, p) = \frac{\partial l}{\partial x_{i,j}^p}$ as the gradient on the $p$th layer, and $g(i, j, n) = \frac{\partial l}{\partial y_{i,j}}$. Then $g(\cdot, \cdot, 0)$ is the desired gradient image of the input. Backpropagation convolves $g(\cdot, \cdot, p)$ with the $k \times k$ kernel to get $g(\cdot, \cdot, p-1)$ for each $p$, so we can write<br />
<br />
\begin{equation*}<br />
g(i,j,p-1) = \sum_{a=0}^{k-1} \sum_{b=0}^{k-1} w_{a,b}^p g(i+a,j+b,p),<br />
\end{equation*}<br />
<br />
where $w_{a,b}^p$ is the convolution weight at $(a, b)$ in the convolution kernel on layer $p$. In this case, the initial weights are independently drawn from a fixed distribution with zero mean and variance $C$. Assuming that the gradients $g$ are independent of the<br />
weights (true for linear networks only), and given that $\mathbb{E}_w[w_{a,b}^p] =0$, we get<br />
<br />
\begin{equation*}<br />
\mathbb{E}_{w,input}[g(i,j,p-1)] = \sum_{a=0}^{k-1} \sum_{b=0}^{k-1} \mathbb{E}_w[w_{a,b}^p] \mathbb{E}_{input}[g(i+a,j+b,p)]=0,\\<br />
Var[g(i,j,p-1)] = \sum_{a=0}^{k-1} \sum_{b=0}^{k-1} Var[w_{a,b}^p] Var[g(i+a,j+b,p)]= C\sum_{a=0}^{k-1} \sum_{b=0}^{k-1} Var[g(i+a,j+b,p)].<br />
\end{equation*}<br />
<br />
Therefore, to get $Var[g(\cdot, \cdot, p-1)]$ we can convolve the gradient variance image $Var[g(\cdot, \cdot, p)]$ with a $k \times k$ kernel of 1's, and then multiply by $C$. Comparing this to the simplest case of all weights equal to one, we see that $g(\cdot, \cdot, 0)$ still has a Gaussian shape, with only a slight<br />
change: an extra constant factor $C^n$ on the variance gradient images, which does not affect the relative distribution within a receptive field.<br />
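The variance recursion above can be sketched in a few lines of numpy (my reconstruction, exploiting the fact that the ones kernel is separable):<br />

```python
import numpy as np

# Sketch (my reconstruction of the argument): the variance image of the
# gradient one layer down equals the current variance image convolved with a
# k x k kernel of ones, scaled by the weight variance C. Since the ones kernel
# is separable, the n-layer result is C^n times an outer product of 1D profiles.
def variance_image(n_layers, k=3, C=0.1):
    o = np.array([1.0])                  # 1D profile of the delta gradient
    for _ in range(n_layers):
        o = np.convolve(o, np.ones(k))   # convolve with a ones kernel
    return (C ** n_layers) * np.outer(o, o)

v = variance_image(10)
print(v.shape)                           # (21, 21)
c = v.shape[0] // 2
print(v[c, c] == v.max())                # True: peak at center, Gaussian-like decay
```

The constant factor $C^n$ rescales the whole image uniformly, so the relative (Gaussian-like) shape is unchanged, as stated above.<br />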
<br />
=== Non-uniform Kernels ===<br />
In the case of non-uniform weighting, when the $w(m)$'s are normalized (i.e. they sum to one), we can simply use characteristic functions to prove a central limit theorem for this case. Let $S_n = \sum_{i=1}^n X_i$, where the $X_i$'s are i.i.d.<br />
multinomial variables distributed according to the $w(m)$'s, i.e. $p(X_i = m) = w(m)$. As $n \to \infty$, the distribution of $\sqrt{n}(\frac{1}{n}S_n - E[X])$ converges to the Gaussian $N(0,Var[X])$ in distribution. We then have:<br />
<br />
\begin{equation*}<br />
E[S_n] = n\sum_{m=0}^{k-1} mw(m),\\<br />
Var[S_n] = n \left (\sum_{m=0}^{k-1} m^2w(m) - \left (\sum_{m=0}^{k-1} mw(m) \right )^2 \right ),<br />
\end{equation*}<br />
<br />
If we take one standard deviation as the effective receptive field (ERF) size which is roughly the radius of the ERF, then this size is<br />
$\sqrt{Var[S_n]} = \sqrt{nVar[X_i]} = \mathcal{O}(\sqrt{n})$.<br />
<br />
On the other hand, stacking more convolutional layers implies that the theoretical receptive field grows linearly, therefore relative to the theoretical receptive field, the ERF actually shrinks at a rate of $\mathcal{O}(1/\sqrt{n})$.<br />
<br />
With uniform weighting, we can see that the ERF size grows linearly with the kernel size $k$. Using $w(m) = \frac{1}{k}$:<br />
<br />
\begin{equation*}<br />
\sqrt{Var[S_n]} = \sqrt{n}\sqrt{\sum_{m=0}^{k-1}\frac{m^2}{k} - \bigg(\sum_{m=0}^{k-1}\frac{m}{k}\bigg)^2} = \sqrt{\frac{n(k^2-1)}{12}} = \mathcal{O}(k\sqrt{n})<br />
\end{equation*}<br />
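A quick numerical check (my own sketch) of this closed form: build the exact impact profile by repeated convolution with a uniform kernel and compare its standard deviation with <math>\sqrt{n(k^2-1)/12}</math>:<br />

```python
import numpy as np

# Numerical check (mine) of the formula above: the std of the impact profile
# from n uniform k-tap kernels should match sqrt(n * (k**2 - 1) / 12).
def profile_std(n, k):
    o = np.array([1.0])
    for _ in range(n):
        o = np.convolve(o, np.ones(k) / k)   # uniform weights w(m) = 1/k
    t = np.arange(len(o))
    mean = (t * o).sum()
    return float(np.sqrt(((t - mean) ** 2 * o).sum()))

n, k = 20, 5
print(round(profile_std(n, k), 4))                      # empirical std
print(round(float(np.sqrt(n * (k**2 - 1) / 12.0)), 4))  # the two values agree
```

The agreement is exact (up to floating-point error) because the profile is precisely the distribution of $S_n$.<br />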
<br />
=== Non-linear Activation Functions===<br />
<br />
The math in this section is a bit "hand-wavy", as one of their reviewers wrote, and their conclusion (Gaussian-shape ERF) is not really well backed up by their experiments. The most important point take away here is that by the introduction of a nonlinear activation function, the gradients depends on the network's input as well.<br />
<br />
=== Dropout ===<br />
Dropout is a technique that sets each unit in a neural network randomly to zero during training,<br />
which has found great success as a regularizer to prevent deep networks from<br />
over-fitting. The authors show that dropout does not change the Gaussian ERF<br />
shape.<br />
<br />
=== Subsampling and Dilated Convolutions ===<br />
Subsampling reduces the resolution of the convolutional feature maps, and makes<br />
each of the following convolutional layers operate on a larger scale. It is<br />
therefore a great way to increase the receptive field. Subsampling followed by<br />
convolutional layers can be equivalently implemented as changing all the<br />
convolutional layers after subsampling from dense convolutions to dilated<br />
convolutions. Thus we can apply the same theory we developed above to understand<br />
networks with subsampling layers. However, with exponentially growing receptive<br />
field introduced by the subsampling or exponentially dilated convolutions, many<br />
more layers are needed to see the Gaussian shape clearly.<br />
<br />
=== Skip Connections ===<br />
Skip connections are another type of popular architecture designs for deep<br />
neural networks in general. Recent state-of-the-art models for image<br />
recognition, in particular the Residual Networks (ResNets) make extensive use of<br />
skip connections. The ResNet architecture is composed of residual blocks, each<br />
residual block has two pathways, one is a path of q (usually 2) convolutional<br />
layers plus nonlinearity and batch-normalization, the other one is a path of a<br />
skip connection that goes directly from the input to the output. The output is<br />
simply a sum of the results of the two pathways. Authors don't have explicit<br />
expression for the ERF size for skip connection, but it is smaller than the<br />
biggest receptive field possible, which is achieved when the pathway that goes<br />
through the convolutional layers are chosen in all residual block.<br />
<br />
=== Remarks ===<br />
The authors notice us about three critical assumptions in the analyses above: (1) all layers in the CNN use the same set of convolution weights. This is in general not true, however, when we apply the analysis of variance, the weight variance on all layers are usually the same up to a constant factor. (2) The convergence derived is convergence “in distribution”, as implied by the central limit theorem. So this is neither converging almost surely nor in probability, or rather, we are not able to guarantee convergence on any single model. (3) Although CLT gives the limit distribution of $\frac{1}{\sqrt{n}} S_n$, the distribution of $S_n$ does not have a limit, and its "deviation" from a corresponding normal distribution can be large on some finite set, but it still is Gaussian in terms of overall shape.<br />
<br />
= Verifying Theoretical Results =<br />
In all of the following experiments, a gradient signal of 1 was placed at the center of the output plane and 0 everywhere else, and then this gradient was backpropagated through the network to get input gradients. Also, random inputs, as well as proper random initialization of the kernels, were employed.<br />
<br />
<br />
'''ERFs are Gaussian distributed:''' By looking at the figure, [[File:understanding_ERF_fig1.png|thumbnail||600px]] we can observe Gaussian shapes for uniformly and randomly weighted convolution kernels without nonlinear activations, and near-Gaussian shapes for randomly weighted kernels with nonlinearity. Adding the ReLU nonlinearity makes the distribution a bit less Gaussian, as the ERF distribution depends on the input as well. Another reason is that ReLU units output exactly zero for half of its inputs and it is very easy to get a zero output for the center pixel on the output plane, which means no path from the receptive field can reach the output, hence the gradient is all zero. Here the ERFs are averaged over 20 runs with different random seed. <br />
<br />
<br />
<br />
<br />
<br />
Figures below show the ERF for networks with 20 layers of random weights, with different nonlinearities. Here the results are averaged both across 100 runs with different random weights as well as different random inputs. In this setting, the receptive fields are a lot more Gaussian-like. <br />
<br />
[[File:understanding_ERF_fig2.png|thumbnail|centre|400px]]<br />
<br />
<br />
''' <math>\sqrt{n}</math> absolute growth and <math>1/\sqrt{n}</math> relative shrinkage:''' The figure [[File:understanding_ERF_fig4.png|thumbnail||600px]] shows the change of ERF size and the relative ratio of ERF over theoretical RF wrt number of convolution layers. The fitted line for ERF size has the slope of 0.56 in log domain, while the line for ERF ratio has the slope of -0.43. This indicates ERF size is growing linearly wrt <math>\sqrt{n}</math> and ERF ratio is shrinking linearly wrt <math>1/\sqrt{n}</math>.<br />
They used 2 standard deviations as the measurement for ERF size, i.e. any pixel with a value greater than (1 - 95.45%) = 4.55% of the center pixel's value is considered to be within the ERF. The ERF size is represented by the square root of the number of pixels within the ERF, while the theoretical RF size is the side length of the square in which every pixel has a non-zero impact on the output pixel, no matter how small. All experiments here are averaged over 20 runs.<br />
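The <math>\sqrt{n}</math> growth can be checked numerically under a simplifying assumption (linear 1-D layers with a uniform 3-tap kernel, no nonlinearity): the gradient profile at the input after n layers is the n-fold self-convolution of the kernel, whose standard deviation grows as <math>\sqrt{n}</math>, while the theoretical RF grows linearly in n.<br />

```python
import math

def gradient_profile(n_layers):
    """n-fold self-convolution of a uniform 3-tap kernel [1/3, 1/3, 1/3]:
    the input gradient of n stacked linear 1-D convolution layers."""
    prof = [1.0]
    for _ in range(n_layers):
        nxt = [0.0] * (len(prof) + 2)
        for i, p in enumerate(prof):
            for j in range(3):
                nxt[i + j] += p / 3.0
        prof = nxt
    return prof

def std_of_profile(prof):
    """Standard deviation of the (probability-like) gradient profile."""
    c = (len(prof) - 1) / 2.0               # center index
    mean_sq = sum(p * (i - c) ** 2 for i, p in enumerate(prof))
    return math.sqrt(mean_sq)

for n in (4, 16, 64):
    erf = 2 * std_of_profile(gradient_profile(n))   # ~2-sigma ERF radius
    rf = n                                          # theoretical RF radius grows linearly
    print(n, round(erf, 2), round(erf / rf, 3))
```

Quadrupling the depth doubles the 2-sigma ERF size, while the ERF/RF ratio halves, i.e. <math>\sqrt{n}</math> absolute growth and <math>1/\sqrt{n}</math> relative shrinkage.<br />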
<br />
<br />
'''Subsampling & dilated convolution increase the receptive field:''' The figure shows the effect of subsampling and dilated convolution. The reference baseline is a CNN with 15 dense convolution layers; its ERF is shown in the left-most figure. Replacing 3 of the 15 convolutional layers with stride-2 convolutions results in the ERF shown in the ‘Subsample’ figure. Finally, replacing those 3 convolutional layers with dilated convolutions with factors 2, 4 and 8 gives the ‘Dilation’ figure. Both approaches increase the effective receptive field significantly. Note the ‘Dilation’ figure shows a rectangular ERF shape, which is typical for dilated convolutions.<br />
<br />
[[File:understanding_ERF_fig3.png|thumbnail|centre|400px]]<br />
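The effect on the theoretical RF can be reproduced with the standard receptive-field recursion (the layer configurations below are illustrative sketches, not the paper's exact architecture):<br />

```python
def theoretical_rf(layers):
    """Theoretical RF size along one axis for a stack of conv layers.
    Each layer is (kernel_size, stride, dilation): the RF grows by
    (k - 1) * dilation * jump, and the jump multiplies by the stride."""
    rf, jump = 1, 1
    for k, stride, dilation in layers:
        rf += (k - 1) * dilation * jump
        jump *= stride
    return rf

dense = [(3, 1, 1)] * 15                                    # 15 dense 3x3 layers
subsample = [(3, 2, 1)] * 3 + [(3, 1, 1)] * 12              # 3 stride-2 layers first
dilated = [(3, 1, 1)] * 12 + [(3, 1, 2), (3, 1, 4), (3, 1, 8)]
print(theoretical_rf(dense), theoretical_rf(subsample), theoretical_rf(dilated))
```

Both stride and dilation enlarge the theoretical RF far beyond the dense baseline without adding parameters, which is why they also enlarge the ERF.<br />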
<br />
== How the ERF evolves during training ==<br />
<br />
The authors looked at how the ERF of units in the top-most convolutional layers of a classification CNN and a semantic segmentation CNN evolve during training. For both tasks, they adopted the ResNet architecture, which makes extensive use of skip-connections. As expected, their analysis showed the ERF of these networks is significantly smaller than the theoretical receptive field. Also, as the networks learn, the ERF grows, so that at the end of training it is significantly larger than the initial ERF. <br />
<br />
The classification network was a ResNet with 17 residual blocks trained on the CIFAR-10 dataset. The figure shows the ERF on the 32x32 image space at the beginning of training (with randomly initialized weights) and at the end of training, when the network reaches its best validation accuracy. Note that the theoretical receptive field of the network is actually 74x74, bigger than the image size, but the ERF does not fill the image completely. Comparing the results before and after training demonstrates that the ERF has grown significantly.<br />
<br />
[[File:understanding_ERF_fig5.png|thumbnail|centre|500px]]<br />
<br />
The semantic segmentation network was trained on the CamVid dataset for urban scene segmentation. The 'front-end' of the model was a purely convolutional network that predicted the output at a slightly lower resolution: a ResNet with 16 residual blocks interleaved with 4 subsampling operations, each with a factor of 2. Due to the subsampling operations, the output was 1/16 of the input size. For this model, the theoretical RF of the top convolutional layer units was 505x505. However, as the figure shows, the ERF covered only a fraction of that, with a diameter of about 100 at the beginning of training that grew to about 150 by the end of training.<br />
<br />
= Reduce Gaussian Damage =<br />
The Effective Receptive Field (ERF) usually decays quickly from the centre (like 2D Gaussian) and only takes a small portion of the theoretical Receptive Field (RF). This "Gaussian damage" is undesirable for tasks that require a large RF and to reduce it, the authors suggested two solutions:<br />
#'''New Initialization scheme''' to make the weights at the center of the convolution kernel smaller and the weights on the outside larger, which diffuses the concentration on the center out to the periphery. One way to implement this is to initialize the network with any initialization method, and then scale the weights according to a distribution that has a lower scale at the center and a higher scale on the outside. They tested this solution for the CIFAR-10 classification task, with several random seeds. In a few cases, they got a 30% speed-up of training compared to the more standard initializations, but overall the benefit of this method is not always significant. This is only a partial solution since, no matter what is done to change the initial weights, the ERF maintains a Gaussian distribution. Weight initialization is important for deep learning models: initial weights that are too large may result in exploding values during forward propagation or back propagation. One popular weight initialization method is Xavier initialization, proposed by Glorot and Bengio. <br />
<br />
#'''Architectural changes of CNNs''' is the 'better' approach that may change the ERF in more fundamental ways. For example, instead of connecting each unit in a CNN to a local rectangular convolution window, we can sparsely connect each unit to a larger area in the lower layer using the same number of connections. Dilated convolution belongs to this category, but we may push even further and use sparse connections that are not grid-like.<br />
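The initialization rescaling described in solution (1) above could be sketched as follows (one illustrative scheme with a hypothetical radial scaling factor, not the authors' exact method): amplify each kernel weight by a factor that grows with its distance from the kernel center, then renormalize so the kernel's overall sum is preserved.<br />

```python
import math

def periphery_scale(weights):
    """Rescale a k x k kernel so weights far from the center are amplified
    relative to the center (illustrative scheme, not the paper's exact one)."""
    k = len(weights)
    c = (k - 1) / 2.0
    scaled = [[w * (1.0 + math.hypot(i - c, j - c))    # scale grows with radius
               for j, w in enumerate(row)]
              for i, row in enumerate(weights)]
    total_in = sum(map(sum, weights))
    total_out = sum(map(sum, scaled))
    norm = total_in / total_out                        # preserve the kernel's sum
    return [[w * norm for w in row] for w_row, row in zip(weights, scaled) for row in [row]][:k] if False else [[w * norm for w in row] for row in scaled]

kernel = [[1.0] * 5 for _ in range(5)]                 # start from a uniform 5x5 kernel
out = periphery_scale(kernel)
print(round(out[2][2], 3), round(out[0][0], 3))        # center vs corner weight
```

After rescaling, corner weights exceed the center weight while the kernel's total sum is unchanged, pushing gradient mass toward the periphery at initialization.<br />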
<br />
= Discussion =<br />
<br />
'''Connection to biological neural networks''': The authors established through their analysis that the ERF grows a lot slower than previously thought, which indicates that a lot of local information is still preserved even after many convolutional layers. This also contradicts some long-held notions about deep biological networks. Another relevant observation from their analysis is that convolutional networks may automatically create a form of foveal representation.<br />
<br />
'''Connection to previous work on CNNs''': Though receptive fields in CNNs have not been studied extensively, some previous works on the topic explore how the variance does not change much when going through the network, which was used to develop a good initialization scheme for convolution layers. Researchers have also used visualization to show the importance of using natural-image priors and what the activation of a convolutional layer represents. Deconvolutional nets have been used to show the relation between pixels in the image and the neurons that are firing. <br />
<br />
= Summary & Conclusion =<br />
The authors showed, theoretically and experimentally, that the distribution of impact within the receptive field (the effective receptive field) is asymptotically Gaussian, and the ERF only takes up a fraction of the full theoretical receptive field. They also studied the effects of some standard CNN approaches on the effective receptive field. They found that dropout does not change the Gaussian ERF shape. Subsampling and dilated convolutions are effective ways to increase receptive field size quickly but skip-connections make ERFs smaller.<br />
<br />
They argued that since larger ERFs are required for higher performance, new methods to achieve larger ERF will not only help the network to train faster but may also improve performance.<br />
<br />
= Critique = <br />
<br />
The authors' finding on $\sqrt{n}$ absolute growth of Effective Receptive Field (ERF) suffers from a discrepancy in ERF definition between their theoretical analysis and their experiments. Namely, in the theoretical analysis for the non-uniform-kernel case, they considered one standard deviation as the ERF size. However, they used two standard deviations as the measure for ERF size in the experiments.<br />
<br />
It would be more practical if the paper also investigated the ERF for natural images (as opposed to random) as network input at least in the two cases where they examined trained networks. <br />
<br />
The authors claim that the ERF results in the experimental section have Gaussian shapes, but they never prove this claim. For example, they could fit different 2D functions, including a 2D Gaussian, to the kernels and show that the 2D Gaussian gives the best fit. Furthermore, the pictures given as proof of the claim that the ERF has a Gaussian distribution only show the ERF of the center pixel of the output, <math> y_{0,0} </math>. Intuitively, the ERF of a node near the boundary of the output layer may have a significantly different shape. This was not addressed in the paper.<br />
<br />
Another weakness is in the discussion section, where they make a connection to the biological networks. They jumped to disprove a well-observed phenomenon in the brain. The fact that the neurons in the higher areas of the visual hierarchy gradually lose their retinotopic property has been shown in a countless number of neuroscience studies. For example, [https://en.wikipedia.org/wiki/Grandmother_cell grandmother cells] do not care about the position of grandmother's face in the visual field. In general, the similarity between deep CNNs and biological visual systems is not as strong, hence we should take any generalization from CNNs to biological networks with a grain of salt.<br />
<br />
Spectrograms are visual representations of audio where the axes represent time, frequency and the amplitude of each frequency. The ERF of a CNN, when applied to a spectrogram, doesn't necessarily have to be Gaussian and concentrated at the center. In fact, many receptive fields are trained to look for peaks, troughs and cliffs, which implies that the ERF may place more weight towards the periphery rather than the center.<br />
<br />
The paper talks about what the ERF represents and how it can be increased, but doesn't say how the ERF can be used to improve model accuracy by changing the configuration of the network, say the depth of the network or the kernel sizes. In addition, since the ERF is an important component of Region-CNN and can provide useful information during object detection, it would be better if the authors had added an analysis of how different ERF properties influence the mAP in object detection.<br />
<br />
= References =<br />
[1] Wenjie Luo, Yujia Li, Raquel Urtasun, and Richard Zemel. "Understanding the effective receptive field in deep convolutional neural networks." In Advances in Neural Information Processing Systems, pp. 4898-4906. 2016.<br />
<br />
[2] Buessler, J.-L., Smagghe, P., & Urban, J.-P. (2014). Image receptive fields for artificial neural networks. Neurocomputing, 144(Supplement C), 258–270. https://doi.org/10.1016/j.neucom.2014.04.045<br />
<br />
[3] Dilated Convolutions in Neural Network - [http://www.erogol.com/dilated-convolution/]<br />
<br />
[4] http://cs231n.github.io/convolutional-networks/<br />
<br />
[5] Thorsten Neuschel. "A note on extended binomial coefficients." Journal of Integer Sequences, 17(2):3, 2014.</div>Jdenghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=When_can_Multi-Site_Datasets_be_Pooled_for_Regression%3F_Hypothesis_Tests,_l2-consistency_and_Neuroscience_Applications:_Summary&diff=29145When can Multi-Site Datasets be Pooled for Regression? Hypothesis Tests, l2-consistency and Neuroscience Applications: Summary2017-11-02T18:00:03Z<p>Jdeng: /* Lasso Regression and Model Selection: */</p>
<hr />
<div><br />
This page is a summary for this ICML 2017 paper[1].<br />
== Introduction ==<br />
===Some Basic Concepts and Issues===<br />
While the challenges posed by large-scale datasets are compelling, one is often faced with a fairly distinct set of technical issues in studies in the biological and health sciences. For instance, a sizable portion of scientific research is carried out by small or medium sized groups supported by modest budgets. Hence, there are financial constraints on the number of experiments and/or the number of participants within a trial, leading to small datasets. Similar datasets from multiple sites can be pooled to potentially improve statistical power and address this issue. In reality, interesting follow-up questions often arise during the course of a study or experiment; the purpose of the paper is to explore when pooling such additional data with the original dataset helps produce a reliable estimate.<br />
====Regression Problems====<br />
Ridge and Lasso regression are powerful techniques generally used for creating parsimonious models in the presence of a ‘large’ number of features. Here ‘large’ can typically mean either of two things[2]:<br />
*Large enough to enhance the tendency of a model to overfit (as low as 10 variables might cause overfitting)<br />
*Large enough to cause computational challenges. With modern systems, this situation might arise in case of millions or billions of features<br />
====Ridge Regression and Overfitting:====<br />
Ridge Regression is a technique for analyzing multiple regression data that suffer from multicollinearity [9]. When multicollinearity occurs, least squares estimates are unbiased, but their variances are large so they may be far from the true value. By adding a degree of bias to the regression estimates, ridge regression reduces the standard errors. It is hoped that the net effect will be to give estimates that are more reliable.<br />
<br />
Ridge regression is commonly used in econometrics and in machine learning. When fitting a model, unnecessary inputs or inputs with collinearity can produce disastrously large coefficients (with a large variance).<br />
Ridge regression performs L2 regularization, i.e. it adds a factor of sum of squared coefficients in the optimization objective. Thus, ridge regression optimizes the following:<br />
*''Objective = RSS + λ * (sum of square of coefficients)''<br />
Note that performing ridge regression is equivalent to minimizing the RSS (residual sum of squares) under the constraint that the sum of squared coefficients is less than some function of λ, say s(λ). Ridge regression usually relies on cross-validation: the model is trained on the training set with different values of λ, optimizing the above objective function, and each of those models (each trained with a different λ) is then tested on the validation set to evaluate its performance.<br />
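In the single-predictor case, ridge regression has a closed form that makes the shrinkage easy to see (a pure-Python sketch with made-up data, not from the tutorial cited above): for a centered predictor, <math>\hat{\beta} = \sum x_i y_i / (\sum x_i^2 + \lambda)</math>.<br />

```python
def ridge_1d(x, y, lam):
    """Closed-form ridge estimate for a single centered predictor:
    beta = sum(x*y) / (sum(x^2) + lambda)."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + lam)

x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-4.1, -1.9, 0.2, 2.1, 3.9]          # roughly y = 2x + noise
for lam in (0.0, 1.0, 10.0, 1e6):        # lam = 0 recovers least squares
    print(lam, round(ridge_1d(x, y, lam), 4))
```

As λ grows, the coefficient shrinks smoothly from the least-squares value toward (but never exactly to) zero, which is exactly the behaviour a cross-validated λ trades off against fit.<br />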
<br />
====Lasso Regression and Model Selection:====<br />
LASSO stands for Least Absolute Shrinkage and Selection Operator. <br />
Lasso regression performs L1 regularization, i.e. it adds a factor of sum of absolute value of coefficients in the optimization objective. Thus, lasso regression optimizes the following.<br />
*''Objective = RSS + λ * (sum of absolute value of coefficients)''<br />
#λ = 0: same coefficients as simple linear regression<br />
#λ = ∞: all coefficients zero (same logic as before)<br />
#0 < λ < ∞: coefficients between 0 and those of simple linear regression<br />
A feature of Lasso regression is its job as a selection operator, i.e. it usually shrinks a part of coefficients to zero, while keeping the values of other coefficients. Thus it can be used in opting unnecessary coefficients out of the model.<br />
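The shrink-to-exactly-zero behaviour comes from the soft-thresholding operator, which solves the one-dimensional Lasso problem (a minimal sketch, assuming standardized predictors):<br />

```python
def soft_threshold(z, lam):
    """Solution of min_b 0.5*(b - z)^2 + lam*|b|: shrink z toward 0 and
    clip to exactly 0 once |z| <= lam. This is why Lasso selects variables."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

# Least-squares coefficients of four predictors; Lasso with lam = 1.0
ols = [3.0, -2.5, 0.4, -0.7]
lasso = [soft_threshold(b, 1.0) for b in ols]
print(lasso)          # -> [2.0, -1.5, 0.0, 0.0]
```

Large coefficients are shrunk by a constant amount while small ones are set exactly to zero, opting them out of the model; ridge's L2 penalty, by contrast, shrinks all coefficients but never produces exact zeros.<br />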
<br />
To describe this, let us rewrite the Lasso regression $\min_\beta ||y-X\beta||^2+\lambda||\beta||_1$ and ridge regression $\min_\beta ||y-X\beta||^2+\lambda||\beta||_2^2$ in their equivalent constrained forms<br />
\[<br />
\text{Ridge Regression} \quad \min_\beta ||y-X\beta||^2, \quad \text{subject to } || \beta ||_2 \leq t <br />
\]<br />
<br />
\[<br />
\text{Lasso Regression} \quad \min_\beta ||y-X\beta||^2, \quad \text{subject to } || \beta ||_1 \leq t <br />
\]<br />
Then the graph from Chapter 3 of Hastie et al. (2009) demonstrates how the Lasso regression shrinks a part of coefficients to zero.<br />
[[File:lasso.jpg|thumb|alt=Alt text|]]<br />
<br />
Another type of regression model that is worth mentioning here is what we call Elastic Net Regression. This type of regression model is utilizing both L1 and L2 regularization, namely combining the regularization techniques used in lasso regression and ridge regression together in the objective function. This type of regression could also be of possible interest to be applied in the context of this paper. Its objective function is shown below, where we can see both the sum of absolute value of coefficients and the sum of square of coefficients are included: <br />
<math> \hat{\beta} = \underset{\beta}{\operatorname{argmin}} \; ||y - X \beta||^2 + \lambda_2 ||\beta||_2^2 + \lambda_1||\beta||_1 </math><br />
<br />
====Bias-Variance Trade-Off====<br />
The bias is error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).<br />
The variance is error from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data, rather than the intended outputs (overfitting)[3].<br />
Mean square error (MSE) is defined by (variance + squared bias).<br />
A thing to mention is the following theorem:<br />
* For ridge regression, there '''exists''' a certain λ such that the MSE of coefficients calculated by ridge regression is smaller than that calculated by direct regression.<br />
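The theorem can be verified analytically in the one-predictor case (illustrative numbers; the formulas follow from the closed-form 1-D ridge estimator <math>\hat{\beta} = S_{xy}/(S_{xx}+\lambda)</math>, whose bias is <math>-\lambda\beta/(S_{xx}+\lambda)</math> and whose variance is <math>\sigma^2 S_{xx}/(S_{xx}+\lambda)^2</math>):<br />

```python
def ridge_mse(beta, sxx, sigma2, lam):
    """MSE (= variance + squared bias) of the 1-D ridge estimator
    beta_hat = Sxy / (Sxx + lam) when y = beta*x + N(0, sigma2) noise."""
    denom = sxx + lam
    bias = -lam * beta / denom
    var = sigma2 * sxx / denom ** 2
    return var + bias ** 2

beta, sxx, sigma2 = 2.0, 10.0, 4.0        # illustrative true values
mse_ols = ridge_mse(beta, sxx, sigma2, 0.0)    # lam = 0 is plain least squares
mse_ridge = ridge_mse(beta, sxx, sigma2, 0.5)  # a small positive lam
print(round(mse_ols, 4), round(mse_ridge, 4))
```

A small positive λ trades a little squared bias for a larger drop in variance, so the ridge MSE dips below the least-squares MSE, exactly as the theorem asserts.<br />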
<br />
===Related Work===<br />
====Meta-analysis approaches====<br />
Meta-analysis is a statistical analysis that combines the results of several studies. There are several methods for non-imaging meta-analysis: p-value combining, fixed effects models, random effects models, and meta-regression. When datasets at different sites cannot be shared or pooled, various strategies exist that cumulate the general findings from analyses on different datasets. However, minor violations of assumptions can lead to misleading scientific conclusions (Greco et al., 2013), and substantial personal judgment (and expertise) is needed to conduct them.<br />
<br />
====Domain adaptation/shift====<br />
The idea of addressing “shift” within datasets has been rigorously studied within statistical machine learning. However, these approaches focus on the algorithm itself and do not address the issue of whether pooling the datasets, after applying the calculated adaptation (i.e., transformation), is beneficial. The goal in this work is to assess whether multiple datasets can be pooled — either before or, usually, after applying the best domain adaptation methods — for improving the estimation of the relevant coefficients within linear regression. A hypothesis test is proposed to directly address this question.<br />
<br />
====Simultaneous High dimensional Inference====<br />
Simultaneous high-dimensional inference models are an active research topic in statistics. Multi sample-splitting uses half of the dataset for feature selection and the remaining portion for calculating p-values. The authors use contributions in this area to extend their results to a higher dimensional setting.<br />
<br />
==The Hypothesis Test==<br />
The hypothesis test to evaluate statistical power improvements (e.g., in mean squared error) when running a regression model on a pooled dataset is discussed below. If β corresponds to the coefficient vector (i.e., predictor weights), then the regression model is<br />
*<math>min_{β} \frac{1}{n}\left \Vert y-Xβ \right \|_2^2</math> ........ (1) <br />
Where $X ∈ R^{n×p}$ and $y ∈ R^{n×1}$ denote the feature matrix of predictors and the response vector respectively. <br />
If k denotes the number of sites, a domain adaptation scheme needs to be applied to account for the distributional shifts between the k different predictors <math>\lbrace X_i \rbrace_{i=1}^{k} </math>, and then a regression model is run. If the underlying “concept” (i.e., the relationship between predictors and responses) can be assumed to be the same across the different sites, then it is reasonable to impose the same β for all sites. For example, the influence of CSF protein measurements on the cognitive scores of an individual may be invariant to demographics. If the distributional mismatch correction is imperfect, we may define ∆βi = βi − β∗, where i ∈ {1,...,k}, as the residual difference between the site-specific coefficients and the true shared coefficient vector (in the ideal case, ∆βi = 0)[1]. <br />
This gives the multi-site regression equation (Eq. 2), where <math>\tau_i</math> is the weighting parameter for each site<br />
*<math>min_{β} \displaystyle \sum_{i=1}^k {\tau_i^2\left \Vert y_i-X_iβ \right \|_2^2}</math> ......... (2)<br />
where for each site i we have $y_i = X_iβ_i +\epsilon_i$ and $\epsilon_i ∼ N (0, σ^2_i) $<br />
<br />
===Separate Regression or Shared Regression ?===<br />
Since the underlying relationship between predictors and responses is the same across the different datasets (from which the data are pooled), the estimates of <math>\beta_i</math> across all k sites are restricted to be the same. Without this constraint, (3) is equivalent to fitting a regression separately on each site. To explore whether this constraint improves estimation, the mean squared error (MSE) needs to be examined[1]. Hence, using site 1 as the reference, setting <math>\tau_1</math> = 1 in (2) and considering <math>\beta*=\beta_1</math>,<br />
*<math>min_{β} \frac{1}{n}\left \Vert y_1-X_1β \right \|_2^2 + \displaystyle \sum_{i=2}^k {\tau_i^2\left \Vert y_i-X_iβ \right \|_2^2}</math> .........(3)<br />
To evaluate whether MSE is reduced, we first need to quantify the change in the bias and variance of (3) compared to (1).<br />
<br />
====Case 1: Sharing all <math>\beta</math>s====<br />
<math>n_i </math>: sample size of site i <br/><br />
<math>\hat{β}_i </math>: regression estimate from a specific site i. <br/><br />
<math>\Delta β^T </math>: length ''kp'' vector<br/><br />
<math>\hat{\Sigma}_i </math>: the sample covariance matrix of the predictors from site i.<br/><br />
<math>G \in\mathbb{R}^{(k-1)p \times (k-1)p} </math>: the covariance matrix of <math>\Delta\hat{β} </math>, with <math>G_{ii}=\left(n_1\hat{\Sigma}_1 \right)^{-1} + \left(n_i\tau_i^2\hat{\Sigma}_i \right)^{-1} </math> and <math>G_{ij}=\left(n_1\hat{\Sigma}_1 \right)^{-1} </math>, <math>i\neq j </math><br/><br />
<br />
[[File:Equation_4567.png|thumb|alt=Alt text|]]Lemma 2.2 bounds the increase in bias and the reduction in variance. Theorem 2.3 is the authors' main test result. Although <math>\sigma_i</math> is typically unknown, it can easily be replaced by its site-specific estimate. Theorem 2.3 implies that we can conduct a test based on a non-central <math>\chi^2</math> distribution of the statistic.<br />
<br />
<br />
Theorem 2.3 implies that the sites, in fact, do not even need to share the full dataset to assess whether pooling will be useful. Instead, the test only requires very high-level statistical information such as <math>\hat{\beta}_i,\hat{\Sigma}_i,\sigma_i</math> and <math>n_i</math> for all participating sites – which can be transferred without computational overhead. <br />
<br />
One can find R code for the hypothesis test for Case 1 in https://github.com/hzhoustat/ICML2017 as provided by the authors. In particular the Hypotest_allparam.R script provides the hypothesis test whereas Simultest_allparam.R provides some simulation examples that illustrate the application of the test under various different settings.<br />
<br />
====Case 2: Sharing a subset of <math>\beta</math>s====<br />
For example, socio-economic status may (or may not) have a significant association with a health outcome (response) depending on the country of the study (e.g., insurance coverage policies). Unlike Case 1, <math>\beta</math> cannot be considered to be the same across all sites. The model in (3) will now include another design matrix of predictors <math>Z\in R^{n*q} </math>and corresponding coefficients <math>\gamma_i</math> for each site i,<br />
<br />
<br />
<math>min_{β,\gamma} \sum_{i=1}^{k}\tau_i^2\left \Vert y_i-X_iβ-Z_i\gamma_i \right \|_2^2</math> ... (9)<br />
<br />
where<br />
<br />
<math>y_i=X_i \beta^* + X_i \Delta \beta_i + Z_i \gamma_i^* + \epsilon_i, \tau_1=1</math> ... (10)<br />
<br />
<br />
While evaluating whether the MSE of <math>\beta</math> reduces, the MSE change in <math>\gamma</math> is ignored because those coefficients correspond to site-specific variables. If <math>\hat{\beta}</math> is close to the “true” <math>\beta^*</math>, it will also enable a better estimation of the site-specific variables[1].<br />
<br />
One can find R code for the hypothesis test for Case 2 in https://github.com/hzhoustat/ICML2017 as provided by the authors. In particular the Hypotest_subparam.R script provides the hypothesis test whereas Simultest_subparam.R provides some simulation examples that illustrate the application of the test under various different settings.<br />
<br />
==Sparse Multi-Site Lasso and High Dimensional Pooling==<br />
Pooling multi-site data in the high-dimensional setting, where the number of predictors p is much larger than the number of subjects n (p >> n), leads to a high-sparsity regime in which many variables have coefficients tending to 0. Lasso variable selection helps in selecting the right coefficients for representing the relationship between the predictors and the response.<br />
<br />
===<math>\ell_2</math>-consistency===<br />
----<br />
[[File:MSEs and Hypothesis Test Results.png|thumb|alt=Alt text|MSE vs Sample Size plots]]<br />
[[File:Sparse Multisite Lasso.png|thumb|300x500|alt=Alt text|Sparse Multi-Site Lasso]]In the setting of asymptotic analysis and approximations, the Lasso estimator is not variable-selection consistent if the "Irrepresentable Condition" fails[7]. The Irrepresentable Condition states that Lasso selects the true model consistently if and (almost) only if the predictors that are not in the true model are “irrepresentable” by the predictors that are in the true model. This means that even if the exact sparsity pattern is not recovered, the estimator can still be a good approximation to the truth; it also suggests that, for Lasso, estimation consistency may be easier to achieve than variable-selection consistency. In classical regression, <math>\ell_2</math>-consistency properties are well known. Imposing the same <math>\beta</math> across sites works in (3) because its consistency is understood. In contrast, in the case where p >> n, one cannot enforce a shared coefficient vector for all sites before the active set of predictors within each site is selected — directly imposing the same <math>\beta</math> leads to a loss of <math>\ell_2</math>-consistency, making follow-up analysis problematic. Therefore, once a suitable model for high-dimensional multi-site regression is chosen, the first requirement is to characterize its consistency.<br />
<br />
===Sparse Multi-Site Lasso Regression===<br />
The sparse multi-site Lasso variant is chosen because multi-task Lasso underperforms when the sparsity pattern of predictors is not identical across sites[4]. The hyperparameter <math>\alpha\in [0, 1]</math> balances the L1 penalty against the group-Lasso penalty on groups of features. By contrast, the multi-task Lasso generalizes the Lasso to the multi-task setting by replacing the L1-norm regularization with a sum of sup-norm penalties[8].<br />
*Larger <math>\alpha</math> weighs the L1 penalty more<br />
*Smaller <math>\alpha</math> puts more weight on the grouping. <br />
Note that α = 0.97 discovers more always-active features, while preserving the ratio of correctly discovered active features to all the discovered ones (MSE vs Sample Size plots, panel (c)).<br />
<br />
Similar to a Lasso-based regularization parameter, <math>\lambda</math> here will produce a solution path (to select coefficients) for a given <math>\alpha</math>[1].<br />
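A sketch of how the combined penalty trades off the two terms (the function below is illustrative, not the authors' code; betas[i][j] denotes the coefficient of feature j at site i, and the group term ties each feature together across sites):<br />

```python
import math

def sms_penalty(betas, lam, alpha):
    """Sparse multi-site Lasso penalty: alpha weighs the elementwise L1
    term; (1 - alpha) weighs the group term over each feature's
    coefficients across sites."""
    l1 = sum(abs(b) for site in betas for b in site)
    p = len(betas[0])
    group = sum(math.sqrt(sum(site[j] ** 2 for site in betas)) for j in range(p))
    return lam * (alpha * l1 + (1 - alpha) * group)

# Two sites, three features; feature 0 is active at both sites.
betas = [[1.0, 0.0, 0.5],
         [1.0, 0.3, 0.0]]
print(round(sms_penalty(betas, 1.0, 1.0), 4))   # alpha = 1: pure L1
print(round(sms_penalty(betas, 1.0, 0.0), 4))   # alpha = 0: pure grouping across sites
```

With α = 1 the penalty reduces to a plain Lasso at every site; with α = 0 the group term makes a feature that is already active at one site cheaper to activate at the others, which is the grouping behaviour a small α is meant to encourage.<br />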
<br />
===Setting the hyperparameter <math>\alpha </math> using Simultaneous Inference===<br />
Step 1: They apply simultaneous inference (like multi sample-splitting or the de-biased Lasso) using all features at each of the k sites, with family-wise error rate (FWER) control. This step yields “site-active” features for each site and, therefore, gives the set of always-active features and the sparsity patterns.<br />
<br />
<br />
Step 2: Then, each site runs a Lasso and chooses a λi based on cross-validation, and λmulti-site is set to be the minimum among the best λs from each site. Using λmulti-site, we can vary α to fit various sparse multi-site Lasso models – each run will select some number of always-active features. Then α is plotted versus the number of always-active features.<br />
<br />
<br />
Step 3: Finally, based on the sparsity patterns from the site-active sets, they estimate whether the sparsity patterns across sites are similar or different (i.e., share few active features). Then, based on the plot from step (2), if the sparsity patterns from the site-active sets are different (similar) across sites, the smallest (largest) value of α that selects the minimum (maximum) number of always-active features is chosen.<br />
<br />
==Experiments==<br />
There are 2 distinct experiments described:<br />
#Performing simulations to evaluate the hypothesis test and sparse multi-site Lasso; <br />
#Pooling 2 Alzheimer's Disease datasets and examining the improvements in statistical power. This experiment was also done with the view of evaluating whether pooling is beneficial for regression and whether it yields tangible benefits in investigating scientific hypotheses[1].<br />
<br />
===Power and Type I Error===<br />
<br />
#The first set of simulations evaluate '''Case 1 (Sharing all β):''' The simulations are repeated 100 times with 9 different sample sizes. As n increases, both MSEs decrease (two-site model and baseline single site model), and the test tends to reject pooling the multi-site data.<br />
#The second set of simulations evaluates '''Case 2 variables (Sharing subset of β):''' For small n, MSE of two-site model is much smaller than baseline, and as sample size increases this difference reduces. The test accepts with high probability for small n,and as sample size increases it rejects with high power.<br />
<br />
===SMS Lasso L2 Consistency===<br />
In order to test the Sparse Multi-Site Model, the case where sparsity patterns are shared is considered separately from the case where they are not shared. Here, 4 sites with n = 150 samples each and p = 400 features were used.<br />
#Few Sparsity Patterns Shared: 6 shared features and 14 site-specific features (out of the 400) are set to be active in the 4 sites. The chosen <math>\alpha</math> = 0.97 has the smallest error across all <math>\lambda</math>s, thereby implying better <math>\ell_2</math>-consistency. <math>\alpha</math> = 0.97 also discovers more always-active features, while preserving the ratio of correctly discovered active features to all the discovered ones.<br />
#Most Sparsity Patterns Shared: 16 shared and 4 site-specific features are set to be active among all 400 features. The proposed choice of <math>\alpha</math> = 0.25 preserves the correctly discovered number of always-active features, and the ratio of correctly discovered active features to all discovered features increases here.<br />
<br />
===Combining AD Datasets from Multiple Sites===<br />
Pooling is evaluated empirically on a neuroscience problem: combining 2 Alzheimer's disease datasets from different sources, ADNI (Alzheimer’s Disease Neuroimaging Initiative) and ADlocal (Wisconsin ADRC), with sample sizes 318 and 156 respectively. Cerebrospinal fluid (CSF) protein levels are the inputs, and the response is hippocampus volume. Using 81 age-matched samples from each dataset, domain adaptation is first performed (using a maximum mean discrepancy objective as a measure of distance between the two marginals), and the CSF proteins from ADlocal are then transformed to match ADNI. The main aim is to evaluate whether adding ADlocal data to ADNI will improve the regression performed on ADNI. This is done by training a regression model on the ‘transformed’ ADlocal data together with a subset of the ADNI data, and then testing the resulting model on the remaining ADNI samples.<br />
*The results show that pooling after transformation is at least as good as using the ADNI data alone, and the hypothesis test accepts pooling. The rejection power of the test increases as n increases, and the test rejects pooling if it is performed without domain adaptation[1].<br />
<br />
==Conclusion==<br />
The following are the contributions by the authors' research.<br />
#The main result is a hypothesis test to evaluate whether pooling data across multiple sites for regression (before or after correcting for site-specific distributional shifts) can improve the estimation (mean squared error) of the relevant coefficients (while permitting an influence from a set of confounding variables). <br />
#Show how pooling can be used ( in certain regimes of high dimensional and standard linear regression) even when the features are different across sites. For this the authors show the <math>\ell_2</math>-consistency rate which supports the use of spare-multi-task Lasso when sparsity patterns are not identical<br />
#Experimental results showing consistent acceptance power for early Alzheimer’s detection (AD) in humans, where data are pooled from different sites.<br />
<br />
==References==<br />
#Hao Henry Zhou, Yilin Zhang, Vamsi K. Ithapu, Sterling C. Johnson, Grace Wahba, Vikas Singh, When can Multi-Site Datasets be Pooled for Regression? Hypothesis Tests, <math>\ell_2</math>-consistency and Neuroscience Applications, ICML 2017<br />
#https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-ridge-lasso-regression-python/<br />
#Understanding the Bias-Variance Tradeoff - Scott Fortmann Roe [http://scott.fortmann-roe.com/docs/BiasVariance.html Link]<br />
#G Swirszcz, AC Lozano, Multi-level lasso for sparse multi-task regression, ICML 2012<br />
# A Visual representation L1, L2 Regularization - https://www.youtube.com/watch?v=sO4ZirJh9ds<br />
# Why does L1 induce sparse weights? https://www.youtube.com/watch?v=jEVh0uheCPk<br />
# Meinshausen, Nicolai and Yu, Bin. Lasso-type recovery of sparse representations for high-dimensional data. The Annals of Statistics.<br />
# Liu, Han, Palatucci, Mark, and Zhang, Jian. Blockwise coordinate descent procedures for the multi-task lasso, with applications to neural semantic basis discovery. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 649–656. ACM, 2009<br />
# http://ncss.wpengine.netdna-cdn.com/wp-content/themes/ncss/pdf/Procedures/NCSS/Ridge_Regression.pdf</div>Jdenghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:lasso.jpg&diff=29144File:lasso.jpg2017-11-02T17:59:28Z<p>Jdeng: </p>
<hr />
<div></div>When can Multi-Site Datasets be Pooled for Regression? Hypothesis Tests, l2-consistency and Neuroscience Applications: Summary (2017-11-02, Jdeng: /* Lasso Regression and Model Selection: */)
<hr />
<div><br />
This page is a summary for this ICML 2017 paper[1].<br />
== Introduction ==<br />
===Some Basic Concepts and Issues===<br />
While the challenges posed by large-scale datasets are compelling, studies in the biological and health sciences often face a fairly distinct set of technical issues. A sizable portion of scientific research is carried out by small or medium-sized groups supported by modest budgets. Hence, there are financial constraints on the number of experiments and/or the number of participants within a trial, leading to small datasets. Similar datasets from multiple sites can be pooled to potentially improve statistical power and address this issue. The purpose of this paper is to determine when such pooling, possibly after correcting for site-specific distributional shifts, actually improves the resulting regression estimates.<br />
====Regression Problems====<br />
Ridge and Lasso regression are powerful techniques generally used for creating parsimonious models in the presence of a ‘large’ number of features. Here ‘large’ can typically mean either of two things[2]:<br />
*Large enough to enhance the tendency of a model to overfit (as low as 10 variables might cause overfitting)<br />
*Large enough to cause computational challenges. With modern systems, this situation might arise in case of millions or billions of features<br />
====Ridge Regression and Overfitting:====<br />
Ridge Regression is a technique for analyzing multiple regression data that suffer from multicollinearity [9]. When multicollinearity occurs, least squares estimates are unbiased, but their variances are large so they may be far from the true value. By adding a degree of bias to the regression estimates, ridge regression reduces the standard errors. It is hoped that the net effect will be to give estimates that are more reliable.<br />
<br />
Ridge regression is commonly used in econometrics and in machine learning. When fitting a model, unnecessary inputs or inputs with collinearity can produce disastrously large coefficients (with a large variance).<br />
Ridge regression performs L2 regularization, i.e. it adds a factor of the sum of squared coefficients to the optimization objective. Thus, ridge regression optimizes the following:<br />
*''Objective = RSS + λ * (sum of square of coefficients)''<br />
Note that performing ridge regression is equivalent to minimizing the RSS (Residual Sum of Squares) under the constraint that the sum of squared coefficients is less than some function of λ, say s(λ). In practice, λ is chosen by cross-validation: the model is trained on the training set with several values of λ, and each of those models is then evaluated on the validation set to pick the best-performing one.<br />
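As a concrete illustration (not from the paper), the ridge objective has a closed-form minimizer, which a minimal numpy sketch makes explicit:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate: beta = (X'X + lam*I)^{-1} X'y.
    lam = 0 recovers ordinary least squares."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.0, 0.5]
y = X @ beta_true + 0.1 * rng.normal(size=n)

beta_ols = ridge_fit(X, y, 0.0)
beta_ridge = ridge_fit(X, y, 5.0)
# The L2 penalty shrinks the coefficient vector toward zero.
print(np.linalg.norm(beta_ridge) < np.linalg.norm(beta_ols))
```

Increasing `lam` shrinks the coefficient norm, trading a little bias for a reduction in variance.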
<br />
====Lasso Regression and Model Selection:====<br />
LASSO stands for Least Absolute Shrinkage and Selection Operator. <br />
Lasso regression performs L1 regularization, i.e. it adds a factor of sum of absolute value of coefficients in the optimization objective. Thus, lasso regression optimizes the following.<br />
*''Objective = RSS + λ * (sum of absolute value of coefficients)''<br />
#λ = 0: Same coefficients as simple linear regression<br />
#λ = ∞: All coefficients zero (same logic as before)<br />
#0 < λ < ∞: coefficients between 0 and those of simple linear regression<br />
A feature of Lasso regression is that it acts as a selection operator: it usually shrinks some of the coefficients exactly to zero while keeping the values of the others. Thus it can be used to drop unnecessary predictors from the model.<br />
<br />
To describe this, let us rewrite the Lasso regression $\min_\beta ||y-X\beta||^2+\lambda||\beta||_1$ and ridge regression $\min_\beta ||y-X\beta||^2+\lambda||\beta||_2^2$ in their constrained (dual) forms<br />
\[<br />
\text{Ridge Regression} \quad \min_\beta ||y-X\beta||^2, \quad \text{subject to } || \beta ||_2 \leq t <br />
\]<br />
<br />
\[<br />
\text{Lasso Regression} \quad \min_\beta ||y-X\beta||^2, \quad \text{subject to } || \beta ||_1 \leq t <br />
\]<br />
Then the graph from Chapter 3 of Hastie et al. (2009) demonstrates how the Lasso regression shrinks a part of coefficients to zero.<br />
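This shrink-to-zero behaviour can also be demonstrated numerically; below is a minimal proximal-gradient (ISTA) sketch of the Lasso (illustrative only, not from the paper):

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of the L1 norm: shrinks v toward 0 and clips at 0."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """ISTA (proximal gradient) for (1/2)||y - X b||^2 + lam * ||b||_1."""
    L = np.linalg.norm(X, 2) ** 2        # Lipschitz constant of the gradient
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y)
        beta = soft_threshold(beta - grad / L, lam / L)
    return beta

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 20))
beta_true = np.zeros(20)
beta_true[:4] = [3.0, -2.0, 1.5, 1.0]
y = X @ beta_true + 0.1 * rng.normal(size=100)

beta_hat = lasso_ista(X, y, lam=20.0)
# Most coefficients are shrunk exactly to zero; the strong signals survive.
print(int((np.abs(beta_hat) > 1e-8).sum()))
```

Unlike ridge regression, which only shrinks coefficients, the soft-thresholding step sets many of them to exactly zero, which is why the Lasso doubles as a selection operator.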
<br />
<br />
Another type of regression model worth mentioning here is Elastic Net regression. This model uses both L1 and L2 regularization, combining the penalty terms of lasso regression and ridge regression in one objective function, and could also be of interest in the context of this paper. Its objective function is shown below, where both the sum of absolute values of the coefficients and the sum of squared coefficients appear: <br />
<math> \hat{\beta} = \underset{\beta}{\operatorname{argmin}} \; ||y - X \beta||^2 + \lambda_2 ||\beta||_2^2 + \lambda_1 ||\beta||_1 </math><br />
<br />
====Bias-Variance Trade-Off====<br />
The bias is error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).<br />
The variance is error from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data, rather than the intended outputs (overfitting)[3].<br />
Mean squared error (MSE) equals the variance plus the squared bias.<br />
A classical result worth noting:<br />
* For ridge regression, there '''exists''' a certain λ > 0 such that the MSE of the coefficients calculated by ridge regression is smaller than that calculated by ordinary (unregularized) least-squares regression.<br />
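This existence claim can be checked numerically for a fixed design, since the ridge estimator's exact MSE decomposes into a variance term plus a squared-bias term (a sketch, not from the paper; noise variance is assumed to be 1):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 10
X = rng.normal(size=(n, p))
beta_true = rng.normal(size=p)

def ridge_mse(lam, sigma2=1.0):
    """Exact MSE = variance + squared bias of the ridge estimator
    for fixed design X and true coefficients beta_true."""
    A = np.linalg.inv(X.T @ X + lam * np.eye(p))
    variance = sigma2 * np.trace(A @ X.T @ X @ A)
    bias = lam * A @ beta_true    # E[beta_hat] - beta_true = -lam * A * beta
    return variance + bias @ bias

# A small positive lam beats ordinary least squares (lam = 0) in MSE.
print(min(ridge_mse(l) for l in (0.1, 0.5, 1.0, 2.0)) < ridge_mse(0.0))
```

The variance term is decreasing in λ while the squared bias grows only quadratically near λ = 0, so a small positive λ always reduces the total.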
<br />
===Related Work===<br />
====Meta-analysis approaches====<br />
Meta-analysis is a statistical analysis which combines the results of several studies. There are several methods for non-imaging meta-analysis: p-value combining, fixed-effects models, random-effects models, and meta-regression. When datasets at different sites cannot be shared or pooled, such strategies aggregate the general findings from analyses on the separate datasets. However, minor violations of assumptions can lead to misleading scientific conclusions (Greco et al., 2013), and substantial personal judgment (and expertise) is needed to conduct them.<br />
<br />
====Domain adaptation/shift====<br />
The idea of addressing “shift” within datasets has been rigorously studied within statistical machine learning. However, these approaches focus on the adaptation algorithm itself and do not address whether pooling the datasets, after applying the calculated adaptation (i.e., transformation), is beneficial. The goal of this work is to assess whether multiple datasets can be pooled (either before or, usually, after applying the best domain adaptation methods) to improve the estimation of the relevant coefficients within linear regression. A hypothesis test is proposed to directly address this question.<br />
<br />
====Simultaneous High dimensional Inference====<br />
Simultaneous high-dimensional inference models are an active research topic in statistics. Multi sample-splitting uses half of the dataset for feature selection and the remaining half for calculating p-values. The authors use contributions in this area to extend their results to the higher-dimensional setting.<br />
<br />
==The Hypothesis Test==<br />
The hypothesis test to evaluate statistical power improvements (e.g., reduction in mean squared error) when running a regression model on a pooled dataset is discussed below. If β denotes the coefficient vector (i.e., the predictor weights), then the regression model is<br />
*<math>\min_{\beta} \frac{1}{n}\left\Vert y-X\beta \right\Vert_2^2</math> ........ (1) <br />
where $X \in R^{n \times p}$ and $y \in R^{n \times 1}$ denote the feature matrix of predictors and the response vector respectively. <br />
If k denotes the number of sites, a domain adaptation scheme needs to be applied to account for the distributional shifts between the k different predictors <math>\lbrace X_i \rbrace_{i=1}^{k} </math>, and then a regression model is run. If the underlying “concept” (i.e., the relationship between predictors and responses) can be assumed to be the same across the different sites, then it is reasonable to impose the same β for all sites. For example, the influence of CSF protein measurements on the cognitive scores of an individual may be invariant to demographics. If the distributional mismatch correction is imperfect, we may define <math>\Delta \beta_i = \beta_i - \beta^*</math>, where <math>i \in \{1,\dots,k\}</math>, as the residual difference between the site-specific coefficients and the true shared coefficient vector (in the ideal case, <math>\Delta \beta_i = 0</math>)[1].<br />
Therefore we derive the multi-site regression equation (Eq. 2), where <math>\tau_i</math> is the weighting parameter for each site:<br />
*<math>\min_{\beta} \displaystyle \sum_{i=1}^k {\tau_i^2\left\Vert y_i-X_i\beta \right\Vert_2^2}</math> ......... (2)<br />
where for each site i we have $y_i = X_iβ_i +\epsilon_i$ and $\epsilon_i ∼ N (0, σ^2_i) $<br />
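Objective (2) is a weighted least-squares problem over the stacked site data, so the shared estimate can be computed in one solve; a minimal numpy sketch (illustrative, not the authors' code):

```python
import numpy as np

def pooled_beta(Xs, ys, taus):
    """Minimize sum_i tau_i^2 ||y_i - X_i beta||^2 by scaling each site's
    data by tau_i and solving one stacked least-squares problem."""
    Xw = np.vstack([t * X for t, X in zip(taus, Xs)])
    yw = np.concatenate([t * y for t, y in zip(taus, ys)])
    beta, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
    return beta

rng = np.random.default_rng(3)
p = 5
beta_true = rng.normal(size=p)
Xs = [rng.normal(size=(100, p)) for _ in range(2)]           # two sites
ys = [X @ beta_true + 0.01 * rng.normal(size=100) for X in Xs]
beta_hat = pooled_beta(Xs, ys, taus=[1.0, 1.0])
# When the sites truly share beta, pooling recovers it accurately.
print(float(np.max(np.abs(beta_hat - beta_true))))
```

When <math>\Delta \beta_i \neq 0</math>, the same solver returns a compromise between the site-specific coefficients, which is exactly the bias the hypothesis test must weigh against the variance reduction.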
<br />
===Separate Regression or Shared Regression ?===<br />
Since the underlying relationship between predictors and responses is assumed to be the same across the different datasets being pooled, the estimates of <math>\beta_i</math> across all k sites are restricted to be the same. Without this constraint, (3) is equivalent to fitting a regression separately on each site. To explore whether this constraint improves estimation, the mean squared error (MSE) needs to be examined[1]. Hence, using site 1 as the reference, setting <math>\tau_1 = 1</math> in (2), and taking <math>\beta^*=\beta_1</math>,<br />
*<math>\min_{\beta} \frac{1}{n}\left\Vert y_1-X_1\beta \right\Vert_2^2 + \displaystyle \sum_{i=2}^k {\tau_i^2\left\Vert y_i-X_i\beta \right\Vert_2^2}</math> .........(3)<br />
To evaluate whether MSE is reduced, we first need to quantify the change in the bias and variance of (3) compared to (1).<br />
<br />
====Case 1: Sharing all <math>\beta</math>s====<br />
<math>n_i </math>: sample size of site i <br/><br />
<math>\hat{β}_i </math>: regression estimate from a specific site i. <br/><br />
<math>\Delta β^T </math>: length ''(k-1)p'' vector (stacking <math>\Delta \beta_2,\dots,\Delta \beta_k</math>)<br/><br />
<math>\hat{\Sigma}_i </math>: the sample covariance matrix of the predictors from site i.<br/><br />
<math>G \in\mathbb{R}^{(k-1)p \times (k-1)p} </math>: the covariance matrix of <math>\Delta\hat{β} </math>, with <math>G_{ii}=\left(n_1\hat{\Sigma}_1 \right)^{-1} + \left(n_i\tau_i^2\hat{\Sigma}_i \right)^{-1} </math> and <math>G_{ij}=\left(n_1\hat{\Sigma}_1 \right)^{-1} </math>, <math>i\neq j </math><br/><br />
<br />
[[File:Equation_4567.png|thumb|alt=Alt text|]]Lemma 2.2 bounds the increase in bias and the reduction in variance. Theorem 2.3 is the authors' main test result. Although <math>\sigma_i</math> is typically unknown, it can easily be replaced by its site-specific estimate. Theorem 2.3 implies that we can conduct a test based on a non-central <math>\chi^2</math> distribution of the statistic.<br />
<br />
<br />
Theorem 2.3 implies that the sites, in fact, do not even need to share the full dataset to assess whether pooling will be useful. Instead, the test only requires very high-level statistical information such as <math>\hat{\beta}_i,\hat{\Sigma}_i,\sigma_i</math> and <math>n_i</math> for all participating sites – which can be transferred without computational overhead. <br />
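Since only <math>\hat{\beta}_i,\hat{\Sigma}_i,\sigma_i</math> and <math>n_i</math> are needed, the covariance matrix G defined above can be assembled directly from those shared summaries; a sketch following the block structure given in the text:

```python
import numpy as np

def build_G(Sigmas, ns, taus):
    """G has (k-1) x (k-1) blocks of size p x p, with site 1 as reference:
    G_ii = (n_1 Sigma_1)^{-1} + (n_i tau_i^2 Sigma_i)^{-1},
    G_ij = (n_1 Sigma_1)^{-1} for i != j (sites i, j = 2..k)."""
    k = len(Sigmas)
    p = Sigmas[0].shape[0]
    base = np.linalg.inv(ns[0] * Sigmas[0])
    G = np.zeros(((k - 1) * p, (k - 1) * p))
    for i in range(1, k):
        for j in range(1, k):
            block = base.copy()
            if i == j:
                block += np.linalg.inv(ns[i] * taus[i] ** 2 * Sigmas[i])
            G[(i - 1) * p:i * p, (j - 1) * p:j * p] = block
    return G

# Identity sample covariances at 3 sites of 10 samples each, tau_i = 1:
G = build_G([np.eye(2)] * 3, ns=[10, 10, 10], taus=[1.0, 1.0, 1.0])
```

Each site only transmits a p x p matrix, a p-vector, and two scalars, which is what makes the test practical without sharing raw data.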
<br />
One can find R code for the hypothesis test for Case 1 in https://github.com/hzhoustat/ICML2017 as provided by the authors. In particular the Hypotest_allparam.R script provides the hypothesis test whereas Simultest_allparam.R provides some simulation examples that illustrate the application of the test under various different settings.<br />
<br />
====Case 2: Sharing a subset of <math>\beta</math>s====<br />
For example, socio-economic status may (or may not) have a significant association with a health outcome (response) depending on the country of the study (e.g., insurance coverage policies). Unlike Case 1, not all coefficients can be assumed to be the same across sites. The model in (3) will now include another design matrix of site-specific predictors <math>Z\in R^{n \times q} </math> and corresponding coefficients <math>\gamma_i</math> for each site i,<br />
<br />
<br />
<math>\min_{\beta,\gamma} \sum_{i=1}^{k}\tau_i^2\left\Vert y_i-X_i\beta-Z_i\gamma_i \right\Vert_2^2</math> ... (9)<br />
<br />
where<br />
<br />
<math>y_i=X_i \beta^* + X_i \Delta \beta_i + Z_i \gamma_i^* + \epsilon_i, \tau_1=1</math> ... (10)<br />
<br />
<br />
While evaluating whether the MSE of <math>\beta</math> is reduced, the change in the MSE of <math>\gamma</math> is ignored because it corresponds to site-specific variables. If <math>\hat{\beta}</math> is close to the “true” <math>\beta^*</math>, it will also enable a better estimation of the site-specific variables[1].<br />
<br />
One can find R code for the hypothesis test for Case 2 in https://github.com/hzhoustat/ICML2017 as provided by the authors. In particular the Hypotest_subparam.R script provides the hypothesis test whereas Simultest_subparam.R provides some simulation examples that illustrate the application of the test under various different settings.<br />
<br />
==Sparse Multi-Site Lasso and High Dimensional Pooling==<br />
Pooling multi-site data in the high-dimensional setting, where the number of predictors p is much larger than the number of subjects n (p >> n), requires a sparsity assumption: only a small subset of the coefficients is nonzero. Lasso-type variable selection helps select the right coefficients for representing the relationship between the predictors and the responses.<br />
<br />
===<math>\ell_2</math>-consistency===<br />
----<br />
[[File:MSEs and Hypothesis Test Results.png|thumb|alt=Alt text|MSE vs Sample Size plots]]<br />
[[File:Sparse Multisite Lasso.png|thumb|300x500|alt=Alt text|Sparse Multi-Site Lasso]]In asymptotic analysis, the Lasso estimator is not variable-selection consistent if the "irrepresentable condition" fails[7]: Lasso selects the true model consistently if and (almost) only if the predictors that are not in the true model are "irrepresentable" by the predictors that are in the true model. This means that even when the exact sparsity pattern is not recovered, the estimator can still be a good approximation to the truth; it also suggests that, for Lasso, estimation consistency may be easier to achieve than variable-selection consistency. In classical regression, <math>\ell_2</math>-consistency properties are well known, and imposing the same <math>\beta</math> across sites works in (3) because we understand its consistency. In contrast, when p >> n, one cannot enforce a shared coefficient vector for all sites before the active set of predictors within each site is selected: directly imposing the same <math>\beta</math> leads to a loss of <math>\ell_2</math>-consistency, making follow-up analysis problematic. Therefore, once a suitable model for high-dimensional multi-site regression is chosen, the first requirement is to characterize its consistency.<br />
<br />
===Sparse Multi-Site Lasso Regression===<br />
The sparse multi-site (SMS) Lasso variant is chosen because the multi-task Lasso underperforms when the sparsity pattern of predictors is not identical across sites[4]. The hyperparameter <math>\alpha\in [0, 1]</math> balances two penalties: an L1 regularization term and a group-Lasso penalty over groups of features. (The multi-task Lasso, in contrast, generalizes the Lasso to the multi-task setting by replacing the L1-norm regularization with a sum of sup-norm regularizers[8].)<br />
*Larger <math>\alpha</math> weighs the L1 penalty more<br />
*Smaller <math>\alpha</math> puts more weight on the grouping. <br />
Note that, in the experiments below, α = 0.97 discovers more always-active features while preserving the ratio of correctly discovered active features to all discovered ones (see MSE vs Sample Size plots, panel (c)).<br />
<br />
Similar to a Lasso-based regularization parameter, <math>\lambda</math> here will produce a solution path (to select coefficients) for a given <math>\alpha</math>[1].<br />
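To make the role of <math>\alpha</math> concrete, here is a hypothetical helper computing a sparse group-Lasso-style penalty of this form (an illustration only; the paper's exact weighting of the two terms may differ):

```python
import numpy as np

def sms_penalty(B, alpha, lam):
    """B is a (p, k) matrix: one coefficient vector per site, stacked as
    columns. alpha weighs the elementwise L1 term (site-specific sparsity);
    (1 - alpha) weighs the row-wise group L2 term (coupling across sites)."""
    l1 = np.abs(B).sum()
    group = np.linalg.norm(B, axis=1).sum()   # one L2 norm per feature row
    return lam * (alpha * l1 + (1.0 - alpha) * group)

B = np.array([[3.0, 4.0],    # feature active at both sites
              [0.0, 0.0]])   # feature inactive everywhere
print(sms_penalty(B, alpha=0.5, lam=1.0))  # 0.5*7 + 0.5*5 = 6.0
```

With α near 1 the penalty acts like independent Lassos per site; with α near 0 the group term dominates and whole feature rows are kept or killed across all sites together.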
<br />
===Setting the hyperparameter <math>\alpha </math> using Simultaneous Inference===<br />
Step 1: Apply simultaneous inference (e.g., multi sample-splitting or de-biased Lasso) using all features at each of the k sites with FWER (family-wise error rate) control. This step yields “site-active” features for each site and therefore gives the set of always-active features and the sparsity patterns.<br />
<br />
<br />
Step 2: Each site runs a Lasso and chooses a <math>\lambda_i</math> based on cross-validation. Then set <math>\lambda_{multi-site}</math> to be the minimum among the best <math>\lambda_i</math>s from each site. Using <math>\lambda_{multi-site}</math>, vary <math>\alpha</math> to fit various sparse multi-site Lasso models; each run will select some number of always-active features. Then plot <math>\alpha</math> versus the number of always-active features.<br />
<br />
<br />
Step 3: Finally, based on the sparsity patterns from the site-active sets, estimate whether the sparsity patterns across sites are similar or different (i.e., share few active features). Then, using the plot from step (2): if the sparsity patterns are different across sites, choose the smallest value of <math>\alpha</math> that selects the minimum number of always-active features; if they are similar, choose the largest value of <math>\alpha</math> that selects the maximum number.<br />
<br />
==Experiments==<br />
There are 2 distinct experiments described:<br />
#Performing simulations to evaluate the hypothesis test and sparse multi-site Lasso; <br />
#Pooling 2 Alzheimer's Disease datasets and examining the improvements in statistical power. This experiment was also done with the view of evaluating whether pooling is beneficial for regression and whether it yields tangible benefits in investigating scientific hypotheses[1].<br />
<br />
===Power and Type I Error===<br />
<br />
#The first set of simulations evaluate '''Case 1 (Sharing all β):''' The simulations are repeated 100 times with 9 different sample sizes. As n increases, both MSEs decrease (two-site model and baseline single site model), and the test tends to reject pooling the multi-site data.<br />
#The second set of simulations evaluates '''Case 2 variables (Sharing subset of β):''' For small n, MSE of two-site model is much smaller than baseline, and as sample size increases this difference reduces. The test accepts with high probability for small n,and as sample size increases it rejects with high power.<br />
<br />
===SMS Lasso L2 Consistency===<br />
In order to test the Sparse Multi-Site Model, the case where sparsity patterns are shared is considered separately from the case where they are not shared. Here, 4 sites with n = 150 samples each and p = 400 features were used.<br />
#Few Sparsity Patterns Shared: 6 shared features and 14 site-specific features (out of the 400) are set to be active in the 4 sites. The chosen <math>\alpha</math> = 0.97 has the smallest error across all <math>\lambda</math>s, thereby implying better <math>\ell_2</math>-consistency. <math>\alpha</math> = 0.97 discovers more always-active features, while preserving the ratio of correctly discovered active features to all the discovered ones.<br />
#Most Sparsity Patterns Shared: 16 shared and 4 site-specific features (out of the 400) are set to be active. The proposed choice of <math>\alpha</math> = 0.25 preserves the correctly discovered number of always-active features. The ratio of correctly discovered active features to all discovered features increases here.<br />
<br />
===Combining AD Datasets from Multiple Sites===<br />
Pooling is evaluated empirically on a neuroscience problem: combining two Alzheimer's disease datasets from different sources, ADNI (Alzheimer's Disease Neuroimaging Initiative) and ADlocal (Wisconsin ADRC), with sample sizes 318 and 156 respectively. Cerebrospinal fluid (CSF) protein levels are the inputs, and the response is hippocampus volume. Using 81 age-matched samples from each dataset, domain adaptation is first performed (using a maximum mean discrepancy objective as a measure of distance between the two marginals) to transform the CSF proteins from ADlocal to match ADNI. The main aim is to evaluate whether adding ADlocal data to ADNI improves the regression performed on ADNI. This is done by training a regression model on the ‘transformed’ ADlocal data together with a subset of the ADNI data, and then testing the resulting model on the remaining ADNI samples.<br />
*The results show that pooling after transformation is at least as good as using ADNI data alone, thereby accepting the hypothesis test. The test rejection power increases with increase in n. The strategy rejects the pooling test if performed without domain adaptation[1].<br />
<br />
==Conclusion==<br />
The following are the contributions by the authors' research.<br />
#The main result is a hypothesis test to evaluate whether pooling data across multiple sites for regression (before or after correcting for site-specific distributional shifts) can improve the estimation (mean squared error) of the relevant coefficients (while permitting an influence from a set of confounding variables). <br />
#Show how pooling can be used (in certain regimes of high-dimensional and standard linear regression) even when the features are different across sites. For this, the authors show an <math>\ell_2</math>-consistency rate which supports the use of the sparse multi-site Lasso when sparsity patterns are not identical.<br />
#Experimental results showing consistent acceptance power for early Alzheimer’s detection (AD) in humans, where data are pooled from different sites.<br />
<br />
==References==<br />
#Hao Henry Zhou, Yilin Zhang, Vamsi K. Ithapu, Sterling C. Johnson, Grace Wahba, Vikas Singh, When can Multi-Site Datasets be Pooled for Regression? Hypothesis Tests, <math>\ell_2</math>-consistency and Neuroscience Applications, ICML 2017<br />
#https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-ridge-lasso-regression-python/<br />
#Understanding the Bias-Variance Tradeoff - Scott Fortmann Roe [http://scott.fortmann-roe.com/docs/BiasVariance.html Link]<br />
#G Swirszcz, AC Lozano, Multi-level lasso for sparse multi-task regression, ICML 2012<br />
# A Visual representation L1, L2 Regularization - https://www.youtube.com/watch?v=sO4ZirJh9ds<br />
# Why does L1 induce sparse weights? https://www.youtube.com/watch?v=jEVh0uheCPk<br />
# Meinshausen, Nicolai and Yu, Bin. Lasso-type recovery of sparse representations for high-dimensional data. The Annals of Statistics.<br />
# Liu, Han, Palatucci, Mark, and Zhang, Jian. Blockwise coordinate descent procedures for the multi-task lasso, with applications to neural semantic basis discovery. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 649–656. ACM, 2009<br />
# http://ncss.wpengine.netdna-cdn.com/wp-content/themes/ncss/pdf/Procedures/NCSS/Ridge_Regression.pdf</div>
STAT946F17/ Learning a Probabilistic Latent Space of Object Shapes via 3D GAN (2017-10-25, Jdeng: /* 3D-VAE-GANs */)
<hr />
<div>= Introduction =<br />
In this work, a novel method for 3D object generation is presented. This framework, 3D Generative Adversarial Networks (3D-GAN), is an extension of GANs for 2D image generation. A latent vector representation is sampled from a probabilistic space and passed through a set of volumetric convolutional layers, resulting in a novel generated 3D object. The benefits of this approach are three-fold:<br />
# the use of adversarial criterion, in place of traditional heuristic criteria, allows the generator to implicitly capture object structure leading to high quality and novel 3D objects<br />
# the GAN learns a mapping from latent space to the space of generated objects automatically allowing it to bypass the need for reference CAD models when generating new 3D samples<br />
# the adversarial discriminator can learn, in an unsupervised manner, a powerful 3D shape descriptor (i.e., feature vector), that is widely applicable and performs competitively in 3D object recognition.<br />
From the experimental results, the authors show that the features learned without supervision achieve excellent performance on 3D object recognition, comparable to supervised learning methods.<br />
<br />
=== Related Work ===<br />
Existing methods<br />
* Borrow parts from objects in existing CAD model libraries → generate realistic but not novel samples<br />
* Learn deep object representations based on voxelized objects → fail to capture highly structured differences between 3D objects<br />
* Mostly learn based on a supervised criterion → limited to the objects in the dataset<br />
<br />
According to Karpathy et al. in their OpenAI blog post, there are 3 popular generative model approaches that have been widely adopted, namely Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs) and Recurrent Neural Networks (RNNs) (Karpathy et al., 2016). All of these methods have their own competitive edges and the first 2 approaches are the main focus of research in this paper. In order to construct a more comprehensive picture of generative modelling, RNN approach or more specifically PixelRNN approach will be briefly explained below. <br />
<br />
The fundamental concept of PixelRNN is to model each pixel within an image based on the previous pixels and their RGB color values (Oord, Kalchbrenner, & Kavukcuoglu, 2016, p. 2). More specifically, PixelRNN "model[s] the pixels as discrete values using a multinomial distribution implemented with a simple softmax layer." (Oord, Kalchbrenner, & Kavukcuoglu, 2016, p. 2). This concept can be expressed mathematically as follows: the goal of the model is to predict the probability of the next pixel <math>x_i</math> given all previous pixels <math>x_1, …, x_{i - 1} </math>. Furthermore, the probability of each pixel <math> P(x_i) </math> is split over the 3 color channels (red, green & blue, or RGB). Thus, the probability of a pixel <math>x_i</math> can be written as:<br />
<center><br />
<math><br />
P(x_i)= P(x_{i, R}\mid x_1, …, x_{i - 1})\, P(x_{i, G}\mid x_{i, R}, x_1, …, x_{i - 1})\, P(x_{i, B}\mid x_{i, R}, x_{i, G}, x_1, …, x_{i - 1})<br />
</math><br />
</center><br />
In Oord, Kalchbrenner, & Kavukcuoglu‘s paper, they estimated <math> P(x_i) </math> as “as a discrete distribution, with every conditional distribution being a multinomial that is modeled with a softmax layer.” (Oord, Kalchbrenner, & Kavukcuoglu, 2016, p. 3). However, a major limitation of this method is the computationally heavy process of sampling because of the sequence of conditional probabilities. Thus, more efficient approaches such as 3D-GANs and 3D-VAE-GANs are introduced in this paper. <br />
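The chain-rule factorization above can be sketched numerically: given per-pixel softmax logits produced by any autoregressive model, the joint log-likelihood is just the sum of conditional log-probabilities (illustrative only; the logits here are placeholders, not a trained PixelRNN):

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D logit vector."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def joint_log_prob(logits_seq, values):
    """Chain-rule log-likelihood: logits_seq[i] holds the model's 256-way
    softmax logits for pixel i given pixels 0..i-1 (raster-scan order),
    and values[i] is the observed intensity (0..255)."""
    return float(sum(np.log(softmax(l)[v]) for l, v in zip(logits_seq, values)))

# Uniform logits assign probability 1/256 to every intensity.
logits = [np.zeros(256)] * 3
print(joint_log_prob(logits, [0, 17, 255]))  # equals 3 * log(1/256)
```

Sampling must evaluate these conditionals pixel by pixel, which is exactly the sequential cost that motivates the GAN-based alternatives discussed in the paper.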
<br />
Oord et al. [10] presented a deep neural network that predicts the image pixels sequentially along the two spatial dimensions. Their method models the probability of the raw pixel values, thereby encoding the complete set of dependencies in the image. <br />
<br />
Karpathy et al. [11] describe some projects which helps in enhancing or utilising the generative models. Particularly, they speak about how to improve GANs, VAEs. Also, they introduce InfoGAN which is an extension of GAN which can learn disentangled and interpretable representations related to images.<br />
<br />
= Methodology =<br />
Let us first review Generative Adversarial Networks (GANs). As proposed in Goodfellow et al. [2014], GANs consist of a generator and a discriminator, where the discriminator tries to classify real objects and objects synthesized by the generator, while the generator attempts to confuse the discriminator. With proper guidance, this adversarial game will result in a generator that is able to synthesize fake samples very similar to the real training samples that the discriminator cannot distinguish. This can be thought of as a zero-sum or minimax two player game. The analogy typically used is that the generative model is like ''a team of counterfeiters, trying to produce and use fake currency'' while the discriminative model is like ''the police, trying to detect the counterfeit currency''. The generator is trying to fool the discriminator while the discriminator is trying to not get fooled by the generator. As the models train through alternating optimization, both methods are improved until a point where the ''counterfeits are indistinguishable from the genuine articles''.<br />
<br />
=== 3D-GANs ===<br />
In this paper, 3D Generative Adversarial Networks (3D-GANs) are presented as a simple extension of GANs for 2D imagery. Here, the model is composed of i) a generator (G) which maps a 200-dimensional latent vector $z$ to a 64 x 64 x 64 cube, representing the object G($z$) in voxel space, and ii) a discriminator (D) which outputs a confidence value D($x$) of whether a 3D object input $x$ is real or synthetic. Following Goodfellow et al. [2014], the classification loss used at the end of the discriminator is binary cross-entropy as<br />
<br />
$L_{3D-GAN} = \log D(x) + \log \big(1 − D(G(z)) \big)$<br />
<br />
where $x$ is a real object in the 64 x 64 x 64 space, and $z$ is noise randomly sampled from a distribution $p(z)$; in this work, each coefficient of $z$ is drawn from Uniform[0,1]. The dimensions of $z$ are independent and identically distributed (i.i.d.) random variables, i.e. they all share the same probability distribution and are mutually independent.<br />
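As a minimal numeric sketch of this loss (with scalar confidences standing in for a real discriminator, an assumption made purely for illustration):<br />

```python
import numpy as np

def l_3dgan(d_real, d_fake):
    """log D(x) + log(1 - D(G(z))) for one real/fake pair, where
    d_real = D(x) and d_fake = D(G(z)) are confidences in (0, 1)."""
    return np.log(d_real) + np.log(1.0 - d_fake)

# z with i.i.d. Uniform[0,1] coefficients, as described above
z = np.random.default_rng(0).uniform(0.0, 1.0, size=200)

loss = l_3dgan(d_real=0.9, d_fake=0.1)  # a discriminator doing well
```

A confident discriminator (high $D(x)$, low $D(G(z))$) drives this value toward 0 from below; D is trained to increase it while G counteracts.<br />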
<br />
=== 3D-VAE-GANs ===<br />
<br />
3D-VAE-GANs extend 3D-GANs by introducing an additional image encoder (E), which takes as input a 2D image $y$ and outputs the latent representation vector $z$. Inspired by the work of Larsen et al. [2015] on VAE-GANs, the addition of the E component allows a mapping between 2D images and their 3D shapes to be learned simultaneously with the adversarial training of GANs that learn to generate synthetic but realistic 3D objects. This means that after training, a 2D image can be input to the 3D-VAE-GAN network, resulting in a realistic rendering of the corresponding 3D object for that 2D image. One would expect this network to perform better than a single VAE that takes 2D images as input and outputs 3D shapes; unfortunately, the authors do not provide any comparison between such setups.<br />
<br />
The loss function for the 3D-VAE-GAN is similar to that of the VAE-GAN. These loss functions have the following form:<br />
<br />
\begin{equation}L = L_{3D-GAN} + \alpha_{1}L_{KL} + \alpha_{2}L_{recon},\label{eq1}\end{equation}<br />
<br />
where $\alpha_{1}$ and $\alpha_{2}$ are the weights on the KL divergence and reconstruction terms. $L_{recon}$ is the reconstruction loss, $L_{3D-GAN}$ is the adversarial cross-entropy loss, and $L_{KL}$ is the KL divergence loss. <br />
<br />
As depicted in the figure on the right [from Larsen et al., 2015], the setup of VAE-GAN shows that VAE and GAN are combined by sharing the decoder of VAE with the generator of GAN.<br />
<br />
As outlined in the supplementary material of this paper, the training of 3D-VAE-GAN is done in the steps below. Here, $y_i$ is a 2D image and $x_i$ is its corresponding 3D shape. In each training iteration $t$, a random sample $z_t$ is generated from $\mathcal{N}(0, I)$, and the discriminator (D), image encoder (E), and generator (G) are updated in turn.<br />
<br />
[[File:amirhk_vae_gan.png|right|650px]]<br />
<br />
* Step 1: Update the discriminator D by minimizing the following loss function:<br />
$$\log D(x_i) + \log \big(1 − D(G(z_t)) \big)$$<br />
<br />
(Although the problem is not exactly the same as that studied by Larsen et al. [2015], better results may be observed when using samples from $q(z|y)$ (i.e. the encoder) in addition to the prior $p(z)$ in the<br />
GAN objective:<br />
\[<br />
\log D(x_i) + \log \big( 1- D(G(z_t)) \big) -\log \big( 1-D(G(E(y_i))) \big).<br />
\]<br />
This is because the negative sample $G(E(y_i))$ is much more likely to resemble the real shape $x_i$ than $G(z_t)$ is. When updating according to $L_{GAN}$, Larsen et al. [2015] suspect<br />
that having similar positive and negative samples makes for a more useful learning signal.)<br />
<br />
* Step 2: Update the image encoder E by minimizing the following loss function:<br />
$$\alpha_1D_{KL} \big(\mathcal{N}(E_{mean}(y_i), E_{var}(y_i)) \big|\big| \mathcal{N}(0, I) \big) + \alpha_2\big|\big|G(E(y_i)) − x_i\big|\big|_2$$<br />
<br />
where $E_{mean}(y_i)$ and $E_{var}(y_i)$ are the predicted mean and variance of the latent variable $z$, respectively.<br />
<br />
* Step 3: Update the generator G by minimizing the following loss function:<br />
$$\log \big(1 − D(G(z_t)) \big) + \alpha_2\big|\big|G(E(y_i)) − x_i\big|\big|_2$$<br />
<br />
(In Steps 2 and 3, the formulae in the original supplementary material do not include $\alpha_1$ and $\alpha_2$. Given the form of the combined loss function above, this appears to be a typo.)<br />
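The three alternating updates can be sketched as a single function returning the losses minimized in Steps 1-3. The networks are passed in as plain callables, the shapes are shrunk for illustration, and the weight values $\alpha_1 = 5$, $\alpha_2 = 10^{-4}$ are assumptions, so this is a sketch rather than the authors' implementation:<br />

```python
import numpy as np

def train_step(x, y, D, E_mean, E_var, G, a1=5.0, a2=1e-4, rng=None):
    """One 3D-VAE-GAN iteration (sketch): returns the three losses that
    Steps 1-3 minimize for D, E and G respectively."""
    rng = rng or np.random.default_rng(0)
    z_t = rng.standard_normal(200)                 # z_t ~ N(0, I)
    mu, var = E_mean(y), E_var(y)                  # encoder output for image y
    z_enc = mu + np.sqrt(var) * rng.standard_normal(mu.shape)
    # Step 1: discriminator loss  log D(x) + log(1 - D(G(z_t)))
    loss_D = np.log(D(x)) + np.log(1.0 - D(G(z_t)))
    # KL( N(mu, diag(var)) || N(0, I) ) in closed form
    kl = 0.5 * np.sum(var + mu ** 2 - 1.0 - np.log(var))
    recon = np.linalg.norm(G(z_enc) - x)           # ||G(E(y)) - x||_2
    # Step 2: encoder loss  a1 * KL + a2 * reconstruction
    loss_E = a1 * kl + a2 * recon
    # Step 3: generator loss  log(1 - D(G(z_t))) + a2 * reconstruction
    loss_G = np.log(1.0 - D(G(z_t))) + a2 * recon
    return loss_D, loss_E, loss_G
```

In an actual implementation, each loss would drive a gradient step on the corresponding network's parameters only.<br />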
<br />
= Training Details =<br />
<br />
=== Network Architecture ===<br />
<br />
===== Generator =====<br />
The generator used in 3D-GAN follows the architecture of Radford et al.'s [2016] all-convolutional network, a neural network with no fully-connected and no pooling layers. As portrayed in Figure 1, this network comprises 5 volumetric fully convolutional layers with kernels of size 4 x 4 x 4 and stride 2. Batch normalization and ReLU layers $(f(x) = \mathbb{1}(x \ge 0)(x))$ follow every layer, and the final convolution layer is appended with a Sigmoid layer. The input is a 200-dimensional vector and the output is a 64 x 64 x 64 matrix with values in [0,1].<br />
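The stated shapes can be checked with the standard transposed-convolution size formula. The per-layer padding values below (0 for the first layer, 1 afterwards) are an assumption chosen to reproduce the 1 → 64 progression:<br />

```python
def deconv_out(size, kernel=4, stride=2, padding=1):
    # output side length of a transposed (fractionally-strided) convolution
    return (size - 1) * stride - 2 * padding + kernel

# The 200-d z is reshaped to a 1x1x1 volume, then passed through
# five 4x4x4 stride-2 deconvolutions.
sizes = [1]
sizes.append(deconv_out(sizes[-1], padding=0))  # 1 -> 4
for _ in range(4):
    sizes.append(deconv_out(sizes[-1]))         # 4 -> 8 -> 16 -> 32 -> 64
print(sizes)  # [1, 4, 8, 16, 32, 64]
```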
<br />
[[File:amirhk_network_arch.png|right|650px]]<br />
<br />
===== Discriminator =====<br />
The discriminator mostly mirrors the generator. In particular, the discriminator network takes as input either generator output or real data, and tries to predict whether the input is generated or real. This network takes a 64 x 64 x 64 matrix as input and outputs a real number in [0,1]. Instead of the ReLU activation function, the discriminator has leaky ReLU layers $(f(x) = \mathbb{1}(x \lt 0)(\alpha x) + \mathbb{1}(x \ge 0)(x))$ with $\alpha$ = 0.2. Batch normalization layers and Sigmoid layers are consistent across both the generator and discriminator networks.<br />
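The two activation functions differ only in how they treat negative inputs; a direct NumPy transcription of the indicator-function definitions above:<br />

```python
import numpy as np

def relu(x):
    # f(x) = 1(x >= 0) * x, used in the generator
    return np.where(x >= 0, x, 0.0)

def leaky_relu(x, alpha=0.2):
    # f(x) = 1(x < 0) * alpha * x + 1(x >= 0) * x,
    # with alpha = 0.2 as in the 3D-GAN discriminator
    return np.where(x >= 0, x, alpha * x)
```

Unlike plain ReLU, the leaky variant passes a small gradient through negative pre-activations, which helps keep discriminator units from dying during adversarial training.<br />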
<br />
===== Image Encoder =====<br />
Finally, the image encoder in the VAE network takes as input an RGB image of size 256 x 256 x 3 and outputs a 200-dimensional vector. This network again consists of 5 spatial (not volumetric) convolutional layers with numbers of channels {64, 128, 256, 512, 400}, kernel sizes {11, 5, 5, 5, 8}, and strides {4, 2, 2, 2, 1}, respectively. ReLU and batch normalization layers are interspersed between every convolutional layer. While the output of this image encoder is 200-dimensional, the final layer outputs a 400-dimensional vector that represents a 200-dimensional Gaussian (split evenly to represent the mean and diagonal covariance). This is a common component of variational auto-encoder networks. Therefore, a final sampling layer is appended to the last convolutional layer to sample a 200-dimensional vector from the Gaussian distribution, which is later used by the 3D-GAN.<br />
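The final sampling layer can be sketched as follows; treating the second half of the 400-d output as a raw diagonal variance (rather than, say, a log-variance) is an assumption made for illustration:<br />

```python
import numpy as np

def sample_latent(enc_out, rng=None):
    """Split the encoder's 400-d output into a 200-d mean and a 200-d
    diagonal variance, then draw z ~ N(mean, diag(var))."""
    rng = rng or np.random.default_rng(0)
    mean, var = enc_out[:200], enc_out[200:]
    eps = rng.standard_normal(200)       # reparameterized sampling
    return mean + np.sqrt(var) * eps
```

With the variance forced to zero the layer simply returns the mean, which is a handy sanity check.<br />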
<br />
=== Coupled Generator-Discriminator Training ===<br />
Training GANs is tricky because, in practice, training a network to generate objects is more difficult than training a network to distinguish between real and fake samples. In other words, training the generator is harder than training the discriminator. Intuitively, it becomes difficult for the generator to extract a signal for improvement from a discriminator that is way ahead, as all examples it generates would be correctly identified as synthetic with high confidence. This problem is compounded when we deal with generated 3D objects (compared to 2D) due to the higher dimensionality. There exist different strategies to overcome this challenge, some of which we saw in class:<br />
<br />
* 1 discriminator update every N generator updates<br />
* Capped gradient updates, where only a maximum gradient is propagated back through the network for the discriminator network, essentially capping how fast it can learn<br />
<br />
The approach used in this paper is interesting in that it adaptively decides whether to update the discriminator. Here, for each batch, D is only updated if its accuracy on the last batch was at most 80%. Additionally, the generator learning rate is set to $2.5 \times 10^{-3}$ whereas the discriminator learning rate is set to $10^{-5}$. This further caps the speed of training for the discriminator relative to the generator. In fact, many such techniques are necessary when training GANs, due to the fact that the optimization problem they are designed to solve is inherently different from the intended goal of finding a Nash equilibrium in a non-convex game [Salimans et al., 2016]. Some recently proposed techniques include feature matching, minibatch discrimination, historical averaging, one-sided label smoothing, and virtual batch normalization [Salimans et al., 2016].<br />
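The adaptive rule and the asymmetric learning rates boil down to a few lines; the scaffolding around them in a real training loop is left out:<br />

```python
LR_G, LR_D = 2.5e-3, 1e-5          # generator vs. discriminator step sizes

def discriminator_update_allowed(last_batch_accuracy, threshold=0.8):
    # D is updated on the current batch only if its classification
    # accuracy on the previous batch did not exceed 80%
    return last_batch_accuracy <= threshold
```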
<br />
= Evaluation =<br />
<br />
To assess the quality of 3D-GAN and 3D-VAE-GAN, the authors perform the following set of experiments:<br />
# Qualitative results for 3D generated objects<br />
# Classification performance of learned representations w/o supervision<br />
# 3D object reconstruction from a single image<br />
# Analyzing learned representations for generator and discriminator<br />
<br />
Each of these experiments has a dedicated section below with experiment setup and results. First, we shall introduce the datasets used across these experiments.<br />
<br />
=== Datasets ===<br />
<br />
* ModelNet 10 & ModelNet 40 [Wu et al., 2016]<br />
** A comprehensive and clean collection of 3D CAD models for objects used as popular benchmark for 3D classification<br />
** List of the most common object categories in the world<br />
** 3D CAD models belonging to each object category using online search engines by querying for each object category<br />
** Manually annotated by hired human workers on Amazon Mechanical Turk, who decide whether each CAD model belongs to the specified categories<br />
** ModelNet 10 & ModelNet 40 datasets completely cleaned in-house<br />
** Orientations of CAD models in ModelNet 10 are also manually aligned<br />
<br />
[[File:amirhk_eval_1.png|right|500px]]<br />
[[File:amirhk_eval_2.png|right|500px]]<br />
[[File:amirhk_eval_3.png|right|500px]]<br />
<br />
* ShapeNet [Chang et al., 2015]<br />
** Clean 3D models and manually verified category and alignment annotations<br />
** 55 common object categories with about 51,300 unique 3D models<br />
** Collaborative effort between researchers at Princeton, Stanford and Toyota Technological Institute at Chicago (TTIC)<br />
<br />
* IKEA Dataset [Lim et al., 2013]<br />
** 1039 objects centre-cropped from 759 images<br />
** Images captured in the wild, often w/ cluttered backgrounds and occluded<br />
** 6 categories: bed, bookcase, chair, desk, sofa, table<br />
<br />
=== Experiment 1: Qualitative results for 3D generated objects ===<br />
<br />
Figure 2 shows 3D objects generated by the 3D-GAN framework. To generate these objects, a 200-dimensional vector following a uniform distribution between [0,1] is passed as input to the generator, and the largest connected component in the output of the generator is taken as the generated 3D object. One 3D-GAN is trained for each object class.<br />
<br />
Unfortunately, measures of comparison for samples generated by generative models are qualitative and subjective. Here, the authors compare samples generated by 3D-GAN with 3D objects synthesized from a probabilistic space [Wu et al., 2015], and with those generated by volumetric auto-encoders [Girdhar et al., 2016]. Generating objects with the volumetric auto-encoder requires care, since auto-encoders do not restrict the latent space. To overcome this challenge, and to generate novel samples (rather than simply copying latent variables of samples in the training set), a Gaussian is fit to the empirical distribution of the data. Samples drawn from this Gaussian act as the latent representations of samples generated using the decoder of the volumetric auto-encoder.<br />
<br />
Results in Figure 2 demonstrate that 3D-GANs are able to synthesize high-resolution 3D objects with detailed geometries, and subjective comparisons are highly in favor of the 3D objects generated by 3D-GAN. In Figure 2, the nearest neighbours of each generated object are also depicted in the two right-most columns. From this, we see that objects generated via the 3D-GAN framework are novel and do not simply copy components from samples in the training set.<br />
<br />
=== Experiment 2: Classification performance of learned representations w/o supervision ===<br />
<br />
Another experiment conducted by the authors was to understand the latent representations of the generated objects as encoded in the discriminator. A typical way to evaluate representations learned without supervision is to use them as features for classification. Therefore, for each object, the authors concatenate the responses of the second, third, and fourth convolutional layers in the discriminator, resulting in a single vector representation for a given 3D object (training sample or generated sample). A linear SVM was then used to perform classification using these object representations. Here, a single 3D-GAN was trained on seven major ShapeNet classes (chairs, sofas, tables, boats, airplanes, rifles, and cars), but was evaluated using the objects in both ModelNet 10 and ModelNet 40. These results are even more insightful given that the training and test sets are not identical and therefore show the out-of-category generalization power of 3D-GAN.<br />
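The feature-extraction step can be sketched as below; the channel counts and spatial sizes of the fake activation volumes are illustrative stand-ins, not the paper's exact discriminator shapes:<br />

```python
import numpy as np

def discriminator_features(layer_activations):
    """Concatenate the flattened responses of the 2nd, 3rd and 4th
    layers (indices 1..3 of a 0-indexed per-layer activation list)
    into one feature vector per 3D object."""
    return np.concatenate([a.ravel() for a in layer_activations[1:4]])

# five fake per-layer activation volumes with shrinking spatial size
acts = [np.zeros((c, s, s, s)) for c, s in
        [(64, 32), (128, 16), (256, 8), (512, 4), (1, 1)]]
features = discriminator_features(acts)  # one vector, ready for a linear SVM
```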
<br />
Table 1 demonstrates the superior performance of 3D-GAN compared to competing unsupervised methods, and shows performance on par with many supervised strategies. Only Multi-view CNNs, a method designed for classification (not generation) of 3D objects and augmented with ImageNet pretraining, is able to outperform 3D-GAN on discriminator-representation classification.<br />
<br />
=== Experiment 3: 3D object reconstruction from a single image ===<br />
<br />
Following previous work [Girdhar et al., 2016], the performance of 3D-VAE-GAN was evaluated on the IKEA dataset to demonstrate how it performs for single-image 3D reconstruction. The results in Figure 7 and Table 2 show the performance of both a single 3D-VAE-GAN jointly trained on all 6 IKEA object categories, and six 3D-VAE-GANs independently trained on each category. To evaluate the performance of the models across different image setups, a 3D object was generated for permutations, flips, and translational alignments (up to 10%) of an input image. The average of the generated 3D objects was then compared to the 3D ground truth for the 2D image.<br />
<br />
The results in this section show that 3D-VAE-GAN consistently outperforms the previous state-of-the-art method for voxel-level prediction.<br />
<br />
=== Experiment 4: Analyzing learned representations for generator and discriminator ===<br />
<br />
[[File:amirhk_representations_gen_1.png|right|500px]]<br />
[[File:amirhk_representations_gen_2.png|right|500px]]<br />
[[File:amirhk_representations_gen_3.png|right|650px]]<br />
[[File:amirhk_representations_disc.png|right|650px]]<br />
<br />
In this section we explore the learned representations of the generator and discriminator in a trained 3D-GAN. Starting with a 200-dimensional vector as input, the generator neurons fire to generate a 3D object, consequently leading to the firing of neurons in the discriminator, which produces a confidence value in [0,1]. To understand the latent space of vectors for object generation, we first vary the intensity of each dimension in the latent vector and observe the effect on the generated 3D objects. In Figure 5, each red region marks the voxels affected by changing the value of a particular dimension of the latent vector. It can be seen that semantic meaning, such as the width and thickness of surfaces, is encoded in each of these dimensions.<br />
<br />
Next, we explore intra-class and inter-class object metamorphosis by interpolating between latent vector representation of a source and target 3D object. In Figure 6, we see a smooth transition exists for various types of chairs (w/ and w/o arm rests, and with varying backrest), as well as for a smooth transition between race car and speedboat.<br />
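The interpolation itself is just a convex combination in the 200-d latent space; each intermediate vector is then decoded by the generator (a sketch, with the generator left abstract):<br />

```python
import numpy as np

def interpolate_latents(z_src, z_tgt, steps=6):
    """Linearly interpolate between two latent vectors; decoding each
    row with G yields a metamorphosis sequence as in Figure 6."""
    t = np.linspace(0.0, 1.0, steps)[:, None]
    return (1.0 - t) * z_src + t * z_tgt
```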
<br />
Next, as is common in generative model evaluations, a simple arithmetic scheme is tested on the latent vector representations of 3D objects. Figure 8 shows that not only are generative networks able to encode semantic knowledge of chair and face images in their latent space, but these learned representations also behave consistently: simple arithmetic on latent vector representations accords with intuition.<br />
<br />
Finally, the authors explore the neurons in the discriminator. In order to understand firing patterns for specific neurons, the authors iterate through all training objects while keeping track of those samples that result in the highest firing intensity of a specific neuron. Here the neurons in the second-to-last convolutional layers were considered. From Figure 9, we conclude that neurons are selective: for a single neuron, the objects producing strongest activations are similar, and neurons learn semantics: the object parts that activate the neuron the most are consistent across objects.<br />
<br />
== Resources ==<br />
The authors provide some supplementary resources for their proposed novel methodology. We describe these supplementary resources in this section. <br />
<br />
Pre-trained models and sampling code for 3-D GAN can be found in the following git repository: https://github.com/zck119/3dgan-release. Implementations are done using Torch 7.<br />
<br />
= Summary of Contributions =<br />
<br />
In this work, we have presented a novel approach to 3D object generation. We described 3D-GANs, showed their architecture, discussed loss functions, dove into the intricacies of the training process, and demonstrated their ability to generate realistic and novel high-resolution 3D objects. Furthermore, we reviewed the performance of 3D-GANs in producing feature vectors for object recognition and showed how the features learned by the discriminator outperform all unsupervised methods, and are competitive with many supervised strategies. We extended 3D-GANs to 3D-VAE-GANs and learned a mapping from 2D images to their corresponding 3D objects. Using 3D-VAE-GANs, the authors were able to reconstruct 3D objects from a single image with far greater accuracy than previous methods. Finally, the neurons in the learned networks were analyzed, and it was shown that they learn disentangled features and fire selectively for different objects while learning the semantics of the objects they fire for.<br />
<br />
<br />
<br />
= References =<br />
<br />
# Girdhar, Rohit, et al. ''Learning a predictable and generative vector representation for objects.'' European Conference on Computer Vision. Springer International Publishing, 2016.<br />
# Wu, Jiajun, et al. ''Single image 3d interpreter network.'' European Conference on Computer Vision. Springer International Publishing, 2016.<br />
# Wu, Zhirong, et al. ''3d shapenets: A deep representation for volumetric shapes.'' Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.<br />
# Chang, Angel X., et al. ''Shapenet: An information-rich 3d model repository.'' arXiv preprint arXiv:1512.03012, 2015.<br />
# Larsen, Anders Boesen Lindbo, et al. ''Autoencoding beyond pixels using a learned similarity metric.'' arXiv preprint arXiv:1512.09300, 2015.<br />
# Lim, Joseph J., Hamed Pirsiavash, and Antonio Torralba. ''Parsing ikea objects: Fine pose estimation.'' Proceedings of the IEEE International Conference on Computer Vision, 2013.<br />
# Good explanation of Coupled GAN: https://wiseodd.github.io/techblog/2017/02/18/coupled_gan/<br />
# 2 Minute Video Summary: https://www.youtube.com/watch?v=HO1LYJb818Q<br />
# Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., & Chen, X. (2016). Improved techniques for training gans. In Advances in Neural Information Processing Systems (pp. 2234-2242).<br />
# Oord, A. van den, Kalchbrenner, N., & Kavukcuoglu, K. (2016). Pixel Recurrent Neural Networks. arXiv:1601.06759 [Cs]. Retrieved from http://arxiv.org/abs/1601.06759<br />
# Karpathy, A., Abbeel, P., Brockman, G., Chen, P., Cheung, V., Duan, R., … Zaremba, W. (2016, June 16). Generative Models. Retrieved October 20, 2017, from https://blog.openai.com/generative-models/</div>Jdenghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F17/_Improved_Variational_Inference_with_Inverse_Autoregressive_Flow&diff=28481STAT946F17/ Improved Variational Inference with Inverse Autoregressive Flow2017-10-24T20:57:50Z<p>Jdeng: /* Black-box Variational Inference */</p>
<hr />
<div>==Introduction==<br />
<br />
One of the most common ways to formalize machine learning models is through the use of $\textbf{latent variable models}$, wherein we have a probabilistic model for the joint distribution between observed datapoints $x$ and some $\textit{hidden variables}$. The intuition is that the hidden variables share some sort of (perhaps indirect) causal relationship with the variables that are actually observed. The $\textbf{mixture of Gaussians}$ provides a particularly nice example of a latent variable model. One way to think about a mixture of $K$ Gaussians is as follows. First, roll a $K$-sided die and suppose that the result is $k$ with probability $\pi_{k}$. Then randomly generate a point from the Gaussian distribution with parameters $\mu_{k}$ and $\Sigma_{k}$. The reason this is a hidden variable model is that, when we have a dataset coming from a mixture of Gaussians, we only get to see the datapoints that are generated at the end. For a given observed datapoint, we neither get to see the die roll that generated that point nor do we know the probabilities $\pi_{k}$. The $\pi_{k}$ are therefore hidden variables and, together with estimation of the parameters $\mu_{k}$, $\Sigma_{k}$ determining observations, estimating the $\pi_{k}$ constitutes inference within the mixture of Gaussians model. Note that all the parameters to be estimated can be wrapped into a long vector $\theta = (\pi_{1}, \ldots, \pi_{K}, \mu_{1}, \Sigma_{1}, \ldots, \mu_{K}, \Sigma_{K})$.<br />
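The two-stage generative process just described is easy to write down explicitly; a minimal 1-D sketch:<br />

```python
import numpy as np

def sample_mixture(pi, mu, sigma, n, seed=0):
    """Mixture-of-Gaussians generative process: roll a K-sided die with
    probabilities pi (the hidden variable k), then draw the observation
    from N(mu[k], sigma[k]**2).  Only x would be observed in practice."""
    rng = np.random.default_rng(seed)
    k = rng.choice(len(pi), size=n, p=np.asarray(pi))
    x = rng.normal(np.asarray(mu)[k], np.asarray(sigma)[k])
    return x, k
```

Inference then amounts to recovering the $\pi_k$, $\mu_k$, $\Sigma_k$ from the $x$ values alone, without ever seeing $k$.<br />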
<br />
More generally, latent variable models provide a powerful framework to mathematically encode a variety of phenomena which are naturally subject to stochasticity. Thus, they form an important part of the theory underlying many machine learning models. Indeed, it can even be said that most machine learning models, when viewed appropriately, are latent variable models. Therefore, it behoves us to obtain general methods which allow tractable inference within latent variable models. One such method is known as $\textbf{variational inference}$ and, in its modern form, it was introduced to machine learning around two decades ago in the seminal paper [jordanVI]. More recently, and more apropos of deep learning, stochastic versions of variational inference are being combined with neural networks to provide robust estimation of parameters in probabilistic models. The original impetus for this fusion apparently stems from the publication of [autoencoderKingma] and [autoencoderRezende]. In the interim, a cottage industry for the application of stochastic variational inference and related methods has seemingly sprung up, especially as witnessed by the variety of autoencoders currently being sold at the bazaar. The paper [946paper] represents another interesting contribution in parameter estimation by way of deep learning. Note that, at time of writing, variational methods are being applied to a wide range of problems in machine learning, and we will only develop the small part of the theory necessary for our purposes; refer to [VISurvey] for a survey.<br />
<br />
==Black-box Variational Inference==<br />
<br />
The basic premise we start from is that we have a latent variable model $p_{\theta}(x, h)$, often called the \textbf{generative model} in the literature, with $x$ the observed variables and $h$ the hidden variables, and we wish to learn the parameters $\theta$. We also assume we are in a situation where the usual strategy of inference by maximum likelihood estimation is infeasible due to intractability of marginalization of the hidden variables. This assumption often holds in real-world applications since generative models for real phenomena are extremely difficult or impossible to integrate. Additionally, we would like to be able to compute the posterior $p(h\mid x)$ over hidden variables and, by Bayes' rule, this requires computation of the marginal distribution $p_{\theta}(x)$. <br />
<br />
The variational inference approach entails positing a parametric family $q_{\phi}(h\mid x)$, also called the \textbf{inference model}, of distributions and introducing new learning parameters $\phi$ which obtain as solutions to an optimization problem. More precisely, we minimize the KL divergence between the true posterior and the approximate posterior. However, we can think of a slightly indirect approach. We can find a generic lower bound for the log-likelihood $\log p_{\theta}(x)$ and optimize for this lower bound. Observe that, for any parametrized distribution $q_{\phi}(h\mid x)$, we have<br />
\begin{align*}<br />
\log p_{\theta}(x) &= \log\int_{h}p_{\theta}(x,h) \\<br />
&= \log\int_{h}p_{\theta}(x,h)\frac{q_{\phi}(h\mid x)}{q_{\phi}(h\mid x)} \\<br />
&= \log\int_{h}\frac{p_{\theta}(x,h)}{q_{\phi}(h\mid x)}q_{\phi}(h\mid x) \\<br />
&= \log\mathbb{E}_{q}\left[\frac{p_{\theta}(x,h)}{q_{\phi}(h\mid x)}\right] \\<br />
&\geq \mathbb{E}_{q}\left[\log\frac{p_{\theta}(x,h)}{q_{\phi}(h\mid x)}\right] \\<br />
&= \mathbb{E}_{q}\left[\log p_{\theta}(x,h) - \log q_{\phi}(h\mid x)\right] \\<br />
&:= \mathcal{L}(x, \theta, \phi),<br />
\end{align*}<br />
where the inequality is an application of Jensen's inequality for the logarithm function and $\mathcal{L}(x,\theta,\phi)$ is known as the \textbf{evidence lower bound (ELBO)}. Clearly, if we iteratively choose values for $\theta$ and $\phi$ such that $\mathcal{L}(x,\theta,\phi)$ increases, then we will have found values for $\theta$ such that the log-likelihood $\log p_{\theta}(x)$ is non-decreasing (that is, there is no guarantee that a value for $\theta$ which increases $\mathcal{L}(x,\theta,\phi)$ will also increase $\log p_{\theta}(x)$ but there \textit{is} a guarantee that $\log p_{\theta}(x)$ will not decrease). The natural search strategy now is to use stochastic gradient ascent on $\mathcal{L}(x,\theta,\phi)$. This requires the derivatives $\nabla_{\theta}\mathcal{L}(x,\theta,\phi)$ and $\nabla_{\phi}\mathcal{L}(x,\theta,\phi)$. <br />
<br />
Before moving on, we note that there are alternative ways of expressing the ELBO which can either provide insights or aid us in further calculations. For one alternative form, note that we can massage the ELBO like so.<br />
\begin{align*}<br />
\mathcal{L}(x, \theta, \phi) = & \mathbb{E}_{q} \left[ \log p_{\theta}(x, h) - \log q_{\phi}(h \mid x) \right] \\<br />
= & \mathbb{E}_{q} \left[ \log p_{\theta}(h \mid x) p(x) - \log q_{\phi}(h \mid x) \right] \\<br />
= & \mathbb{E}_{q} \left[ \log p_{\theta}(h \mid x) + \log p(x) - \log q_{\phi}(h \mid x) \right] \\<br />
= & \mathbb{E}_{q} \left[ \log p(x) - \log q_{\phi}(h \mid x) + \log p_{\theta}(h \mid x) \right] \\<br />
= & \mathbb{E}_{q} \left[ \log p(x) \right] - \mathbb{E}_{q} \left[ \log q_{\phi}(h \mid x) - \log p_{\theta}(h \mid x) \right] \\<br />
= & \log p(x) - \mathrm{KL}\left[ q_{\phi}(h \mid x) || p(h \mid x) \right].<br />
\end{align*}<br />
The last expression has a very simple interpretation: maximizing $\mathcal{L}(x, \theta, \phi)$ is equivalent to minimizing the KL divergence between the approximate posterior $q_{\phi}$ and the actual posterior $p_{\theta}(h \mid x)$. In fact, we can rewrite the above equation as a ``conservation law"<br />
\[<br />
\mathcal{L}(x, \theta, \phi) + \mathrm{KL} \left[ q_{\phi}(h \mid x) || p(h \mid x) \right] = \log p(x).<br />
\]<br />
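This ``conservation law" is easy to verify numerically in a toy model with one fixed $x$ and a discrete hidden variable, which also serves as a check on the ELBO derivation above:<br />

```python
import numpy as np

# Joint p(x, h) for one fixed x and three hidden states h = 0, 1, 2.
p_xh = np.array([0.10, 0.05, 0.15])
p_x = p_xh.sum()                       # marginal p(x)
posterior = p_xh / p_x                 # true posterior p(h | x)
q = np.array([0.5, 0.2, 0.3])          # an arbitrary approximate posterior

elbo = np.sum(q * (np.log(p_xh) - np.log(q)))
kl = np.sum(q * (np.log(q) - np.log(posterior)))

# ELBO + KL(q || p(h|x)) = log p(x), for any q
assert np.isclose(elbo + kl, np.log(p_x))
```

Since the KL term is non-negative, the check also confirms that the ELBO never exceeds $\log p(x)$.<br />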
On the other hand, we can also do<br />
\begin{align*}<br />
\mathcal{L}(x, \theta, \phi) = & \mathbb{E}_{q} \left[ \log p_{\theta}(x, h) - \log q_{\phi}(h \mid x) \right] \\<br />
= & \mathbb{E}_{q} \left[ \log p_{\theta}(x \mid h) p_{\theta}(h) - \log q_{\phi}(h \mid x) \right] \\<br />
= & \mathbb{E}_{q} \left[ \log p_{\theta}(x \mid h) + \log p_{\theta}(h) - \log q_{\phi}(h \mid x) \right] \\<br />
= & \mathbb{E}_{q} \left[ \log p_{\theta}(x \mid h) - \log q_{\phi}(h \mid x) + \log p_{\theta}(h) \right] \\<br />
= & \mathbb{E}_{q} \left[ \log p_{\theta}(x \mid h) \right] - \mathrm{KL}\left[ q_{\phi}(h \mid x) || p_{\theta}(h) \right].<br />
\end{align*} <br />
and the hermeneutics here is a bit more interesting. Recall that $q_{\phi}(h \mid x)$ is a distribution we get to choose and choosing a ``good" distribution means choosing something which we believe is faithful to the way observations get ``encoded" or ``compressed" into ``hidden representations". Conversely, $p_{\theta}(x \mid h)$ may be thought as a ``decoder" which unpacks latent ``codes" into observations. Thus, we can think of $\mathbb{E}_{q} \left[ \log p_{\theta}(x \mid h) \right]$ as the expected reconstruction error when we use $q_{\phi}$ as an encoder. The KL term is now interpreted as a regularizer which restricts divergence of the encoder from the prior distribution over latent codes. Note that these remarks simply provide an intuition and even though we use descriptors such as ``encoder" and ``decoder", there is no \textit{a priori} reason to implement the distributions involved as encoder and decoder networks as in an autoencoder. Indeed, there is nothing preventing us from even letting $q_{\phi}$ compute an ``overcomplete" feature representation $h$ of $x$ (i.e., dimensionality of $h$ is greater than that of $x$).<br />
<br />
Regardless of which ELBO we use, inference requires the gradients of $\mathcal{L}(x, \theta, \phi)$. Notice that, no matter what, there are expectations with respect to $q_{\phi}$ involved and the presence of these expectations persists into the gradients. As an example, let us compute the gradients with the ELBO written as <br />
\[<br />
\mathcal{L}(x, \theta, \phi) = \mathbb{E}_{q}\left[\log p_{\theta}(x,h) - \log q_{\phi}(h\mid x)\right].<br />
\]<br />
The gradient with respect to $\theta$ is easy.<br />
\begin{align*}<br />
\nabla_{\theta}\mathcal{L}(x,\theta,\phi) &= \nabla_{\theta}\mathbb{E}_{q}\left[\log p_{\theta}(x,h) - \log q_{\phi}(h\mid x)\right] \\<br />
&= \nabla_{\theta}\mathbb{E}_{q}\left[\log p_{\theta}(x,h)\right] - \nabla_{\theta}\mathbb{E}_{q}\left[\log q_{\phi}(h\mid x)\right] \\<br />
&= \nabla_{\theta}\int_{h}\left[q_{\phi}(h\mid x)\log p_{\theta}(x,h)\right] \\<br />
&= \int_{h}q_{\phi}(h\mid x)\nabla_{\theta}\log p_{\theta}(x,h) \\<br />
&= \mathbb{E}_{q}\left[\nabla_{\theta}\log p_{\theta}(x,h)\right].<br />
\end{align*} <br />
For the derivative with respect to the variational parameters $\phi$, we are going to exploit the identities $$\int_{h}\nabla_{\phi} q_{\phi}(h\mid x)=\nabla_{\phi}\int_{h}q_{\phi}(h\mid x)=\nabla_{\phi}1=0$$ and $$q_{\phi}(h\mid x)\nabla_{\phi}\log q_{\phi}(h\mid x)=\nabla_{\phi}q_{\phi}(h\mid x).$$ Note that the second identity will be used in a ``backwards" direction toward the end of the derivation below. We now have<br />
\begin{align*}<br />
\nabla_{\phi}\mathcal{L}(x, \theta, \phi) &= \nabla_{\phi}\mathbb{E}_{q}\left[\log p_{\theta}(x,h) -\log q_{\phi}(h\mid x)\right] \\<br />
&= \nabla_{\phi}\mathbb{E}_{q}\left[\log p_{\theta}(x,h)\right] - \nabla_{\phi}\mathbb{E}_{q}\left[\log q_{\phi}(h\mid x)\right] \\<br />
&= \nabla_{\phi}\int_{h}q_{\phi}(h\mid x)\log p_{\theta}(x,h) - \nabla_{\phi}\int_{h}q_{\phi}(h\mid x)\log q_{\phi}(h\mid x) \\<br />
&= \int_{h}\log p_{\theta}(x,h)\nabla_{\phi}q_{\phi}(h\mid x) - \int_{h}\nabla_{\phi}q_{\phi}(h\mid x)\log q_{\phi}(h\mid x) \\<br />
&= \int_{h}\log p_{\theta}(x,h)\nabla_{\phi}q_{\phi}(h\mid x) - \int_{h}\left(q_{\phi}(h\mid x)\frac{\nabla_{\phi}q_{\phi}(h\mid x)}{q_{\phi}(h\mid x)} + \log q_{\phi}(h\mid x)\nabla_{\phi}q_{\phi}(h\mid x) \right) \\<br />
&= \int_{h}\log p_{\theta}(x,h)\nabla_{\phi}q_{\phi}(h\mid x) - \int_{h}\left(\nabla_{\phi}q_{\phi}(h\mid x) + \log q_{\phi}(h\mid x)\nabla_{\phi}q_{\phi}(h\mid x) \right) \\<br />
&= \int_{h}\log p_{\theta}(x,h)\nabla_{\phi}q_{\phi}(h\mid x) - \int_{h}\nabla_{\phi}q_{\phi}(h\mid x) - \int_{h}\log q_{\phi}(h\mid x)\nabla_{\phi}q_{\phi}(h\mid x) \\<br />
&= \int_{h}\log p_{\theta}(x,h)\nabla_{\phi}q_{\phi}(h\mid x) - 0 - \int_{h}\log q_{\phi}(h\mid x)\nabla_{\phi}q_{\phi}(h\mid x) \\<br />
&= \int_{h}\left(\log p_{\theta}(x,h)\nabla_{\phi}q_{\phi}(h\mid x) - \log q_{\phi}(h\mid x)\nabla_{\phi}q_{\phi}(h\mid x) \right) \\<br />
&= \int_{h}\left(\log p_{\theta}(x,h) - \log q_{\phi}(h\mid x)\right)\nabla_{\phi}q_{\phi}(h\mid x) \\<br />
&= \int_{h}\left(\log p_{\theta}(x,h) - \log q_{\phi}(h\mid x)\right)q_{\phi}(h\mid x)\nabla_{\phi}\log q_{\phi}(h\mid x) \\<br />
&= \mathbb{E}_{q}\left[\left(\log p_{\theta}(x,h) - \log q_{\phi}(h\mid x)\right)\nabla_{\phi}\log q_{\phi}(h\mid x)\right].<br />
\end{align*}<br />
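The final expression is precisely the score-function (REINFORCE) estimator: sample $h \sim q_{\phi}$ and average $\left(\log p_{\theta}(x,h) - \log q_{\phi}(h\mid x)\right)\nabla_{\phi}\log q_{\phi}(h\mid x)$ to obtain an unbiased Monte Carlo estimate of the gradient. As a minimal sketch, the following numpy snippet compares this estimator against the exact gradient on a toy model with a single Bernoulli latent variable (the model and all of its numbers are illustrative, not taken from the papers under discussion).<br />

```python
import numpy as np

# Toy latent-variable model (illustrative, not from the paper):
#   h ~ Bernoulli(0.5),  x | h ~ N(mu_h, 1) with mu_0 = -2, mu_1 = +2.
# Approximate posterior: q_phi(h=1|x) = sigmoid(phi), a single free parameter.

def log_p_joint(x, h):
    mu = np.where(h == 1, 2.0, -2.0)
    log_prior = np.log(0.5)
    log_lik = -0.5 * np.log(2 * np.pi) - 0.5 * (x - mu) ** 2
    return log_prior + log_lik

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elbo_grad_phi(x, phi, n_samples=100000, seed=0):
    """Score-function (REINFORCE) estimator of d ELBO / d phi."""
    rng = np.random.default_rng(seed)
    p1 = sigmoid(phi)
    h = (rng.random(n_samples) < p1).astype(float)   # h ~ q_phi
    log_q = h * np.log(p1) + (1 - h) * np.log(1 - p1)
    score = h - p1        # d log q / d phi for a Bernoulli with logit phi
    weight = log_p_joint(x, h) - log_q
    return np.mean(weight * score)

def elbo_grad_exact(x, phi, eps=1e-6):
    """Exact gradient via finite differences of the analytic ELBO."""
    def elbo(p):
        p1 = sigmoid(p)
        return (p1 * (log_p_joint(x, 1.0) - np.log(p1))
                + (1 - p1) * (log_p_joint(x, 0.0) - np.log(1 - p1)))
    return (elbo(phi + eps) - elbo(phi - eps)) / (2 * eps)

x_obs, phi = 1.5, 0.3
g_mc = elbo_grad_phi(x_obs, phi)
g_exact = elbo_grad_exact(x_obs, phi)
print(g_mc, g_exact)
```

With $10^{5}$ samples the estimate matches the analytic gradient to a couple of decimal places; the notoriously high variance of the score-function estimator becomes apparent at smaller sample sizes.<br />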
<br />
Observe that everything we have done so far is completely general and independent of any specific modelling assumptions we may have had to make. Indeed, it is the model independence of this approach which led Ranganath et al. <ref name="bbvi"> Rajesh Ranganath, Sean Gerrish and David M. Blei. Black Box Variational Inference. Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, {AISTATS} 2014, Reykjavik, Iceland, April 22-25, 2014</ref> to christen it \textbf{black-box variational inference}. The price we pay for such generality is that we have to calculate expectations against the distribution $q_{\phi}$. Broadly speaking, such expectations are intractable integrals for any approximate posterior flexible enough to resemble the true posterior. However, the notion of \textbf{normalizing flow} represents a technical innovation which allows use of flexible posteriors while maintaining tractability, because expectations are calculated only against simple distributions (e.g., Gaussians).<br />
<br />
The black-box variational inference framework actually subsumes the expectation-maximization (EM) algorithm, an iterative method for finding maximum likelihood or maximum a posteriori estimates in statistical models with latent variables. It has been successfully applied to problems such as Gaussian mixture models. <br />
<br />
Recall that the ELBO can be written as <br />
\[<br />
\mathcal{L}(x,\theta,\phi) = \log p_{\theta}(x) - \mathrm{KL}[q_{\phi}(h \mid x) || p_{\theta}(h \mid x)].<br />
\]<br />
In the expectation (E) step, we set $q_{\phi_k}(h\mid x) = p_{\theta_k}(h\mid x)$, which makes the KL divergence vanish for the current estimate $\theta_k$. With $q_{\phi_k}$ held fixed, the ELBO becomes <br />
\[<br />
\mathcal{L}(x, \theta,\phi_k) = \mathbb{E}_{q_{\phi_k}} [\log p_{\theta}(x, h)] + \mathrm{const},<br />
\]<br />
where the constant (the entropy of $q_{\phi_k}$) does not depend on $\theta$. The maximization (M) step then updates<br />
\[<br />
\theta_{k+1} = \underset{\theta}{\mathrm{arg\,max}}\; \mathbb{E}_{q_{\phi_k}}[\log p_{\theta}(x, h)].<br />
\]<br />
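The E and M steps above can be sketched concretely for a two-component Gaussian mixture, where the component means play the role of $\theta$ and the responsibilities computed in the E step play the role of $q_{\phi_k}$ (variances and mixture weights are held fixed purely to keep the sketch short; these simplifications are ours, not the paper's).<br />

```python
import numpy as np

# EM for a two-component 1-D Gaussian mixture with known unit variances
# and mixture weights fixed at 0.5, so the M step reduces to weighted means.

def em_gmm(x, mu_init, n_iter=50):
    mu = np.array(mu_init, dtype=float)              # component means (theta)
    for _ in range(n_iter):
        # E step: q(h=k|x) = posterior responsibilities under current theta.
        log_r = -0.5 * (x[:, None] - mu[None, :]) ** 2
        r = np.exp(log_r - log_r.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # M step: maximize E_q[log p_theta(x, h)] over the means.
        mu = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)
    return mu

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-3, 1, 500), rng.normal(3, 1, 500)])
mu_hat = em_gmm(x, mu_init=[-1.0, 1.0])
print(mu_hat)  # close to the true means -3 and 3
```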
<br />
==Variational Inference using Normalizing Flows==<br />
<br />
While our main goal is to describe \cite{946paper}, the paper \cite{normalizing_flow} provides the necessary conceptual backdrop for \cite{946paper} and we will now take a detour through some of the points presented in \cite{normalizing_flow}. The main contribution of \cite{normalizing_flow} lies in a novel technique for creating a rich class of approximate posteriors starting from relatively simple ones. This is important since one of the main drawbacks to the variational approach is that it requires assumptions on the form of the approximate posterior $q_{\phi}(h\mid x)$ and practicality often forces us to stick to simple distributions which fail to capture rich, multimodal properties of the true posterior $p(h \mid x)$. The primary technical tool used in \cite{normalizing_flow} to achieve complexity in the approximate posterior is what we earlier referred to as normalizing flow, which entails using a series of invertible functions to transform simple probability densities into more complex densities. <br />
<br />
Suppose we have a random variable $h$ with probability density $q(h)$ and that $f:\mathbb{R}^{d}\to\mathbb{R}^{d}$ is an invertible function with inverse $g:\mathbb{R}^{d}\to\mathbb{R}^{d}$. A basic result in probability states that the random variable $h'\colon = f(h)$ has distribution $$q'(h') = q(h)\Bigg\lvert\det\frac{\partial f}{\partial h}\Bigg\rvert^{-1}.$$ Chaining together a (finite) sequence of invertible maps $f_{1},\ldots,f_{K}$ and applying it to the distribution $q_{0}$ of a random variable $h_{0}$ leads to the formula $$q_{K}(h_{K}) = q_{0}(h_{0})\prod\limits^{K}_{k=1}\Bigg\lvert\det\frac{\partial f_{k}}{\partial h_{k-1}}\Bigg\rvert^{-1},$$ where $h_{k}\colon = f_{k}(h_{k-1})$ and $q_{k}$ is the distribution associated to $h_{k}$. We can equivalently rewrite the above equation as $$\log q_{K}(h_{K}) = \log q_{0}(h_{0}) - \sum\limits^{K}_{k=1}\log\Bigg\lvert\det\frac{\partial f_{k}}{\partial h_{k-1}}\Bigg\rvert.$$ So, if we were to start from a simple distribution $q_{0}(h_{0})$, choose a sequence of functions $f_{1},\ldots,f_{K}$ and then \textit{define} $$q_{\phi}(h \mid x)\colon = q_{K}(h_{K}),$$ we can manipulate the ELBO as follows:<br />
\begin{align*}<br />
\mathcal{L}(x,\theta,\phi) &= \mathbb{E}_{q}\left[\log p_{\theta}(x,h) - \log q_{\phi}(h\mid x)\right] \\<br />
&= \mathbb{E}_{q_{K}}\left[\log p_{\theta}(x,h) - \log q_{K}(h_{K})\right] \\<br />
&= \mathbb{E}_{q_{K}}\left[\log p_{\theta}(x,h)\right] - \mathbb{E}_{q_{K}}\left[\log q_{K}(h_{K})\right].<br />
\end{align*} <br />
The reason this is a useful thing to do is the \textbf{law of the unconscious statistician (LOTUS)} as applied to the pushforward $q_{K}$ of $q_{0}$ under $f_{K}\circ\cdots\circ f_{1}$: $$\mathbb{E}_{q_{K}}\left[s(h_{K})\right] = \mathbb{E}_{q_{0}}\left[s(f_{K}\circ\cdots\circ f_{1}(h_{0}))\right]$$ assuming that $s$ does not depend on $q_{K}$. Hence,<br />
\begin{align*}<br />
\mathcal{L}(x,\theta,\phi) &= \mathbb{E}_{q_{K}}\left[\log p_{\theta}(x,h)\right] - \mathbb{E}_{q_{K}}\left[\log q_{K}(h_{K})\right] \\<br />
&= \mathbb{E}_{q_{K}}\left[\log p_{\theta}(x,h)\right] - \mathbb{E}_{q_{K}}\left[\log q_{0}(h_{0}) - \sum\limits^{K}_{k=1}\log\Bigg\lvert\det\frac{\partial f_{k}}{\partial h_{k-1}}\Bigg\rvert\right] \\<br />
&= \mathbb{E}_{q_{K}}\left[\log p_{\theta}(x,h)\right] - \mathbb{E}_{q_{K}}\left[\log q_{0}(h_{0})\right] - \mathbb{E}_{q_{K}}\left[\sum\limits^{K}_{k=1}\log\Bigg\lvert\det\frac{\partial f_{k}}{\partial h_{k-1}}\Bigg\rvert\right] \\<br />
&= \mathbb{E}_{q_{0}}\left[\log p_{\theta}(x,h)\right] - \mathbb{E}_{q_{0}}\left[\log q_{0}(h_{0})\right] - \mathbb{E}_{q_{0}}\left[\sum\limits^{K}_{k=1}\log\Bigg\lvert\det\frac{\partial f_{k}}{\partial h_{k-1}}\Bigg\rvert\right] <br />
\end{align*}<br />
and we are only computing expectations with respect to the simple distribution $q_{0}$. The latter expression for ELBO is called the \textbf{flow-based free energy bound} in \cite{normalizing_flow}. Note that $\phi$ has apparently disappeared in the final expression even though it is still present in $\mathcal{L}(x,\theta,\phi)$. This is an illusion: the parameters $\phi$ are associated to $q_{\phi}(h\mid x) = q_{K}(h_{K})$ and $q_{K}(h_{K})$ depends on the quantities $q_{0}(h_{0})$ and $\frac{\partial f_{k}}{\partial h_{k-1}}$. Thus, $\phi$ now encapsulates the defining parameters of $q_{0}$ and the $f_{k}$. <br />
<br />
To summarize, we can start with a very simple distribution $q_{0}$ against which expectations are easy to calculate and if we can cleverly choose a series of invertible functions $\{f_{k}\}^{K}_{k=1}$ for which it is easy to compute determinants of the Jacobians, we can get a relatively rich approximate posterior $q_{K}$ such that the ELBO and its gradients are tractable. We should now ask: what is a suitable family of functions which can serve as a normalizing flow?<br />
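As a sanity check on the telescoping log-density formula, the following numpy sketch pushes a standard Gaussian through a chain of element-wise affine maps $f_{k}(h) = a_{k}\odot h + b_{k}$ (an intentionally simple flow of our own choosing, picked because the resulting density is known in closed form) and confirms that the flow formula reproduces the exact log-density.<br />

```python
import numpy as np

# Verify log q_K(h_K) = log q_0(h_0) - sum_k log|det df_k/dh_{k-1}|
# for a chain of invertible element-wise affine maps applied to q_0 = N(0, I).
# The scales a_k and shifts b_k are arbitrary illustrative values.

rng = np.random.default_rng(0)
D, K = 3, 4
a = rng.uniform(0.5, 2.0, size=(K, D))   # positive scales => invertible
b = rng.normal(size=(K, D))              # shifts

h0 = rng.normal(size=D)
log_q = -0.5 * (h0 ** 2).sum() - 0.5 * D * np.log(2 * np.pi)  # log q_0(h_0)

h = h0
for k in range(K):
    h = a[k] * h + b[k]                  # h_k = f_k(h_{k-1})
    log_q -= np.log(np.abs(a[k])).sum()  # subtract log|det Jacobian| = sum log a_k

# The chain of affine maps is itself affine: h_K = A*h_0 + B with
# A = prod_k a_k (element-wise), so h_K ~ N(B, diag(A^2)) exactly.
A = a.prod(axis=0)
B = np.zeros(D)
for k in range(K):
    B = a[k] * B + b[k]
log_q_exact = (-0.5 * ((h - B) / A) ** 2 - np.log(np.abs(A))
               - 0.5 * np.log(2 * np.pi)).sum()
print(log_q, log_q_exact)  # the two agree
```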
<br />
==Inverse Autoregressive Flow==<br />
<br />
The answer presented by \cite{946paper} to the last question is<br />
\[<br />
f_{k}(h_{k-1}) := \mu_{k} + \sigma_{k} \odot h_{k-1},<br />
\]<br />
where $\odot$ means element-wise multiplication, $\mu_{k}$, $\sigma_{k}$ are outputs from an autoregressive neural network with inputs $h_{k-1}$ and an extra constant vector $c$ and we initialize with<br />
\[<br />
h_{0} := \mu_{0} + \sigma_{0} \odot \epsilon<br />
\]<br />
such that $\epsilon \!\sim \mathcal{N}(0, I)$. The authors of \cite{946paper} call this series of functions the \textbf{inverse autoregressive flow}. We will solve the mystery of where this definition comes from later. The important point is that the functional form of $f_{k}$ is parametrized by the outputs of an autoregressive neural network and this implies that the Jacobians<br />
\[<br />
\frac{\partial \mu_{k}}{\partial h_{k-1}}, \frac{\partial \sigma_{k}}{\partial h_{k-1}}<br />
\]<br />
are lower triangular with zeroes on the diagonal (this is not a trivial fact since $\mu_{k}$ and $\sigma_{k}$ are some complicated functions of $h_{k-1}$ -- they are outputs of a neural network which takes $h_{k-1}$ as an input). Hence, the derivative<br />
\[<br />
\frac{\partial f_{k}}{\partial h_{k-1}}<br />
\]<br />
is a lower triangular matrix with the entries of $\sigma_{k}$ occupying the diagonal. The determinant of this matrix is simply<br />
\[<br />
\prod\limits^{D}_{i=1} \sigma_{k, i},<br />
\]<br />
and it is very cheap to compute. Note also that the approximate posterior that comes out of this normalizing flow is<br />
\[<br />
\log q_{K}(h_{K}) = - \sum\limits^{D}_{i=1} \left[ \frac{1}{2} \epsilon_{i}^{2} + \frac{1}{2} \log 2\pi + \sum\limits^{K}_{k=1} \log \sigma_{k,i} \right].<br />
\]<br />
In conclusion, we have a nice expression for the ELBO. To round out the parsimony of the ELBO, we need an inference model which is computationally cheap to evaluate. Additionally, we typically do not calculate the ELBO gradients analytically but instead perform Monte Carlo estimation by sampling from the inference model. This requires inexpensive sampling from the inference model. Since expectations are calculated only with respect to the simple initial distribution $q_{0}$, both of these requirements are easily satisfied. <br />
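The triangular-Jacobian claim is easy to verify numerically. The sketch below builds $\mu$ and $\sigma$ from strictly masked linear maps (a stand-in for a trained MADE-style network; the weights are random and purely illustrative) and checks that the Jacobian of $f(h) = \mu(h) + \sigma(h)\odot h$ is lower triangular with $\sigma$ on the diagonal, so that the log-determinant is $\sum_{i}\log\sigma_{i}$.<br />

```python
import numpy as np

# One IAF-style step f(h) = mu(h) + sigma(h) * h, where mu and sigma come
# from strictly-masked linear "autoregressive networks": entry i depends
# only on h_{1:i-1}. We check numerically that the Jacobian is lower
# triangular with sigma on the diagonal. (Weights are random placeholders.)

rng = np.random.default_rng(0)
D = 4
mask = np.tril(np.ones((D, D)), k=-1)     # strictly lower triangular mask
W_mu = rng.normal(size=(D, D)) * mask
W_s = rng.normal(size=(D, D)) * mask

def mu_sigma(h):
    mu = W_mu @ h
    sigma = np.exp(0.1 * (W_s @ h))       # positive scales
    return mu, sigma

def f(h):
    mu, sigma = mu_sigma(h)
    return mu + sigma * h

h = rng.normal(size=D)
# Numerical Jacobian of f at h via central differences.
eps = 1e-6
J = np.zeros((D, D))
for j in range(D):
    e = np.zeros(D)
    e[j] = eps
    J[:, j] = (f(h + e) - f(h - e)) / (2 * eps)

_, sigma = mu_sigma(h)
print(np.abs(np.triu(J, k=1)).max())           # ~0: J is lower triangular
print(np.diag(J), sigma)                       # diagonal equals sigma
print(np.log(np.abs(np.linalg.det(J))), np.log(sigma).sum())
```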
<br />
==Inverse Autoregressive Transformations, or Whence Inverse Autoregressive Flow?==<br />
<br />
Once we have the inverse autoregressive flow, the main result of \cite{946paper} falls out. Let us consider how we could have come up with the idea of the inverse autoregressive flow. It will be helpful to start with a discussion of \textbf{autoregressive neural networks}, which we briefly alluded to when defining the flow. As the Latin prefix suggests, autoregression means that we deduce components of a random vector $h$ based on its \textit{own} components. More precisely, the $d^{th}$ element of $h$ depends on the preceding components $h_{1:d-1}$.<br />
<br />
To elucidate this further, we shall follow the introductory exposition presented in \cite{MADE}. Let us consider a very simple autoencoder with just one hidden layer. That is, we have a feedforward neural network defined by<br />
\begin{align*}<br />
r(h) &:= g(b + Wh) \\<br />
\hat{h} & := \mathrm{sigm}(c + Vr(h)),<br />
\end{align*} <br />
where $W$, $V$ are matrices of weights, $b$, $c$ are biases, $g$ is some non-linearity and $\mathrm{sigm}$ is element-wise sigmoid. Here, $r(h)$ is thought of as a hidden representation of the input $h$ and $\hat{h}$ is a reconstructed version of $h$. For simplicity, suppose that $h$ is a $D$-dimensional binary vector. Then we can measure the quality of our reconstruction using cross-entropy<br />
\[<br />
l(h) := - \sum\limits^{D}_{d=1} \left[ h_{d} \log \hat{h}_{d} + (1 - h_{d}) \log (1 - \hat{h}_{d}) \right].<br />
\]<br />
It is tempting to interpret $l(h)$ as a negative log-likelihood induced by the distribution<br />
\[<br />
\prod\limits^{D}_{d=1} \hat{h}_{d}^{h_{d}} (1 - \hat{h}_{d})^{1 - h_{d}}.<br />
\]<br />
However, absent restrictions on the above expression, this is in general \textit{not} the case. As an example, suppose our hidden layer has as many units as the input layer. Then it is possible to drive the cross-entropy loss to $0$ by copying the input into the hidden layer, so that $\hat{h} = h$. In this situation, the product above equals $1$ for every possible $h$, and so it does not define a probability distribution. <br />
<br />
If $l(h)$ is indeed to be a negative log-likelihood, i.e.,<br />
\[<br />
l(h) = - \log p(h)<br />
\]<br />
for a genuine probability distribution $p(h)$, it must satisfy<br />
\begin{align*}<br />
l(h) = & - \sum\limits^{D}_{d=1} \log p(h_{d} \mid h_{1:d-1}) \\<br />
= & - \sum\limits^{D}_{d=1} h_{d} \log p(h_{d} = 1 \mid h_{1:d-1}) + (1 - h_{d}) \log p(h_{d} = 0 \mid h_{1:d-1}) \\<br />
= & - \sum\limits^{D}_{d=1} h_{d} \log p(h_{d} = 1 \mid h_{1:d-1}) + (1 - h_{d})\log\left(1 - p(h_{d} = 1 \mid h_{1:d-1})\right).<br />
\end{align*}<br />
The first equation is just the chain rule of probability<br />
\[<br />
p(h) = \prod\limits^{D}_{d=1} p(h_{d} \mid h_{1:d-1}),<br />
\]<br />
the second equation is true because of our assumption that each entry of $h$ is either $0$ or $1$, and the third equation holds because $p(h_{d} = 0 \mid h_{1:d-1}) = 1 - p(h_{d} = 1 \mid h_{1:d-1})$. Comparing the naive cross-entropy loss<br />
\[<br />
- \sum\limits^{D}_{d=1} \left[ h_{d} \log \hat{h}_{d} + (1 - h_{d}) \log (1 - \hat{h}_{d}) \right]<br />
\]<br />
with the term<br />
\[<br />
- \sum\limits^{D}_{d=1} \left[ h_{d} \log p(h_{d} = 1 \mid h_{1:d-1}) + (1 - h_{d})\log\left(1 - p(h_{d} = 1 \mid h_{1:d-1})\right) \right],<br />
\]<br />
we see that a correct reconstruction (``correct" in the sense that the loss function is a negative log-likelihood) needs to satisfy<br />
\[<br />
\hat{h}_{d} = p(h_{d} = 1 \mid h_{1:d-1}).<br />
\] <br />
<br />
More generally, for a (deep) autoencoder we can require the reconstructed vector to have components satisfying<br />
\[<br />
\hat{h}_{d} = p(h_{d} \mid h_{1:d-1}).<br />
\]<br />
In other words, the $d^{th}$ component is the probability of observing $h_{d}$ given the preceding components $h_{1:d-1}$. This latter property is known as the \textbf{autoregressive property} since we can think of it as sequentially performing regression on the components of $h$. Unsurprisingly, an autoencoder satisfying the autoregressive property is called an \textbf{autoregressive autoencoder}.<br />
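The chain-rule factorization also explains why an autoregressive model is automatically normalized. The toy numpy check below defines each conditional $p(h_{d} = 1 \mid h_{1:d-1})$ through an illustrative (random) logistic function of the preceding bits and verifies that the product of conditionals sums to one over all $2^{D}$ configurations.<br />

```python
import numpy as np
from itertools import product

# An autoregressive model over binary vectors: p(h_d = 1 | h_{1:d-1}) is a
# logistic function of the preceding bits (random weights, illustrative only).
# The strictly lower triangular weight matrix enforces that row d uses
# bits 1..d-1 only. We verify the product of conditionals is normalized.

rng = np.random.default_rng(0)
D = 4
W = np.tril(rng.normal(size=(D, D)), k=-1)
c = rng.normal(size=D)

def prob(h):
    p = 1.0
    for d in range(D):
        p1 = 1.0 / (1.0 + np.exp(-(W[d] @ h + c[d])))  # p(h_d = 1 | h_{1:d-1})
        p *= p1 if h[d] == 1 else (1.0 - p1)
    return p

total = sum(prob(np.array(h)) for h in product([0, 1], repeat=D))
print(total)  # 1.0 up to rounding: a genuine probability distribution
```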
<br />
Suppose now that we have an autoregressive autoencoder which takes an input vector $\mathbf{y} \in \mathbb{R}^{D}$ and we interpret the outputs of this network as parameters for a normal distribution. Write $[\mathbf{\mu}(\mathbf{y}),\mathbf{\sigma}(\mathbf{y})]$ for such output. The autoregressive structure implies that, for $i \in \{1, \ldots, D\}$, the outputs $[\mathbf{\mu}_{i}, \mathbf{\sigma}_{i}]$ depend only on the components $\mathbf{y}_{1:i-1}$. Therefore, if we take the vector $[\mathbf{\mu}_{i}, \mathbf{\sigma}_{i}]$ and compute the derivative with respect to $\mathbf{y}$, we will obtain a lower triangular matrix since<br />
\[<br />
\frac{\partial [\mathbf{\mu}_{i}, \mathbf{\sigma}_{i}]}{\partial \mathbf{y}_{j}} = [0, 0]<br />
\]<br />
whenever $j \geq i$. We interpret the vector $[\mathbf{\mu}_{i}(\mathbf{y}_{1:i-1}), \mathbf{\sigma}_{i}(\mathbf{y}_{1:i-1})]$ as being the predicted mean and standard deviation of the $i^{th}$ element of (the reconstruction of) $\mathbf{y}$. In slightly more detail, the components of $\mathbf{y}$ are successively generated via<br />
\begin{align*}<br />
& \mathbf{y}_{0} = \mathbf{\mu}_{0} + \mathbf{\sigma}_{0} \cdot \mathbf{\epsilon}_{0}, \\<br />
& \mathbf{y}_{i} = \mathbf{\mu}_{i}(\mathbf{y}_{1:i-1}) + \mathbf{\sigma}_{i}(\mathbf{y}_{1:i-1}) \cdot \mathbf{\epsilon}_{i},<br />
\end{align*}<br />
where $\mathbf{\epsilon} \!\sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. <br />
<br />
To relate this back to the normalizing flow chosen by the authors of \cite{946paper}, replace $\mathbf{y}$ with $h_{k}$ as input to the autoregressive autoencoder and replace the outputs $\mathbf{\mu}$, $\mathbf{\sigma}$ with $\mu_{k}$, $\sigma_{k}$.<br />
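A useful way to see why this is called an \textit{inverse} autoregressive transformation: sampling $\mathbf{y}$ requires a sequential loop over components, but recovering $\mathbf{\epsilon}$ from $\mathbf{y}$ is a single parallel pass, since $\mathbf{\mu}$ and $\mathbf{\sigma}$ depend on $\mathbf{y}$ (which is fully known) rather than on $\mathbf{\epsilon}$. The sketch below demonstrates this with masked linear maps standing in for the autoregressive network (random weights, purely illustrative).<br />

```python
import numpy as np

# Sequential sampling from an autoregressive Gaussian (D steps) versus its
# *parallel* inverse eps = (y - mu(y)) / sigma(y): because mu and sigma are
# functions of y alone, the inverse needs no loop.

rng = np.random.default_rng(0)
D = 5
mask = np.tril(np.ones((D, D)), k=-1)     # row i sees components < i only
W_mu = rng.normal(size=(D, D)) * mask
W_s = 0.1 * rng.normal(size=(D, D)) * mask

def mu_sigma(y):
    return W_mu @ y, np.exp(W_s @ y)

def sample(eps):
    """Generate y from noise eps, one component at a time."""
    y = np.zeros(D)
    for i in range(D):
        mu, sigma = mu_sigma(y)           # row i only uses y[:i]
        y[i] = mu[i] + sigma[i] * eps[i]
    return y

def invert(y):
    """Recover eps from y in a single parallel pass."""
    mu, sigma = mu_sigma(y)
    return (y - mu) / sigma

eps = rng.normal(size=D)
y = sample(eps)
print(invert(y) - eps)  # ~0: the parallel pass inverts sequential sampling
```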
<br />
== Implementation == <br />
[[File:implementation.PNG]]<br />
<br />
== Experiments==<br />
The authors showcase the improvement to variational autoencoders afforded by their proposed IAF methodology with empirical evidence on two key benchmark datasets: MNIST and CIFAR-10.<br />
<br />
'''MNIST DATASET:'''<br />
The results of the experiments the authors conducted are displayed in the figure below. [[File:IAF MINST.PNG]]<br />
<br />
These provide some evidence of the improvement that IAF brings to variational inference when compared with other state-of-the-art techniques.<br />
<br />
'''CIFAR 10:'''<br />
<br />
The results for the CIFAR-10 dataset are displayed in the figure below. [[File:CIFAR 10.PNG]]<br />
<br />
==Concluding Remarks==<br />
<br />
In wrapping up, we note that there is something interesting about how the normalizing flow is derived. Essentially, the authors of \cite{946paper} took a neural network model with nice properties (fast sampling, simple Jacobian, etc.), looked at the function it implemented and basically dropped in this function in the recursive definition of the normalizing flow. This is not an isolated case. The authors of \cite{normalizing_flow} do much the same thing in coming up with the flow <br />
\[<br />
f_{k}(h_{k-1}) = h_{k-1} + u_{k}s(w_{k}^{T}h_{k-1} + b_{k}).<br />
\]<br />
We believe that this flow is implicitly justified by the fact that functions of the above form are implemented by deep latent Gaussian models (see \cite{autoencoderRezende}). These flows, while interesting and useful, probably do not exhaust the possibilities for tractable and practical normalizing flows. It may be an interesting project to try and come up with novel normalizing flows by taking a favorite neural network architecture and using the function implemented by it as a flow. Additionally, it may be worth exploring boutique normalizing flows to improve variational inference in domain-specific settings (e.g., use a normalizing flow induced by a parsimonious convolutional neural network architecture for training an image-processing model using variational inference).<br />
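For concreteness, the determinant of this planar flow's Jacobian can be computed in $O(D)$ time via the matrix determinant lemma, which is what makes the flow practical. The numpy sketch below (with $s = \tanh$ and random, illustrative parameters) verifies the lemma against a dense determinant.<br />

```python
import numpy as np

# Planar flow f(h) = h + u * tanh(w.h + b) has Jacobian
# I + tanh'(w.h + b) * u w^T. The matrix determinant lemma gives its
# determinant in O(D): 1 + tanh'(w.h + b) * (u.w). Parameters are random
# illustrative values, not trained quantities.

rng = np.random.default_rng(0)
D = 6
u, w = rng.normal(size=D), rng.normal(size=D)
b = rng.normal()
h = rng.normal(size=D)

a = np.tanh(w @ h + b)
J = np.eye(D) + (1 - a ** 2) * np.outer(u, w)   # tanh' = 1 - tanh^2
det_full = np.linalg.det(J)                     # O(D^3) dense determinant
det_lemma = 1 + (1 - a ** 2) * (u @ w)          # O(D) via the lemma
print(det_full, det_lemma)
```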
<br />
== List of Figures == <br />
[[File:fig1.PNG]]<br />
[[File:fig2.PNG]]<br />
[[File:fig3.PNG]]<br />
<br />
==References==<br />
<br />
<references/></div>
<hr />
<div>==Introduction==<br />
<br />
One of the most common ways to formalize machine learning models is through the use of $\textbf{latent variable models}$, wherein we have a probabilistic model for the joint distribution between observed datapoints $x$ and some $\textit{hidden variables}$. The intuition is that the hidden variables share some sort of (perhaps prolix) causal relationship with the variables that are actually observed. The $\textbf{mixture of Gaussians}$ provides a particularly nice example of a latent variable model. One way to think about a mixture with $K$ Gaussians is as follows. First, roll a $K$-sided die and suppose that the result is $k$ with probability $\pi_{k}$. Then randomly generate a point from the Gaussian distribution with parameters $\mu_{k}$ and $\Sigma_{k}$. The reason this is a hidden variable model is that, when we have a dataset coming from a mixture of Gaussians, we only get to see the datapoints that are generated at the end. For a given observed datapoint we neither get to see the die that is rolled in generating that point nor do we know what the probabilities $\pi_{k}$ are. The $\pi_{k}$ are therefore hidden variables and, together with estimation of the parameters $\mu_{k}$, $\Sigma_{k}$ determining observations, estimating the $\pi_{k}$ constitutes inference within the mixture of Gaussians model. Note that all the parameters to be estimated can be wrapped into a long vector $\theta = (\pi_{1}, \ldots, \pi_{K}, \mu_{1}, \Sigma_{1}, \ldots, \mu_{K}, \Sigma_{K})$.<br />
<br />
More generally, latent variable models provide a powerful framework to mathematically encode a variety of phenomena which are naturally subject to stochasticity. Thus, they form an important part of the theory underlying many machine learning models. Indeed, it can even be said that most machine learning models, when viewed appropriately, are latent variable models. Therefore, it behoves us to obtain general methods which allow tractable inference within latent variable models. One such method is known as $\textbf{variational inference}$ and it, in its modern form, was introduced to machine learning around two decades ago in the seminal paper [jordanVI]. More recently, and more apropos of deep learning, stochastic versions of variational inference are being combined with neural networks to provide robust estimation of parameters in probabilistic models. The original impetus for this fusion apparently stems from publication of [autoencoderKingma] and [autoencoderRezende]. In the interim, a cottage industry for application of stochastic variational inference or methods related to it have seemingly sprung up, especially as witnessed by the variety of autoencoders currently being sold at the bazaar. The paper [946paper] represents another interesting contribution in parameter estimation by way of deep learning. Note that, at time of writing, variational methods are being applied to a wide range of problems in machine learning and we will only develop the small part of it necessary for our purposes. But refer to [VISurvey] for a survey.<br />
<br />
==Black-box Variational Inference==<br />
<br />
The basic premise we start from is that we have a latent variable model $p_{\theta}(x, h)$, often called the \textbf{generative model} in the literature, with $x$ the observed variables and $h$ the hidden variables, and we wish to learn the parameters $\theta$. We also assume we are in a situation where the usual strategy of inference by maximum likelihood estimation is infeasible due to intractability of marginalization of the hidden variables. This assumption often holds in real-world applications since generative models for real phenomena are extremely difficult or impossible to integrate. Additionally, we would like to be able to compute the posterior $p(h\mid x)$ over hidden variables and, by Bayes' rule, this requires computation of the marginal distribution $p_{\theta}(x)$. <br />
<br />
The variational inference approach entails positing a parametric family $q_{\phi}(h\mid x)$, also called the \textbf{inference model}, of distributions and introducing new learning parameters $\phi$ which obtain as solutions to an optimization problem. More precisely, we minimize the KL divergence between the true posterior and the approximate posterior. However, we can think of a slightly indirect approach. We can find a generic lower bound for the log-likelihood $\log p_{\theta}(x)$ and optimize for this lower bound. Observe that, for any parametrized distribution $q_{\phi}(h\mid x)$, we have<br />
\begin{align*}<br />
\log p_{\theta}(x) &= \log\int_{h}p_{\theta}(x,h) \\<br />
&= \log\int_{h}p_{\theta}(x,h)\frac{q_{\phi}(h\mid x)}{q_{\phi}(h\mid x)} \\<br />
&= \log\int_{h}\frac{p_{\theta}(x,h)}{q_{\phi}(h\mid x)}q_{\phi}(h\mid x) \\<br />
&= \log\mathbb{E}_{q}\left[\frac{p_{\theta}(x,h)}{q_{\phi}(h\mid x)}\right] \\<br />
&\geq \mathbb{E}_{q}\left[\log\frac{p_{\theta}(x,h)}{q_{\phi}(h\mid x)}\right] \\<br />
&= \mathbb{E}_{q}\left[\log p_{\theta}(x,h) - \log q_{\phi}(h\mid x)\right] \\<br />
&:= \mathcal{L}(x, \theta, \phi),<br />
\end{align*}<br />
where the inequality is an application of Jensen's inequality for the logarithm function and $\mathcal{L}(x,\theta,\phi)$ is known as the \textbf{evidence lower bound (ELBO)}. Clearly, if we iteratively choose values for $\theta$ and $\phi$ such that $\mathcal{L}(x,\theta,\phi)$ increases, then we will have found values for $\theta$ such that the log-likelihood $\log p_{\theta}(x)$ is non-decreasing (that is, there is no guarantee that a value for $\theta$ which increases $\mathcal{L}(x,\theta,\phi)$ will also increase $\log p_{\theta}(x)$ but there \textit{is} a guarantee that $\log p_{\theta}(x)$ will not decrease). The natural search strategy now is to use stochastic gradient ascent on $\mathcal{L}(x,\theta,\phi)$. This requires the derivatives $\nabla_{\theta}\mathcal{L}(x,\theta,\phi)$ and $\nabla_{\phi}\mathcal{L}(x,\theta,\phi)$. <br />
<br />
Before moving on, we note that there are alternative ways of expressing the ELBO which can either provide insights or aid us in further calculations. For one alternative form, note that we can massage the ELBO like so.<br />
\begin{align*}<br />
\mathcal{L}(x, \theta, \phi) = & \mathbb{E}_{q} \left[ \log p_{\theta}(x, h) - \log q_{\phi}(h \mid x) \right] \\<br />
= & \mathbb{E}_{q} \left[ \log p_{\theta}(h \mid x) p(x) - \log q_{\phi}(h \mid x) \right] \\<br />
= & \mathbb{E}_{q} \left[ \log p_{\theta}(h \mid x) + \log p(x) - \log q_{\phi}(h \mid x) \right] \\<br />
= & \mathbb{E}_{q} \left[ \log p(x) - \log q_{\phi}(h \mid x) + \log p_{\theta}(h \mid x) \right] \\<br />
= & \mathbb{E}_{q} \left[ \log p(x) \right] - \mathbb{E}_{q} \left[ \log q_{\phi}(h \mid x) - \log p_{\theta}(h \mid x) \right] \\<br />
= & \log p(x) - \mathrm{KL}\left[ q_{\phi}(h \mid x) || p(h \mid x) \right].<br />
\end{align*}<br />
The last expression has a very simple interpretation: maximizing $\mathcal{L}(x, \theta, \phi)$ is equivalent to minimizing the KL divergence between the approximate posterior $q_{\phi}$ and the actual posterior $p_{\theta}(h \mid x)$. In fact, we can rewrite the above equation as a ``conservation law"<br />
\[<br />
\mathcal{L}(x, \theta, \phi) + \mathrm{KL} \left[ q_{\phi}(h \mid x) || p(h \mid x) \right] = \log p(x).<br />
\]<br />
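This conservation law is easy to check numerically on a model small enough to marginalize by brute force. The sketch below uses a toy three-state latent variable (all numbers illustrative) and an arbitrary $q$, and confirms that the ELBO and the KL divergence always sum to $\log p(x)$.<br />

```python
import numpy as np

# Numeric check of ELBO + KL(q || posterior) = log p(x) on a toy discrete
# model: h in {0,1,2} with prior pi, x | h ~ N(mu_h, 1), and an arbitrary
# approximate posterior q. All values are illustrative.

pi = np.array([0.2, 0.5, 0.3])            # p(h)
mu = np.array([-2.0, 0.0, 2.0])           # x | h ~ N(mu_h, 1)
q = np.array([0.6, 0.3, 0.1])             # q(h|x): any distribution works
x = 0.7

log_lik = -0.5 * np.log(2 * np.pi) - 0.5 * (x - mu) ** 2
log_joint = np.log(pi) + log_lik          # log p(x, h)
log_px = np.log(np.exp(log_joint).sum())  # marginal log-likelihood
post = np.exp(log_joint - log_px)         # true posterior p(h|x)

elbo = (q * (log_joint - np.log(q))).sum()
kl = (q * (np.log(q) - np.log(post))).sum()
print(elbo + kl, log_px)  # the two sides agree for any choice of q
```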
On the other hand, we can also do<br />
\begin{align*}<br />
\mathcal{L}(x, \theta, \phi) = & \mathbb{E}_{q} \left[ \log p_{\theta}(x, h) - \log q_{\phi}(h \mid x) \right] \\<br />
= & \mathbb{E}_{q} \left[ \log p_{\theta}(x \mid h) p_{\theta}(h) - \log q_{\phi}(h \mid x) \right] \\<br />
= & \mathbb{E}_{q} \left[ \log p_{\theta}(x \mid h) + \log p_{\theta}(h) - \log q_{\phi}(h \mid x) \right] \\<br />
= & \mathbb{E}_{q} \left[ \log p_{\theta}(x \mid h) - \log q_{\phi}(h \mid x) + \log p_{\theta}(h) \right] \\<br />
= & \mathbb{E}_{q} \left[ \log p_{\theta}(x \mid h) \right] - \mathrm{KL}\left[ q_{\phi}(h \mid x) || p_{\theta}(h) \right].<br />
\end{align*} <br />
and the hermeneutics here is a bit more interesting. Recall that $q_{\phi}(h \mid x)$ is a distribution we get to choose and choosing a ``good" distribution means choosing something which we believe is faithful to the way observations get ``encoded" or ``compressed" into ``hidden representations". Conversely, $p_{\theta}(x \mid h)$ may be thought as a ``decoder" which unpacks latent ``codes" into observations. Thus, we can think of $\mathbb{E}_{q} \left[ \log p_{\theta}(x \mid h) \right]$ as the expected reconstruction error when we use $q_{\phi}$ as an encoder. The KL term is now interpreted as a regularizer which restricts divergence of the encoder from the prior distribution over latent codes. Note that these remarks simply provide an intuition and even though we use descriptors such as ``encoder" and ``decoder", there is no \textit{a priori} reason to implement the distributions involved as encoder and decoder networks as in an autoencoder. Indeed, there is nothing preventing us from even letting $q_{\phi}$ compute an ``overcomplete" feature representation $h$ of $x$ (i.e., dimensionality of $h$ is greater than that of $x$).<br />
<br />
<br />
Observe that everything we have done so far is completely general and independent of any specific modelling assumptions we may have had to make. Indeed, it is the model independence of this approach which led Ranganath et al. <ref name="bbvi"> Rajesh Ranganath, Sean Gerrish and David M. Blei. Black Box Variational Inference. Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, {AISTATS} 2014, Reykjavik, Iceland, April 22-25, 2014</ref> \cite{bbvi} to christen it \textbf{black-box variational inference}. The price we pay for such generality is that we have to calculate expectations against the distribution $q_{\phi}$. Broadly speaking, such expectations are intractable integrals for any approximate posterior approaching verisimilitude. However, the notion of \textbf{normalizing flow} represents a technical innovation which allows use of flexible posteriors while maintaining tractability through calculation of expectations against simple distributions (e.g., Gaussians) only.<br />
<br />
The Black-box Variational Inference framework actually can imply the expectation maximization algorithm, which is an iterative method to find maximum likelihood and maximum a posteriori in the statistical model with latent variables. It has been successful implementation on the problems like Gaussian mixture models. <br />
<br />
Recall that the log-likelihood decomposes as<br />
\[<br />
\log p_{\theta}(x) = \mathcal{L}(x,\theta,\phi) + \mathrm{KL}[q_{\phi}(h \mid x) || p_{\theta}(h \mid x)].<br />
\]<br />
As mentioned before, with the current estimate $\theta_k$ held fixed, we set $q_{\phi_k}(h\mid x) = p_{\theta_k}(h \mid x)$ to minimize the KL divergence and make the bound tight; this is the expectation step. The maximization step then maximizes<br />
\[<br />
\mathcal{L}(x, \theta,\phi_k) = \mathbb{E}_{q_{\phi_k}} [\log p_{\theta}(x, h)] + \mathrm{const},<br />
\]<br />
where the constant is the entropy of $q_{\phi_k}$ and does not depend on $\theta$. That is, $\theta_{k+1}$ is updated in the maximization step as<br />
\[<br />
\theta_{k+1} = \arg\max_{\theta} \mathbb{E}_{q_{\phi_k}} \left[ \log p_{\theta}(x, h) \right].<br />
\]<br />
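To make the two steps concrete, here is a toy EM implementation for a two-component one-dimensional Gaussian mixture (purely illustrative; the data and initialization are made up):<br />
<br />
```python
import math

def em_gmm_1d(data, iters=100):
    """EM for a 2-component 1-D Gaussian mixture.
    E-step: set q(h|x) to the current posterior responsibilities.
    M-step: maximize E_q[log p(x,h)] in closed form."""
    pi = [0.5, 0.5]
    mu = [min(data), max(data)]
    var = [1.0, 1.0]
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            w = [pi[k] * math.exp(-(x - mu[k]) ** 2 / (2 * var[k]))
                 / math.sqrt(2 * math.pi * var[k]) for k in (0, 1)]
            s = w[0] + w[1]
            resp.append([w[0] / s, w[1] / s])
        # M-step: closed-form updates of weights, means and variances.
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2
                         for r, x in zip(resp, data)) / nk
    return pi, mu, var

data = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
pi, mu, var = em_gmm_1d(data)
print(mu)  # means close to 1.0 and 5.0
```
<br />
Each E-step is exactly the choice $q_{\phi_k}(h \mid x) = p_{\theta_k}(h \mid x)$ above, computed pointwise, and each M-step maximizes the expected complete-data log-likelihood in closed form.<br />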
<br />
==Variational Inference using Normalizing Flows==<br />
<br />
While our main goal is to describe \cite{946paper}, the paper \cite{normalizing_flow} provides the necessary conceptual backdrop for \cite{946paper} and we will now take a detour through some of the points presented in the latter. The main contribution of \cite{normalizing_flow} lies in a novel technique for creating a rich class of approximate posteriors starting from relatively simple ones. This is important since one of the main drawbacks to the variational approach is that it requires assumptions on the form of the approximate posterior $q_{\phi}(h\mid x)$ and practicality often forces us to stick to simple distributions which fail to capture rich, multimodal properties of the true posterior $p(h \mid x)$. The primary technical tool used in \cite{normalizing_flow} to achieve complexity in the approximate posterior is what we earlier referred to as normalizing flow, which entails using a series of invertible functions to transform simple probability densities into more complex densities. <br />
<br />
Suppose we have a random variable $h$ with probability density $q(h)$ and that $f:\mathbb{R}^{d}\to\mathbb{R}^{d}$ is an invertible function with inverse $g:\mathbb{R}^{d}\to\mathbb{R}^{d}$. A basic result in probability states that the random variable $h'\colon = f(h)$ has distribution $$q'(h') = q(h)\Bigg\lvert\det\frac{\partial f}{\partial h}\Bigg\rvert^{-1}.$$ Chaining together a (finite) sequence of invertible maps $f_{1},\ldots,f_{K}$ and applying it to the distribution $q_{0}$ of a random variable $h_{0}$ leads to the formula $$q_{K}(h_{K}) = q_{0}(h_{0})\prod\limits^{K}_{k=1}\Bigg\lvert\det\frac{\partial f_{k}}{\partial h_{k-1}}\Bigg\rvert^{-1},$$ where $h_{k}\colon = f_{k}(h_{k-1})$ and $q_{k}$ is the distribution associated to $h_{k}$. We can equivalently rewrite the above equation as $$\log q_{K}(h_{K}) = \log q_{0}(h_{0}) - \sum\limits^{K}_{k=1}\log\Bigg\lvert\det\frac{\partial f_{k}}{\partial h_{k-1}}\Bigg\rvert.$$ So, if we were to start from a simple distribution $q_{0}(h_{0})$, choose a sequence of functions $f_{1},\ldots,f_{K}$ and then \textit{define} $$q_{\phi}(h \mid x)\colon = q_{K}(h_{K}),$$ we can manipulate the ELBO as follows:<br />
\begin{align*}<br />
\mathcal{L}(x,\theta,\phi) &= \mathbb{E}_{q}\left[\log p_{\theta}(x,h) - \log q_{\phi}(h\mid x)\right] \\<br />
&= \mathbb{E}_{q_{K}}\left[\log p_{\theta}(x,h) - \log q_{K}(h_{K})\right] \\<br />
&= \mathbb{E}_{q_{K}}\left[\log p_{\theta}(x,h)\right] - \mathbb{E}_{q_{K}}\left[\log q_{K}(h_{K})\right].<br />
\end{align*} <br />
The reason this is a useful thing to do is the \textbf{law of the unconscious statistician (LOTUS)} as applied to $q_{K} = f_{K}\circ\cdots\circ f_{1}(q_{0})$: $$\mathbb{E}_{q_{K}}\left[s(h_{K})\right] = \mathbb{E}_{q_{0}}\left[s(f_{K}\circ\cdots\circ f_{1}(h_{0}))\right]$$ assuming that $s$ does not depend on $q_{K}$. Hence,<br />
\begin{align*}<br />
\mathcal{L}(x,\theta,\phi) &= \mathbb{E}_{q_{K}}\left[\log p_{\theta}(x,h)\right] - \mathbb{E}_{q_{K}}\left[\log q_{K}(h_{K})\right] \\<br />
&= \mathbb{E}_{q_{K}}\left[\log p_{\theta}(x,h)\right] - \mathbb{E}_{q_{K}}\left[\log q_{0}(h_{0}) - \sum\limits^{K}_{k=1}\log\Bigg\lvert\det\frac{\partial f_{k}}{\partial h_{k-1}}\Bigg\rvert\right] \\<br />
&= \mathbb{E}_{q_{K}}\left[\log p_{\theta}(x,h)\right] - \mathbb{E}_{q_{K}}\left[\log q_{0}(h_{0})\right] + \mathbb{E}_{q_{K}}\left[\sum\limits^{K}_{k=1}\log\Bigg\lvert\det\frac{\partial f_{k}}{\partial h_{k-1}}\Bigg\rvert\right] \\<br />
&= \mathbb{E}_{q_{0}}\left[\log p_{\theta}(x,h)\right] - \mathbb{E}_{q_{0}}\left[\log q_{0}(h_{0})\right] + \mathbb{E}_{q_{0}}\left[\sum\limits^{K}_{k=1}\log\Bigg\lvert\det\frac{\partial f_{k}}{\partial h_{k-1}}\Bigg\rvert\right] <br />
\end{align*}<br />
and we are only computing expectations with respect to the simple distribution $q_{0}$. The latter expression for the ELBO is called the \textbf{flow-based free energy bound} in \cite{normalizing_flow}. Note that $\phi$ has apparently disappeared in the final expression even though it is still present in $\mathcal{L}(x,\theta,\phi)$. This is an illusion: the parameters $\phi$ are associated to $q_{\phi}(h\mid x) = q_{K}(h_{K})$ and $q_{K}(h_{K})$ depends on the quantities $q_{0}(h_{0})$ and $\frac{\partial f_{k}}{\partial h_{k-1}}$. Thus, $\phi$ now encapsulates the defining parameters of $q_{0}$ and the $f_{k}$. <br />
<br />
To summarize, we can start with a very simple distribution $q_{0}$ against which expectations are easy to calculate and if we can cleverly choose a series of invertible functions $\{f_{k}\}^{K}_{k=1}$ for which it is easy to compute determinants of the Jacobians, we can get a relatively rich approximate posterior $q_{K}$ such that the ELBO and its gradients are tractable. We should now ask: what is a suitable family of functions which can serve as a normalizing flow?<br />
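As a sanity check on this bookkeeping, the log-density formula can be verified numerically for a chain of one-dimensional affine maps $f_{k}(h) = a_{k}h + b_{k}$, for which $\lvert\det \partial f_{k}/\partial h\rvert = \lvert a_{k}\rvert$ (the constants below are arbitrary illustrations):<br />
<br />
```python
import math

def log_std_normal(h):
    """log q_0(h) for a standard normal base distribution."""
    return -0.5 * h * h - 0.5 * math.log(2 * math.pi)

flows = [(2.0, 1.0), (0.5, -3.0), (3.0, 0.2)]  # (a_k, b_k) pairs

def flow_log_density(h0):
    """Push h_0 through the chain, accumulating
    log q_K(h_K) = log q_0(h_0) - sum_k log|a_k|."""
    h, log_q = h0, log_std_normal(h0)
    for a, b in flows:
        h = a * h + b
        log_q -= math.log(abs(a))
    return h, log_q

hK, log_qK = flow_log_density(0.7)

# Affine maps of a Gaussian stay Gaussian, so the flow density must
# match the exact density of N(B, A^2), where A and B are the
# composite scale and shift of the chain.
A = 2.0 * 0.5 * 3.0
B = flow_log_density(0.0)[0]     # the image of 0 is the composite shift
exact = log_std_normal((hK - B) / A) - math.log(abs(A))
print(abs(log_qK - exact))  # numerically zero
```
<br />
For genuinely useful flows the transformed density is of course not available in closed form; the point of the formula is precisely that the left-hand side remains computable anyway.<br />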
<br />
==Inverse Autoregressive Flow==<br />
<br />
The answer presented by \cite{946paper} to the last question is<br />
\[<br />
f_{k}(h_{k-1}) := \mu_{k} + \sigma_{k} \odot h_{k-1},<br />
\]<br />
where $\odot$ means element-wise multiplication, $\mu_{k}$, $\sigma_{k}$ are outputs from an autoregressive neural network with inputs $h_{k-1}$ and an extra constant vector $c$ and we initialize with<br />
\[<br />
h_{0} := \mu_{0} + \sigma_{0} \odot \epsilon<br />
\]<br />
such that $\epsilon \!\sim \mathcal{N}(0, I)$. The authors of \cite{946paper} call this series of functions the \textbf{inverse autoregressive flow}. We will solve the mystery of where this definition comes from later. The important point is that the functional form of $f_{k}$ is parametrized by the outputs of an autoregressive neural network and this implies that the Jacobians<br />
\[<br />
\frac{\partial \mu_{k}}{\partial h_{k-1}}, \frac{\partial \sigma_{k}}{\partial h_{k-1}}<br />
\]<br />
are lower triangular with zeroes on the diagonal (this is not a trivial fact since $\mu_{k}$ and $\sigma_{k}$ are some complicated functions of $h_{k-1}$ -- they are outputs of a neural network which takes $h_{k-1}$ as an input). Hence, the derivative<br />
\[<br />
\frac{\partial f_{k}}{\partial h_{k-1}}<br />
\]<br />
is a triangular matrix with the entries of $\sigma_{k}$ occupying the diagonal. The determinant of this is obviously just<br />
\[<br />
\prod\limits^{D}_{i=1} \sigma_{k, i},<br />
\]<br />
and it is very cheap to compute. Note also that the approximate posterior that comes out of this normalizing flow is<br />
\[<br />
\log q_{K}(h_{K}) = - \sum\limits^{D}_{i=1} \left[ \frac{1}{2} \epsilon_{i}^{2} + \frac{1}{2} \log 2\pi + \sum\limits^{K}_{k=1} \log \sigma_{k,i} \right].<br />
\]<br />
In conclusion, we have a nice expression for the ELBO. To round out the parsimony of the ELBO, we need an inference model which is computationally cheap to evaluate. Additionally, we typically do not calculate the ELBO gradients analytically but instead perform Monte Carlo estimation by sampling from the inference model. This requires inexpensive sampling from the inference model. Since expectations are calculated only with respect to the simple initial distribution $q_{0}$, both of these requirements are easily satisfied. <br />
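Putting the pieces together, one step of the flow and its log-density bookkeeping look as follows in a toy sketch. Here `ar_net` is a hypothetical stand-in for the autoregressive network: its outputs $\mu_{k,i}$, $\sigma_{k,i}$ depend only on $h_{1:i-1}$, which is the only property the derivation above uses.<br />
<br />
```python
import math
import random

D, K = 3, 2
rng = random.Random(1)

def ar_net(h):
    """Toy autoregressive map: mu_i and sigma_i depend only on h[:i],
    so their Jacobians with respect to h are strictly lower triangular."""
    mu = [0.1 * sum(h[:i]) for i in range(D)]
    sigma = [math.exp(0.05 * sum(h[:i])) for i in range(D)]  # sigma > 0
    return mu, sigma

# Initialize h_0 = mu_0 + sigma_0 * eps (here mu_0 = 0, sigma_0 = 1).
eps = [rng.gauss(0.0, 1.0) for _ in range(D)]
h = list(eps)
log_q = sum(-0.5 * e * e - 0.5 * math.log(2 * math.pi) for e in eps)

# Each IAF step: h <- mu_k + sigma_k * h, and the log-density drops by
# log|det| = sum_i log sigma_{k,i}, the triangular-Jacobian formula.
for _ in range(K):
    mu, sigma = ar_net(h)
    h = [m + s * hi for m, s, hi in zip(mu, sigma, h)]
    log_q -= sum(math.log(s) for s in sigma)

print(h, log_q)
```
<br />
Sampling and density evaluation of $q_{K}$ thus cost only $K$ forward passes of the autoregressive network plus sums of $\log \sigma_{k,i}$, which is the parsimony alluded to above.<br />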
<br />
==Inverse Autoregressive Transformations or, Whence Inverse Autoregressive Flow?==<br />
<br />
Once we have the inverse autoregressive flow, the main result of \cite{946paper} falls out. Let us consider how we could have come up with the idea of the inverse autoregressive flow. It will be helpful to start with a discussion of \textbf{autoregressive neural networks}, which we briefly alluded to when defining the flow. As the Greek prefix ``auto" suggests, autoregression means that we deduce components of a random vector $h$ based on its \textit{own} components. More precisely, the $d^{th}$ element of $h$ depends on the preceding components $h_{1:d-1}$.<br />
<br />
To elucidate this further, we shall follow the introductory exposition presented in \cite{MADE}. Let us consider a very simple autoencoder with just one hidden layer. That is, we have a feedforward neural network defined by<br />
\begin{align*}<br />
r(h) &:= g(b + Wh) \\<br />
\hat{h} & := \mathrm{sigm}(c + Vr(h)),<br />
\end{align*} <br />
where $W$, $V$ are matrices of weights, $b$, $c$ are biases, $g$ is some non-linearity and $\mathrm{sigm}$ is the element-wise sigmoid. Here, $r(h)$ is thought of as a hidden representation of the input $h$ and $\hat{h}$ is a reconstructed version of $h$. For simplicity, suppose that $h$ is a $D$-dimensional binary vector. Then we can measure the quality of our reconstruction using the cross-entropy<br />
\[<br />
l(h) := - \sum\limits^{D}_{d=1} \left[ h_{d} \log \hat{h}_{d} + (1 - h_{d}) \log (1 - \hat{h}_{d}) \right].<br />
\]<br />
It is tempting to interpret $l(h)$ as a negative log-likelihood induced by the distribution<br />
\[<br />
\prod\limits^{D}_{d=1} \hat{h}_{d}^{h_{d}} (1 - \hat{h}_{d})^{1 - h_{d}}.<br />
\]<br />
However, absent restrictions on the above expression, this is in general \textit{not} the case. As an example, suppose our hidden layer has as many units as the input layer. Then it is possible to drive the cross-entropy loss to $0$ by copying the input into the hidden layer. In this situation, the product above equals $1$ for every possible $h$; since it then sums to $2^{D}$ rather than $1$ over all binary vectors, it does not define a probability distribution. <br />
<br />
If $l(h)$ is indeed to be a negative log-likelihood, i.e.,<br />
\[<br />
l(h) = - \log p(h)<br />
\]<br />
for a genuine probability distribution $p(h)$, it must satisfy<br />
\begin{align*}<br />
l(h) = & - \sum\limits^{D}_{d=1} \log p(h_{d} \mid h_{1:d-1}) \\<br />
= & - \sum\limits^{D}_{d=1} \left[ h_{d} \log p(h_{d} = 1 \mid h_{1:d-1}) + (1 - h_{d}) \log p(h_{d} = 0 \mid h_{1:d-1}) \right] \\<br />
= & - \sum\limits^{D}_{d=1} \left[ h_{d} \log p(h_{d} = 1 \mid h_{1:d-1}) + (1 - h_{d}) \log\left(1 - p(h_{d} = 1 \mid h_{1:d-1})\right) \right].<br />
\end{align*}<br />
The first equation is just the chain rule of probability<br />
\[<br />
p(h) = \prod\limits^{D}_{d=1} p(h_{d} \mid h_{1:d-1}),<br />
\]<br />
the second equation is true because of our assumption that each entry of $h$ is either $0$ or $1$ and the third equation holds because $p(h_{d} = 0 \mid h_{1:d-1}) = 1 - p(h_{d} = 1 \mid h_{1:d-1})$. Comparing the naive cross-entropy loss<br />
\[<br />
- \sum\limits^{D}_{d=1} \left[ h_{d} \log \hat{h}_{d} + (1 - h_{d}) \log (1 - \hat{h}_{d}) \right]<br />
\]<br />
with the term<br />
\[<br />
- \sum\limits^{D}_{d=1} \left[ h_{d} \log p(h_{d} = 1 \mid h_{1:d-1}) + (1 - h_{d}) \log\left(1 - p(h_{d} = 1 \mid h_{1:d-1})\right) \right],<br />
\]<br />
we see that a correct reconstruction (``correct" in the sense that the loss function is a negative log-likelihood) needs to satisfy<br />
\[<br />
\hat{h}_{d} = p(h_{d} = 1 \mid h_{1:d-1}).<br />
\] <br />
<br />
More generally, for a (deep) autoencoder we can require the reconstructed vector to have components satisfying<br />
\[<br />
\hat{h}_{d} = p(h_{d} \mid h_{1:d-1}).<br />
\]<br />
In other words, the $d^{th}$ component is the probability of observing $h_{d}$ given the preceding components $h_{1:d-1}$. This latter property is known as the \textbf{autoregressive property} since we can think of it as sequentially performing regression on the components of $h$. Unsurprisingly, an autoencoder satisfying the autoregressive property is called an \textbf{autoregressive autoencoder}.<br />
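One common way to enforce the autoregressive property inside a network is to mask the weights so that output $d$ can only see inputs $1{:}d-1$, as in the MADE construction of \cite{MADE}. A toy single-layer sketch (the weight values are arbitrary):<br />
<br />
```python
# Strictly-lower-triangular masks enforce the autoregressive property:
# output i may depend only on inputs 0..i-1.
D = 4
mask = [[1.0 if j < i else 0.0 for j in range(D)] for i in range(D)]
W = [[0.3 * (i + j + 1) * mask[i][j] for j in range(D)] for i in range(D)]

def output(h):
    """A single masked linear layer (no bias, no nonlinearity)."""
    return [sum(W[i][j] * h[j] for j in range(D)) for i in range(D)]

h = [1.0, 2.0, 3.0, 4.0]
base = output(h)
h2 = list(h)
h2[2] = 99.0                     # perturb the third input
changed = output(h2)
flags = [abs(a - b) > 1e-9 for a, b in zip(base, changed)]
print(flags)  # only the fourth output may react to the third input
```
<br />
The same masking idea composes across multiple hidden layers, which is what makes deep autoregressive autoencoders practical.<br />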
<br />
Suppose now that we have an autoregressive autoencoder which takes an input vector $\mathbf{y} \in \mathbb{R}^{D}$ and we interpret the outputs of this network as parameters for a normal distribution. Write $[\mathbf{\mu}(\mathbf{y}),\mathbf{\sigma}(\mathbf{y})]$ for such output. The autoregressive structure implies that, for $j \in \{1, \ldots, D\}$, the outputs $[\mathbf{\mu}_{j}(\mathbf{y}), \mathbf{\sigma}_{j}(\mathbf{y})]$ depend only on the components $\mathbf{y}_{1:j-1}$. Therefore, if we take the vector $[\mathbf{\mu}_{i}, \mathbf{\sigma}_{i}]$ and compute the derivative with respect to $\mathbf{y}$, we will obtain a lower triangular matrix since<br />
\[<br />
\frac{\partial [\mathbf{\mu}_{i}, \mathbf{\sigma}_{i}]}{\partial \mathbf{y}_{j}} = [0, 0]<br />
\]<br />
whenever $j \geq i$. We interpret the vector $[\mathbf{\mu}_{i}(\mathbf{y}_{1:i-1}), \mathbf{\sigma}_{i}(\mathbf{y}_{1:i-1})]$ as being the predicted mean and standard deviation of the $i^{th}$ element of (the reconstruction of) $\mathbf{y}$. In slightly more detail, the components of $\mathbf{y}$ are successively generated via<br />
\begin{align*}<br />
& \mathbf{y}_{0} = \mathbf{\mu}_{0} + \mathbf{\sigma}_{0} \cdot \mathbf{\epsilon}_{0}, \\<br />
& \mathbf{y}_{i} = \mathbf{\mu}_{i}(\mathbf{y}_{1:i-1}) + \mathbf{\sigma}_{i}(\mathbf{y}_{1:i-1}) \cdot \mathbf{\epsilon}_{i},<br />
\end{align*}<br />
where $\mathbf{\epsilon} \!\sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. <br />
<br />
To relate this back to the normalizing flow chosen by the authors of \cite{946paper}, replace $\mathbf{y}$ with $h_{k}$ as input to the autoregressive autoencoder and replace the outputs $\mathbf{\mu}$, $\mathbf{\sigma}$ with $\mu_{k}$, $\sigma_{k}$.<br />
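The reason this sequential sampler gives rise to a useful flow is that its \textit{inverse} is not sequential: given $\mathbf{y}$, every $\mathbf{\epsilon}_{i} = (\mathbf{y}_{i} - \mathbf{\mu}_{i}(\mathbf{y}_{1:i-1}))/\mathbf{\sigma}_{i}(\mathbf{y}_{1:i-1})$ depends only on the already-known vector $\mathbf{y}$, so all coordinates can be recovered in parallel. A toy sketch (the function `mu_sigma` is an illustrative stand-in for the autoregressive network):<br />
<br />
```python
import math

def mu_sigma(prefix):
    """Toy autoregressive statistics computed from y_{1:i-1}."""
    return 0.5 * sum(prefix), math.exp(0.1 * len(prefix))

D = 4
eps = [0.3, -1.2, 0.7, 0.05]

# Forward sampling is inherently sequential: y_i needs y_{1:i-1}.
y = []
for i in range(D):
    m, s = mu_sigma(y[:i])
    y.append(m + s * eps[i])

# Inversion only ever reads the known vector y, so each iteration below
# is independent of the others and could run in parallel.
recovered = []
for i in range(D):
    m, s = mu_sigma(y[:i])
    recovered.append((y[i] - m) / s)

print(recovered)  # matches eps up to rounding
```
<br />
This inverse direction is exactly the transformation used as a flow step in \cite{946paper}, which is why density evaluation under IAF is cheap even in high dimensions.<br />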
<br />
== Implementation == <br />
[[File:implementation.PNG]]<br />
<br />
== Experiments==<br />
The authors showcase the improvement that the proposed IAF methodology brings to variational autoencoders by providing empirical evidence on two key benchmark datasets: MNIST and CIFAR-10.<br />
<br />
'''MNIST DATASET:'''<br />
The results of the experiments the authors conducted are displayed in the figure below. [[File:IAF MINST.PNG]]<br />
<br />
These results provide some evidence of the improvement in variational inference via IAF when compared with other state-of-the-art techniques.<br />
<br />
'''CIFAR 10:'''<br />
<br />
The results for the CIFAR-10 dataset are displayed in the figure below. [[File:CIFAR 10.PNG]]<br />
<br />
==Concluding Remarks==<br />
<br />
In wrapping up, we note that there is something interesting about how the normalizing flow is derived. Essentially, the authors of \cite{946paper} took a neural network model with nice properties (fast sampling, simple Jacobian, etc.), looked at the function it implemented and dropped this function into the recursive definition of the normalizing flow. This is not an isolated case. The authors of \cite{normalizing_flow} do much the same thing in coming up with the flow <br />
\[<br />
f_{k}(h_{k}) = h_{k} + u_{k}s(w_{k}^{T}h_{k} + b_{k}).<br />
\]<br />
We believe that this flow is implicitly justified by the fact that functions of the above form are implemented by deep latent Gaussian models (see \cite{autoencoderRezende}). These flows, while interesting and useful, probably do not exhaust the possibilities for tractable and practical normalizing flows. It may be an interesting project to try and come up with novel normalizing flows by taking a favorite neural network architecture and using the function implemented by it as a flow. Additionally, it may be worth exploring boutique normalizing flows to improve variational inference in domain-specific settings (e.g., use a normalizing flow induced by a parsimonious convolutional neural network architecture for training an image-processing model using variational inference).<br />
<br />
== List of Figures == <br />
[[File:fig1.PNG]]<br />
[[File:fig2.PNG]]<br />
[[File:fig3.PNG]]<br />
<br />
==References==<br />
<br />
<references/></div>