http://wiki.math.uwaterloo.ca/statwiki/api.php?action=feedcontributions&user=F7xia&feedformat=atomstatwiki - User contributions [US]2020-04-04T07:11:37ZUser contributionsMediaWiki 1.28.3http://wiki.math.uwaterloo.ca/statwiki/index.php?title=One-Shot_Imitation_Learning&diff=36475One-Shot Imitation Learning2018-04-21T04:27:46Z<p>F7xia: /* Temporal Dropout */</p>
<hr />
<div>= Introduction =<br />
We are interested in robotic systems that are able to perform a variety of complex useful tasks. Robotic systems can be used for many applications, but to truly be useful for complex applications, they need to overcome 2 challenges: having the intent of the task at hand communicated to them, and being able to perform the manipulations necessary to complete this task. It is preferable to use demonstration to teach the robotic systems rather than natural language, as natural language may often fail to convey the details and intricacies required for the task. However, current work on learning from demonstrations is only successful with large amounts of feature engineering or a large number of demonstrations. The proposed model aims to achieve 'one-shot' imitation learning, ie. learning to complete a new task from just a single demonstration of it without any other supervision. As input, the proposed model takes the observation of the current instance of a task, and a demonstration of successfully solving a different instance of the same task. Strong generalization was achieved by using a soft attention mechanism on both the sequence of actions and states that the demonstration consists of, as well as on the vector of element locations within the environment. The success of this proposed model at completing a series of block stacking tasks can be viewed at http://bit.ly/nips2017-oneshot.<br />
<br />
= Related Work =<br />
While one-shot imitation learning is a novel combination of ideas, each of the components has previously been studied.<br />
* Imitation Learning: <br />
** Behavioural learning uses supervised learning to map from observations to actions (e.g. [https://papers.nips.cc/paper/95-alvinn-an-autonomous-land-vehicle-in-a-neural-network.pdf (Pomerleau 1988)], [https://arxiv.org/pdf/1011.0686.pdf (Ross et. al 2011)])<br />
** Inverse reinforcement learning estimates a reward function that considers demonstrations as optimal behavior (e.g. [http://ai.stanford.edu/~ang/papers/icml00-irl.pdf (Ng et. al 2000)])<br />
* One-Shot Learning: is an object categorization problem in computer vision. Whereas most machine learning based object categorization algorithms require training on hundreds or thousands of images and very large datasets, one-shot learning aims to learn information about object categories from one, or only a few , training images.<br />
** Typically a form of meta-learning<br />
** Previously used for variety of tasks but all domain-specific<br />
** [https://arxiv.org/abs/1703.03400 (Finn et al. 2017)] proposed a generic solution but excluded imitation learning<br />
* Reinforcement Learning:<br />
** Demonstrated to work on variety of tasks and environments, in particular on games and robotic control<br />
** Requires large amount of trials and a user-specified reward function<br />
* Multi-task/Transfer Learning:<br />
** Shown to be particularly effective at computer vision tasks<br />
** Not meant for one-shot learning<br />
* Attention Modelling:<br />
** The proposed model makes use of the attention model from [https://arxiv.org/abs/1409.0473 (Bahdanau et al. 2016)]<br />
** The attention modelling over demonstration is similar in nature to the seq2seq models from the well known [https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf (Sutskever et al. 2014)]<br />
<br />
= One-Shot Imitation Learning =<br />
<br />
[[File:oneshot1.jpg|1000px]]<br />
<br />
The figure above shows the differences between traditional and one-shot imitation learning. In a), the traditional method may require training different policies for performing similar tasks that are similar in nature. For example, stacking blocks to a height of 2 and to a height of 3. In b), the one-shot imitation learning allows the same policy to be used for these tasks given a single demonstration, achieving good performance without any additional system interactions. In c), the policy is trained by using a set of different training tasks, with enough examples so that the learned results can be generalized to other similar tasks. Each task has a set of successful demonstrations. Each iteration of training uses two demonstrations from a task, one is used as the input passing into the algorithm and the other is used at the output, the results from the two are then conditioned to produce the correct action.<br />
<br />
== Problem Formalization ==<br />
The problem is briefly formalized with the authors describing a distribution of tasks, an individual task, a distribution of demonstrations for this task, and a single demonstration respectively as \[T, \: t\sim T, \: D(t), \: d\sim D(t)\]<br />
In addition, an action, an observation, parameters, and a policy are respectively defined as \[a, o, \theta, \pi_\theta(a|o,d)\]<br />
In particular, a demonstration is a sequence of observation and action pairs \[d = [(o_1, a_1),(o_2, a_2), . . . ,(o_H , a_H )]\]<br />
Assuming that <math>H </math>, the length or horizon of a demonstration, and some evaluation function $$R_t(d): R^H \rightarrow R$$ are given, and that succesful demonstrations are available for each task, then the objective is to maximize expectation of the policy performance over \[t\sim T, d\sim D(t)\].<br />
<br />
== Block Stacking Tasks ==<br />
The tasks that the authors focus on is block stacking. A user specifies in what final configuration cubic blocks should be stacked, and the goal is to use a 7-DOF Fetch robotic arm to arrange the blocks in this configuration. The number of blocks, and their desired configuration (ie. number of towers, the height of each tower, and order of blocks within each tower) can be varied and encoded as a string. For example, 'abc def' would signify 2 towers of height 3, with block A on block B on block C in one tower, and block D on block E on block F in a second tower. To add complexity, the initial configuration of the blocks can vary and is encoded as a set of 3-dimensional vectors describing the position of each block relative to the robotic arm.<br />
<br />
== Algorithm ==<br />
To avoid needing to specify a reward function, the authors use behavioral cloning and DAGGER, 2 imitation learning methods that require only demonstrations, for training. In each training step, a list of tasks is sampled, and for each, a demonstration with injected noise along with some observation-action pairs are sampled. Given the current observation and demonstration as input, the policy is trained against the sampled actions by minimizing L2 norm for continuous actions, and cross-entropy for discrete ones. Adamax is used as the optimizer with a learning rate of 0.001.<br />
<br />
= Architecture =<br />
The authors propose a novel architecture for imitation learning, consisting of 3 networks.<br />
<br />
While, in principle, a generic neural network could learn the mapping from demonstration and current observation to appropriate action, the authors propose the following architecture which they claim as one of the main contributions of this paper, and believe it would be useful for complex tasks in the future.<br />
The proposed architecture consists of three modules: the demonstration network, the context network, and the manipulation network.<br />
<br />
[[File:oneshot2.jpg|1000px|center]]<br />
<br />
== Demonstration Network ==<br />
This network takes a demonstration as input and produces an embedding with size linearly proportional to the number of blocks and the size of the demonstration.<br />
=== Temporal Dropout ===<br />
Since a demonstration for block stacking can be very long, the authors randomly discard 95% of the time steps, a process they call 'temporal dropout'. The reduced size of the demonstrations allows multiple trajectories to be explored during testing to calculate an ensemble estimate. Dilated temporal convolutions and neighborhood attention are then repeatedly applied to the downsampled demonstrations. For block stacking project, the demonstrations can span hundreds to thousands of time<br />
steps, and training with such long sequences can be demanding in both time and memory usage. Hence, the author randomly discards a subset of time steps during training, such operation is called "temporal dropout". Denote p as the proportion of time steps that are thrown away (in this case p = 95%).<br />
<br />
=== Neighborhood Attention ===<br />
Since demonstration sizes can vary, a mechanism is needed that is not restricted to fixed-length inputs. While soft attention is one such mechanism, the problem with it is that there may be increasingly large amounts of information lost if soft attention is used to map longer demonstrations to the same fixed length as shorter demonstrations. As a solution, the authors propose having the same number of outputs as inputs, but with attention performed on other inputs relative to the current input.<br />
<br />
A query <math>q</math>, a list of context vectors <math>\{c_j\}</math>, and a list of memory vectors <math>\{m_j\}</math> are given as input to soft attention. Each attention weight is given by the product of a learned weight vector and a nonlinearity applied to the sum of the query and corresponding context vector. Softmaxed weights applied to the corresponding memory vector form the output of the soft attention.<br />
<br />
\[Inputs: q, \{c_j\}, \{m_j\}\]<br />
\[Weights: w_i \leftarrow v^Ttanh(q+c_i)\]<br />
\[Output: \sum_i{m_i\frac{\exp(w_i)}{\sum_j{\exp(w_j)}}}\]<br />
<br />
A list of same-length embeddings, coming from a previous neighbourhood attention layer or a projection from the list of block coordinates, is given as input to neighborhood attention. For each block, two separate linear layers produce a query vector and a context vector, while a memory vector is a list of tuples that describe the position of each block joined with the input embedding for that block. Soft attention is then performed on this query, context vector, and memory vector. The authors claim that the intuition behind this process is to allow each block to provide information about itself relative to the other blocks in the environment. Finally, for each block, a linear transformation is performed on the vector composed by concatenating the input embedding, the result of the soft attention for that block, and the robot's state.<br />
<br />
For an environment with B blocks:<br />
\[State: s\]<br />
\[Block_i: b_i \leftarrow (x_i, y_i, z_i)\]<br />
\[Embeddings: h_1^{in}, ..., h_B^{in}\] <br />
\[Query_i: q_i \leftarrow Linear(h_i^{in})\]<br />
\[Context_i: c_i \leftarrow Linear(h_i^{in})\]<br />
\[Memory_i: m_i \leftarrow (b_i, h_i^{in}) \]<br />
\[Result_i: result_i \leftarrow SoftAttn(q_i, \{c_j\}_{j=1}^B, \{m_k\}_{k=1}^B)\]<br />
\[Output_i: output_i \leftarrow Linear(concat(h_i^{in}, result_i, b_i, s))\]<br />
<br />
== Context network ==<br />
This network takes the current state and the embedding produced by the demonstration network as inputs and outputs a fixed-length "context embedding" which captures only the information relevant for the manipulation network at this particular step.<br />
=== Attention over demonstration ===<br />
The current state is used to compute a query vector which is then used for attending over all the steps of the embedding. Since at each time step there are multiple blocks, the weights for each are summed together to produce a scalar for each time step. Neighbourhood attention is then applied several times, using an LSTM with untied weights, since the information at each time steps needs to be propagated to each block's embedding. <br />
<br />
Performing attention over the demonstration yields a vector whose size is independent of the demonstration size; however, it is still dependent on the number of blocks in the environment, so it is natural to now attend over the state in order to get a fixed-length vector.<br />
=== Attention over current state ===<br />
The authors propose that in general, within each subtask, only a limited number of blocks are relevant for performing the subtask. If the subtask is to stack A on B, then intuitively, one would suppose that only block A and B are relevant, and perhaps any blocks that may be blocking access to either A or B. This is not enforced during training, but once soft attention is applied to the current state to produce a fixed-length context embedding, the authors believe that the model does indeed learn in this way.<br />
<br />
== Manipulation network ==<br />
Given the context embedding as input, this simple feedforward network decides on the particular action needed, to complete the subtask of stacking one particular 'source' block on top of another 'target' block. The manipulation network uses an MLP network. Since the network in the paper can only takes into account the source and target block it may take subobtimal paths. For example changing [ABC, D] to [C, ABD] can be done in one motion if it was possible to manipulate two blocks at once. The manipulation network is the simplest part of the network and leaves room to expand upon in future work.<br />
<br />
= Experiments = <br />
The proposed model was tested on the block stacking tasks. the experiments were designed at answering the following questions:<br />
* How does training with behavioral cloning compare with DAGGER?<br />
* How does conditioning on the entire demonstration compare to conditioning on the final state?<br />
* How does conditioning on the entire demonstration compare to conditioning on a “snapshot” of the trajectory?<br />
* Can the authors' framework generalize to tasks that it has never seen during training?<br />
For the experiments, 140 training tasks and 43 testing tasks were collected, each with between 2 to 10 blocks and a different, desired final layout. Over 1000 demonstrations for each task were collected using a hard-coded policy rather than a human user. The authors compare 4 different architectures in these experiments:<br />
* Behavioural cloning used to train the proposed model<br />
* DAGGER used to train the proposed model<br />
* The proposed model, trained with DAGGER, but conditioned on the desired final state rather than an entire demonstration<br />
* The proposed model, trained with DAGGER, but conditioned on a 'snapshot' of the environment at the end of each subtask (ie. every time a block is stacked on another block)<br />
<br />
== Performance Evaluation ==<br />
[[File:oneshot3.jpg|1000px]]<br />
<br />
The most confident action at each timestep is chosen in 100 different task configurations, and results are averaged over tasks that had the same number of blocks. The results suggest that the performance of each of the architectures is comparable to that of the hard-coded policy which they aim to imitate. Performance degrades similarly across all architectures and the hard-coded policy as the number of blocks increases. On the harder tasks, conditioning on the entire demonstration led to better performance than conditioning on snapshots or on the final state. The authors believe that this may be due to the lack of information when conditioning only on the final state as well as due to regularization caused by temporal dropout which leads to data augmentation when conditioning on the full demonstration but is omitted when conditioning only on the snapshots or final state. Both DAGGER and behavioral cloning performed comparably well. As mentioned above, noise injection was used in training to improve performance; in practice, additional noise can still be injected but some may already come from other sources.<br />
<br />
== Visualization ==<br />
The authors visualize the attention mechanisms underlying the main policy architecture to have a better understanding about how it operates. There are two kinds of attention that the authors are mainly interested in, one where the policy attends to different time steps in the demonstration, and the other where the policy attends to different blocks in the current state. The figures below show some of the policy attention heatmaps over time.<br />
<br />
[[File:paper6_Visualization.png|800px]]<br />
<br />
= Conclusions =<br />
The proposed model successfully learns to complete new instances of a new task from just a single demonstration. The model was demonstrated to work on a series of block stacking tasks. The authors propose several extensions including enabling few-shot learning when one demonstration is insufficient, using image data as the demonstrations, and attempting many other tasks aside from block stacking.<br />
<br />
= Criticisms =<br />
While the paper shows an incredibly impressive result: the ability to learn a new task from just a single demonstration, there are a few points that need clearing up.<br />
Firstly, the authors use a hard-coded policy in their experiments rather than a human. It is clear that the performance of this policy begins to degrade quickly as the complexity of the task increases. It would be useful to know what this hard-coded policy actually was, and if the proposed model could still have comparable performance if a more successful demonstration, perhaps one by a human user, were performed. Give the current popularity of adversarial examples, it would also be interesting to see the performance when conditioned on an "adversarial" demonstration, that achieves the correct final state, but intentionally performs complex or obfuscated steps to get there.<br />
Second, it would be useful to see the model's performance on a more complex family of tasks than block stacking, since although each block stacking task is slightly different, the differences may turn out be insignificant compared to other tasks that this model should work on if it is to be a general imitation learning architecture; intuitively, the space of all possible moves and configurations is not large for the task. Also it is a bit misleading as there seems to be a need for more demonstrations to first get a reasonable policy that can generalize, leading to generic policy and then use just one demonstration on a new task expecting the policy to generalize. So it seems there is some sort of pre-training involved here. Regardless, this work is a big step forward for imitation learning, permitting a wider range of tasks for which there is little training data and no reward function available, to still be successfully solved.<br />
<br />
= Illustrative Example: Particle Reaching =<br />
<br />
[[File:f1.png]]<br />
<br />
Figure 1: [Left] Agent, [Middle] Orange square is target, [Right] Green triangle is target [2].<br />
<br />
Another simple yet insightful example of the One-Shot Imitation Learning is the particle reaching problem which provides a relatively simple suite of tasks from which the network needs to solve an arbitrary one. The problem is formulated such that for each task: there is an agent which can move based on a 2D force vector, and n landmarks at varying 2D locations (n varies from task to task) with the goal of moving the agent to the specific landmark reached in the demonstration. This is illustrated in Figure 1. <br />
<br />
[[File:f2.png|450px]]<br />
<br />
Figure 2: Experimental results [2].<br />
<br />
Some insight comes from the use of different network architectures to solve this problem. The three architectures to compare (described below) are plain LSTM, LSTM with attention, and final state with attention. The key insight is that the architectures go from generic to specific, with the best generalization performance achieved with the most specific architecture, final state with attention, as seen in Figure 2. It is important to note that this conclusion does not carry forward to more complicated tasks such as the block stacking task.<br />
*Plain LSTM: 512 hidden units, with the input being the demonstration trajectory (the position of the agent changes over time and approaches one of the targets). Output of the LSTM with the current state (from the task needed to be solved) is the input for a multi-layer perceptron (MLP) for finding the solution.<br />
*LSTM with attention: Output of LSTM is now a set of weights for the different targets during training. These weights and the test state are used in the test task. The, now, 2D output is the input for an MLP as before.<br />
*Final state with attention: Looks only at the final state of the demonstration since it can sufficiently provide the needed detail of which target to reach (trajectory is not required). Similar to previous architecture, produces weights used by MLP.<br />
<br />
= Source =<br />
# Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).<br />
# Duan, Yan, Marcin Andrychowicz, Bradly Stadie, OpenAI Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. "One-shot imitation learning." In Advances in neural information processing systems, pp. 1087-1098. 2017.<br />
# Y. Duan, M. Andrychowicz, B. Stadie, J. Ho, J. Schneider, I. Sutskever, P. Abbeel, and W. Zaremba. One-shot imitation learning. arXiv preprint arXiv:1703.07326, 2017. (Newer revision)<br />
# Finn, Chelsea, Pieter Abbeel, and Sergey Levine. "Model-agnostic meta-learning for fast adaptation of deep networks." arXiv preprint arXiv:1703.03400 (2017).<br />
# Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances in neural information processing systems. 2014.</div>F7xiahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=One-Shot_Imitation_Learning&diff=36474One-Shot Imitation Learning2018-04-21T04:26:52Z<p>F7xia: /* Algorithm */</p>
<hr />
<div>= Introduction =<br />
We are interested in robotic systems that are able to perform a variety of complex useful tasks. Robotic systems can be used for many applications, but to truly be useful for complex applications, they need to overcome 2 challenges: having the intent of the task at hand communicated to them, and being able to perform the manipulations necessary to complete this task. It is preferable to use demonstration to teach the robotic systems rather than natural language, as natural language may often fail to convey the details and intricacies required for the task. However, current work on learning from demonstrations is only successful with large amounts of feature engineering or a large number of demonstrations. The proposed model aims to achieve 'one-shot' imitation learning, ie. learning to complete a new task from just a single demonstration of it without any other supervision. As input, the proposed model takes the observation of the current instance of a task, and a demonstration of successfully solving a different instance of the same task. Strong generalization was achieved by using a soft attention mechanism on both the sequence of actions and states that the demonstration consists of, as well as on the vector of element locations within the environment. The success of this proposed model at completing a series of block stacking tasks can be viewed at http://bit.ly/nips2017-oneshot.<br />
<br />
= Related Work =<br />
While one-shot imitation learning is a novel combination of ideas, each of the components has previously been studied.<br />
* Imitation Learning: <br />
** Behavioural learning uses supervised learning to map from observations to actions (e.g. [https://papers.nips.cc/paper/95-alvinn-an-autonomous-land-vehicle-in-a-neural-network.pdf (Pomerleau 1988)], [https://arxiv.org/pdf/1011.0686.pdf (Ross et. al 2011)])<br />
** Inverse reinforcement learning estimates a reward function that considers demonstrations as optimal behavior (e.g. [http://ai.stanford.edu/~ang/papers/icml00-irl.pdf (Ng et. al 2000)])<br />
* One-Shot Learning: is an object categorization problem in computer vision. Whereas most machine learning based object categorization algorithms require training on hundreds or thousands of images and very large datasets, one-shot learning aims to learn information about object categories from one, or only a few , training images.<br />
** Typically a form of meta-learning<br />
** Previously used for variety of tasks but all domain-specific<br />
** [https://arxiv.org/abs/1703.03400 (Finn et al. 2017)] proposed a generic solution but excluded imitation learning<br />
* Reinforcement Learning:<br />
** Demonstrated to work on variety of tasks and environments, in particular on games and robotic control<br />
** Requires large amount of trials and a user-specified reward function<br />
* Multi-task/Transfer Learning:<br />
** Shown to be particularly effective at computer vision tasks<br />
** Not meant for one-shot learning<br />
* Attention Modelling:<br />
** The proposed model makes use of the attention model from [https://arxiv.org/abs/1409.0473 (Bahdanau et al. 2016)]<br />
** The attention modelling over demonstration is similar in nature to the seq2seq models from the well known [https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf (Sutskever et al. 2014)]<br />
<br />
= One-Shot Imitation Learning =<br />
<br />
[[File:oneshot1.jpg|1000px]]<br />
<br />
The figure above shows the differences between traditional and one-shot imitation learning. In a), the traditional method may require training different policies for performing similar tasks that are similar in nature. For example, stacking blocks to a height of 2 and to a height of 3. In b), the one-shot imitation learning allows the same policy to be used for these tasks given a single demonstration, achieving good performance without any additional system interactions. In c), the policy is trained by using a set of different training tasks, with enough examples so that the learned results can be generalized to other similar tasks. Each task has a set of successful demonstrations. Each iteration of training uses two demonstrations from a task, one is used as the input passing into the algorithm and the other is used at the output, the results from the two are then conditioned to produce the correct action.<br />
<br />
== Problem Formalization ==<br />
The problem is briefly formalized with the authors describing a distribution of tasks, an individual task, a distribution of demonstrations for this task, and a single demonstration respectively as \[T, \: t\sim T, \: D(t), \: d\sim D(t)\]<br />
In addition, an action, an observation, parameters, and a policy are respectively defined as \[a, o, \theta, \pi_\theta(a|o,d)\]<br />
In particular, a demonstration is a sequence of observation and action pairs \[d = [(o_1, a_1),(o_2, a_2), . . . ,(o_H , a_H )]\]<br />
Assuming that <math>H </math>, the length or horizon of a demonstration, and some evaluation function $$R_t(d): R^H \rightarrow R$$ are given, and that succesful demonstrations are available for each task, then the objective is to maximize expectation of the policy performance over \[t\sim T, d\sim D(t)\].<br />
<br />
== Block Stacking Tasks ==<br />
The tasks that the authors focus on is block stacking. A user specifies in what final configuration cubic blocks should be stacked, and the goal is to use a 7-DOF Fetch robotic arm to arrange the blocks in this configuration. The number of blocks, and their desired configuration (ie. number of towers, the height of each tower, and order of blocks within each tower) can be varied and encoded as a string. For example, 'abc def' would signify 2 towers of height 3, with block A on block B on block C in one tower, and block D on block E on block F in a second tower. To add complexity, the initial configuration of the blocks can vary and is encoded as a set of 3-dimensional vectors describing the position of each block relative to the robotic arm.<br />
<br />
== Algorithm ==<br />
To avoid needing to specify a reward function, the authors use behavioral cloning and DAGGER, 2 imitation learning methods that require only demonstrations, for training. In each training step, a list of tasks is sampled, and for each, a demonstration with injected noise along with some observation-action pairs are sampled. Given the current observation and demonstration as input, the policy is trained against the sampled actions by minimizing L2 norm for continuous actions, and cross-entropy for discrete ones. Adamax is used as the optimizer with a learning rate of 0.001.<br />
<br />
= Architecture =<br />
The authors propose a novel architecture for imitation learning, consisting of 3 networks.<br />
<br />
While, in principle, a generic neural network could learn the mapping from demonstration and current observation to appropriate action, the authors propose the following architecture which they claim as one of the main contributions of this paper, and believe it would be useful for complex tasks in the future.<br />
The proposed architecture consists of three modules: the demonstration network, the context network, and the manipulation network.<br />
<br />
[[File:oneshot2.jpg|1000px|center]]<br />
<br />
== Demonstration Network ==<br />
This network takes a demonstration as input and produces an embedding with size linearly proportional to the number of blocks and the size of the demonstration.<br />
=== Temporal Dropout ===<br />
Since a demonstration for block stacking can be very long, the authors randomly discard 95% of the time steps, a process they call 'temporal dropout'. The reduced size of the demonstrations allows multiple trajectories to be explored during testing to calculate an ensemble estimate. Dilated temporal convolutions and neighborhood attention are then repeatedly applied to the downsampled demonstrations. For block stacking project, the demonstrations can span hundreds to thousands of time<br />
steps, and training with such long sequences can be demanding in both time and memory usage. Hence, the author randomly discard a subset of time steps during training, such operation is called "temporal dropout". Denote p as the proportion of time steps that are thrown away (in this case p = 95%).<br />
<br />
=== Neighborhood Attention ===<br />
Since demonstration sizes can vary, a mechanism is needed that is not restricted to fixed-length inputs. While soft attention is one such mechanism, the problem with it is that there may be increasingly large amounts of information lost if soft attention is used to map longer demonstrations to the same fixed length as shorter demonstrations. As a solution, the authors propose having the same number of outputs as inputs, but with attention performed on other inputs relative to the current input.<br />
<br />
A query <math>q</math>, a list of context vectors <math>\{c_j\}</math>, and a list of memory vectors <math>\{m_j\}</math> are given as input to soft attention. Each attention weight is given by the product of a learned weight vector and a nonlinearity applied to the sum of the query and corresponding context vector. Softmaxed weights applied to the corresponding memory vector form the output of the soft attention.<br />
<br />
\[Inputs: q, \{c_j\}, \{m_j\}\]<br />
\[Weights: w_i \leftarrow v^Ttanh(q+c_i)\]<br />
\[Output: \sum_i{m_i\frac{\exp(w_i)}{\sum_j{\exp(w_j)}}}\]<br />
<br />
A list of same-length embeddings, coming from a previous neighbourhood attention layer or a projection from the list of block coordinates, is given as input to neighborhood attention. For each block, two separate linear layers produce a query vector and a context vector, while a memory vector is a list of tuples that describe the position of each block joined with the input embedding for that block. Soft attention is then performed on this query, context vector, and memory vector. The authors claim that the intuition behind this process is to allow each block to provide information about itself relative to the other blocks in the environment. Finally, for each block, a linear transformation is performed on the vector composed by concatenating the input embedding, the result of the soft attention for that block, and the robot's state.<br />
<br />
For an environment with B blocks:<br />
\[State: s\]<br />
\[Block_i: b_i \leftarrow (x_i, y_i, z_i)\]<br />
\[Embeddings: h_1^{in}, ..., h_B^{in}\] <br />
\[Query_i: q_i \leftarrow Linear(h_i^{in})\]<br />
\[Context_i: c_i \leftarrow Linear(h_i^{in})\]<br />
\[Memory_i: m_i \leftarrow (b_i, h_i^{in}) \]<br />
\[Result_i: result_i \leftarrow SoftAttn(q_i, \{c_j\}_{j=1}^B, \{m_k\}_{k=1}^B)\]<br />
\[Output_i: output_i \leftarrow Linear(concat(h_i^{in}, result_i, b_i, s))\]<br />
<br />
== Context network ==<br />
This network takes the current state and the embedding produced by the demonstration network as inputs and outputs a fixed-length "context embedding" which captures only the information relevant for the manipulation network at this particular step.<br />
=== Attention over demonstration ===<br />
The current state is used to compute a query vector which is then used for attending over all the steps of the embedding. Since at each time step there are multiple blocks, the weights for each are summed together to produce a scalar for each time step. Neighbourhood attention is then applied several times, using an LSTM with untied weights, since the information at each time steps needs to be propagated to each block's embedding. <br />
<br />
Performing attention over the demonstration yields a vector whose size is independent of the demonstration size; however, it is still dependent on the number of blocks in the environment, so it is natural to now attend over the state in order to get a fixed-length vector.<br />
=== Attention over current state ===<br />
The authors propose that in general, within each subtask, only a limited number of blocks are relevant for performing the subtask. If the subtask is to stack A on B, then intuitively, one would suppose that only block A and B are relevant, and perhaps any blocks that may be blocking access to either A or B. This is not enforced during training, but once soft attention is applied to the current state to produce a fixed-length context embedding, the authors believe that the model does indeed learn in this way.<br />
<br />
== Manipulation network ==<br />
Given the context embedding as input, this simple feedforward network decides on the particular action needed, to complete the subtask of stacking one particular 'source' block on top of another 'target' block. The manipulation network uses an MLP network. Since the network in the paper can only takes into account the source and target block it may take subobtimal paths. For example changing [ABC, D] to [C, ABD] can be done in one motion if it was possible to manipulate two blocks at once. The manipulation network is the simplest part of the network and leaves room to expand upon in future work.<br />
<br />
= Experiments = <br />
The proposed model was tested on the block stacking tasks. the experiments were designed at answering the following questions:<br />
* How does training with behavioral cloning compare with DAGGER?<br />
* How does conditioning on the entire demonstration compare to conditioning on the final state?<br />
* How does conditioning on the entire demonstration compare to conditioning on a “snapshot” of the trajectory?<br />
* Can the authors' framework generalize to tasks that it has never seen during training?<br />
For the experiments, 140 training tasks and 43 testing tasks were collected, each with between 2 to 10 blocks and a different, desired final layout. Over 1000 demonstrations for each task were collected using a hard-coded policy rather than a human user. The authors compare 4 different architectures in these experiments:<br />
* Behavioural cloning used to train the proposed model<br />
* DAGGER used to train the proposed model<br />
* The proposed model, trained with DAGGER, but conditioned on the desired final state rather than an entire demonstration<br />
* The proposed model, trained with DAGGER, but conditioned on a 'snapshot' of the environment at the end of each subtask (ie. every time a block is stacked on another block)<br />
<br />
== Performance Evaluation ==<br />
[[File:oneshot3.jpg|1000px]]<br />
<br />
The most confident action at each timestep is chosen in 100 different task configurations, and results are averaged over tasks that had the same number of blocks. The results suggest that the performance of each of the architectures is comparable to that of the hard-coded policy which they aim to imitate. Performance degrades similarly across all architectures and the hard-coded policy as the number of blocks increases. On the harder tasks, conditioning on the entire demonstration led to better performance than conditioning on snapshots or on the final state. The authors believe that this may be due to the lack of information when conditioning only on the final state as well as due to regularization caused by temporal dropout which leads to data augmentation when conditioning on the full demonstration but is omitted when conditioning only on the snapshots or final state. Both DAGGER and behavioral cloning performed comparably well. As mentioned above, noise injection was used in training to improve performance; in practice, additional noise can still be injected but some may already come from other sources.<br />
<br />
== Visualization ==<br />
The authors visualize the attention mechanisms underlying the main policy architecture to have a better understanding about how it operates. There are two kinds of attention that the authors are mainly interested in, one where the policy attends to different time steps in the demonstration, and the other where the policy attends to different blocks in the current state. The figures below show some of the policy attention heatmaps over time.<br />
<br />
[[File:paper6_Visualization.png|800px]]<br />
<br />
= Conclusions =<br />
The proposed model successfully learns to complete new instances of a new task from just a single demonstration. The model was demonstrated to work on a series of block stacking tasks. The authors propose several extensions including enabling few-shot learning when one demonstration is insufficient, using image data as the demonstrations, and attempting many other tasks aside from block stacking.<br />
<br />
= Criticisms =<br />
While the paper shows an incredibly impressive result: the ability to learn a new task from just a single demonstration, there are a few points that need clearing up.<br />
Firstly, the authors use a hard-coded policy in their experiments rather than a human. It is clear that the performance of this policy begins to degrade quickly as the complexity of the task increases. It would be useful to know what this hard-coded policy actually was, and if the proposed model could still have comparable performance if a more successful demonstration, perhaps one by a human user, were performed. Give the current popularity of adversarial examples, it would also be interesting to see the performance when conditioned on an "adversarial" demonstration, that achieves the correct final state, but intentionally performs complex or obfuscated steps to get there.<br />
Second, it would be useful to see the model's performance on a more complex family of tasks than block stacking, since although each block stacking task is slightly different, the differences may turn out be insignificant compared to other tasks that this model should work on if it is to be a general imitation learning architecture; intuitively, the space of all possible moves and configurations is not large for the task. Also it is a bit misleading as there seems to be a need for more demonstrations to first get a reasonable policy that can generalize, leading to generic policy and then use just one demonstration on a new task expecting the policy to generalize. So it seems there is some sort of pre-training involved here. Regardless, this work is a big step forward for imitation learning, permitting a wider range of tasks for which there is little training data and no reward function available, to still be successfully solved.<br />
<br />
= Illustrative Example: Particle Reaching =<br />
<br />
[[File:f1.png]]<br />
<br />
Figure 1: [Left] Agent, [Middle] Orange square is target, [Right] Green triangle is target [2].<br />
<br />
Another simple yet insightful example of the One-Shot Imitation Learning is the particle reaching problem which provides a relatively simple suite of tasks from which the network needs to solve an arbitrary one. The problem is formulated such that for each task: there is an agent which can move based on a 2D force vector, and n landmarks at varying 2D locations (n varies from task to task) with the goal of moving the agent to the specific landmark reached in the demonstration. This is illustrated in Figure 1. <br />
<br />
[[File:f2.png|450px]]<br />
<br />
Figure 2: Experimental results [2].<br />
<br />
Some insight comes from the use of different network architectures to solve this problem. The three architectures to compare (described below) are plain LSTM, LSTM with attention, and final state with attention. The key insight is that the architectures go from generic to specific, with the best generalization performance achieved with the most specific architecture, final state with attention, as seen in Figure 2. It is important to note that this conclusion does not carry forward to more complicated tasks such as the block stacking task.<br />
*Plain LSTM: 512 hidden units, with the input being the demonstration trajectory (the position of the agent changes over time and approaches one of the targets). Output of the LSTM with the current state (from the task needed to be solved) is the input for a multi-layer perceptron (MLP) for finding the solution.<br />
*LSTM with attention: Output of LSTM is now a set of weights for the different targets during training. These weights and the test state are used in the test task. The, now, 2D output is the input for an MLP as before.<br />
*Final state with attention: Looks only at the final state of the demonstration since it can sufficiently provide the needed detail of which target to reach (trajectory is not required). Similar to previous architecture, produces weights used by MLP.<br />
<br />
= Source =<br />
# Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).<br />
# Duan, Yan, Marcin Andrychowicz, Bradly Stadie, OpenAI Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. "One-shot imitation learning." In Advances in neural information processing systems, pp. 1087-1098. 2017.<br />
# Y. Duan, M. Andrychowicz, B. Stadie, J. Ho, J. Schneider, I. Sutskever, P. Abbeel, and W. Zaremba. One-shot imitation learning. arXiv preprint arXiv:1703.07326, 2017. (Newer revision)<br />
# Finn, Chelsea, Pieter Abbeel, and Sergey Levine. "Model-agnostic meta-learning for fast adaptation of deep networks." arXiv preprint arXiv:1703.03400 (2017).<br />
# Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances in neural information processing systems. 2014.</div>F7xiahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=One-Shot_Imitation_Learning&diff=36473One-Shot Imitation Learning2018-04-21T04:26:44Z<p>F7xia: /* Algorithm */</p>
<hr />
<div>= Introduction =<br />
We are interested in robotic systems that are able to perform a variety of complex useful tasks. Robotic systems can be used for many applications, but to truly be useful for complex applications, they need to overcome 2 challenges: having the intent of the task at hand communicated to them, and being able to perform the manipulations necessary to complete this task. It is preferable to use demonstration to teach the robotic systems rather than natural language, as natural language may often fail to convey the details and intricacies required for the task. However, current work on learning from demonstrations is only successful with large amounts of feature engineering or a large number of demonstrations. The proposed model aims to achieve 'one-shot' imitation learning, ie. learning to complete a new task from just a single demonstration of it without any other supervision. As input, the proposed model takes the observation of the current instance of a task, and a demonstration of successfully solving a different instance of the same task. Strong generalization was achieved by using a soft attention mechanism on both the sequence of actions and states that the demonstration consists of, as well as on the vector of element locations within the environment. The success of this proposed model at completing a series of block stacking tasks can be viewed at http://bit.ly/nips2017-oneshot.<br />
<br />
= Related Work =<br />
While one-shot imitation learning is a novel combination of ideas, each of the components has previously been studied.<br />
* Imitation Learning: <br />
** Behavioural learning uses supervised learning to map from observations to actions (e.g. [https://papers.nips.cc/paper/95-alvinn-an-autonomous-land-vehicle-in-a-neural-network.pdf (Pomerleau 1988)], [https://arxiv.org/pdf/1011.0686.pdf (Ross et. al 2011)])<br />
** Inverse reinforcement learning estimates a reward function that considers demonstrations as optimal behavior (e.g. [http://ai.stanford.edu/~ang/papers/icml00-irl.pdf (Ng et. al 2000)])<br />
* One-Shot Learning: is an object categorization problem in computer vision. Whereas most machine learning based object categorization algorithms require training on hundreds or thousands of images and very large datasets, one-shot learning aims to learn information about object categories from one, or only a few , training images.<br />
** Typically a form of meta-learning<br />
** Previously used for variety of tasks but all domain-specific<br />
** [https://arxiv.org/abs/1703.03400 (Finn et al. 2017)] proposed a generic solution but excluded imitation learning<br />
* Reinforcement Learning:<br />
** Demonstrated to work on variety of tasks and environments, in particular on games and robotic control<br />
** Requires large amount of trials and a user-specified reward function<br />
* Multi-task/Transfer Learning:<br />
** Shown to be particularly effective at computer vision tasks<br />
** Not meant for one-shot learning<br />
* Attention Modelling:<br />
** The proposed model makes use of the attention model from [https://arxiv.org/abs/1409.0473 (Bahdanau et al. 2016)]<br />
** The attention modelling over demonstration is similar in nature to the seq2seq models from the well known [https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf (Sutskever et al. 2014)]<br />
<br />
= One-Shot Imitation Learning =<br />
<br />
[[File:oneshot1.jpg|1000px]]<br />
<br />
The figure above shows the differences between traditional and one-shot imitation learning. In a), the traditional method may require training different policies for performing similar tasks that are similar in nature. For example, stacking blocks to a height of 2 and to a height of 3. In b), the one-shot imitation learning allows the same policy to be used for these tasks given a single demonstration, achieving good performance without any additional system interactions. In c), the policy is trained by using a set of different training tasks, with enough examples so that the learned results can be generalized to other similar tasks. Each task has a set of successful demonstrations. Each iteration of training uses two demonstrations from a task, one is used as the input passing into the algorithm and the other is used at the output, the results from the two are then conditioned to produce the correct action.<br />
<br />
== Problem Formalization ==<br />
The problem is briefly formalized with the authors describing a distribution of tasks, an individual task, a distribution of demonstrations for this task, and a single demonstration respectively as \[T, \: t\sim T, \: D(t), \: d\sim D(t)\]<br />
In addition, an action, an observation, parameters, and a policy are respectively defined as \[a, o, \theta, \pi_\theta(a|o,d)\]<br />
In particular, a demonstration is a sequence of observation and action pairs \[d = [(o_1, a_1),(o_2, a_2), . . . ,(o_H , a_H )]\]<br />
Assuming that <math>H </math>, the length or horizon of a demonstration, and some evaluation function $$R_t(d): R^H \rightarrow R$$ are given, and that succesful demonstrations are available for each task, then the objective is to maximize expectation of the policy performance over \[t\sim T, d\sim D(t)\].<br />
<br />
== Block Stacking Tasks ==<br />
The tasks that the authors focus on is block stacking. A user specifies in what final configuration cubic blocks should be stacked, and the goal is to use a 7-DOF Fetch robotic arm to arrange the blocks in this configuration. The number of blocks, and their desired configuration (ie. number of towers, the height of each tower, and order of blocks within each tower) can be varied and encoded as a string. For example, 'abc def' would signify 2 towers of height 3, with block A on block B on block C in one tower, and block D on block E on block F in a second tower. To add complexity, the initial configuration of the blocks can vary and is encoded as a set of 3-dimensional vectors describing the position of each block relative to the robotic arm.<br />
<br />
== Algorithm ==<br />
<br />
= Architecture =<br />
The authors propose a novel architecture for imitation learning, consisting of 3 networks.<br />
<br />
While, in principle, a generic neural network could learn the mapping from demonstration and current observation to appropriate action, the authors propose the following architecture which they claim as one of the main contributions of this paper, and believe it would be useful for complex tasks in the future.<br />
The proposed architecture consists of three modules: the demonstration network, the context network, and the manipulation network.<br />
<br />
[[File:oneshot2.jpg|1000px|center]]<br />
<br />
== Demonstration Network ==<br />
This network takes a demonstration as input and produces an embedding with size linearly proportional to the number of blocks and the size of the demonstration.<br />
=== Temporal Dropout ===<br />
Since a demonstration for block stacking can be very long, the authors randomly discard 95% of the time steps, a process they call 'temporal dropout'. The reduced size of the demonstrations allows multiple trajectories to be explored during testing to calculate an ensemble estimate. Dilated temporal convolutions and neighborhood attention are then repeatedly applied to the downsampled demonstrations. For block stacking project, the demonstrations can span hundreds to thousands of time<br />
steps, and training with such long sequences can be demanding in both time and memory usage. Hence, the author randomly discard a subset of time steps during training, such operation is called "temporal dropout". Denote p as the proportion of time steps that are thrown away (in this case p = 95%).<br />
<br />
=== Neighborhood Attention ===<br />
Since demonstration sizes can vary, a mechanism is needed that is not restricted to fixed-length inputs. While soft attention is one such mechanism, the problem with it is that there may be increasingly large amounts of information lost if soft attention is used to map longer demonstrations to the same fixed length as shorter demonstrations. As a solution, the authors propose having the same number of outputs as inputs, but with attention performed on other inputs relative to the current input.<br />
<br />
A query <math>q</math>, a list of context vectors <math>\{c_j\}</math>, and a list of memory vectors <math>\{m_j\}</math> are given as input to soft attention. Each attention weight is given by the product of a learned weight vector and a nonlinearity applied to the sum of the query and corresponding context vector. Softmaxed weights applied to the corresponding memory vector form the output of the soft attention.<br />
<br />
\[Inputs: q, \{c_j\}, \{m_j\}\]<br />
\[Weights: w_i \leftarrow v^Ttanh(q+c_i)\]<br />
\[Output: \sum_i{m_i\frac{\exp(w_i)}{\sum_j{\exp(w_j)}}}\]<br />
<br />
A list of same-length embeddings, coming from a previous neighbourhood attention layer or a projection from the list of block coordinates, is given as input to neighborhood attention. For each block, two separate linear layers produce a query vector and a context vector, while a memory vector is a list of tuples that describe the position of each block joined with the input embedding for that block. Soft attention is then performed on this query, context vector, and memory vector. The authors claim that the intuition behind this process is to allow each block to provide information about itself relative to the other blocks in the environment. Finally, for each block, a linear transformation is performed on the vector composed by concatenating the input embedding, the result of the soft attention for that block, and the robot's state.<br />
<br />
For an environment with B blocks:<br />
\[State: s\]<br />
\[Block_i: b_i \leftarrow (x_i, y_i, z_i)\]<br />
\[Embeddings: h_1^{in}, ..., h_B^{in}\] <br />
\[Query_i: q_i \leftarrow Linear(h_i^{in})\]<br />
\[Context_i: c_i \leftarrow Linear(h_i^{in})\]<br />
\[Memory_i: m_i \leftarrow (b_i, h_i^{in}) \]<br />
\[Result_i: result_i \leftarrow SoftAttn(q_i, \{c_j\}_{j=1}^B, \{m_k\}_{k=1}^B)\]<br />
\[Output_i: output_i \leftarrow Linear(concat(h_i^{in}, result_i, b_i, s))\]<br />
<br />
== Context network ==<br />
This network takes the current state and the embedding produced by the demonstration network as inputs and outputs a fixed-length "context embedding" which captures only the information relevant for the manipulation network at this particular step.<br />
=== Attention over demonstration ===<br />
The current state is used to compute a query vector which is then used for attending over all the steps of the embedding. Since at each time step there are multiple blocks, the weights for each are summed together to produce a scalar for each time step. Neighbourhood attention is then applied several times, using an LSTM with untied weights, since the information at each time steps needs to be propagated to each block's embedding. <br />
<br />
Performing attention over the demonstration yields a vector whose size is independent of the demonstration size; however, it is still dependent on the number of blocks in the environment, so it is natural to now attend over the state in order to get a fixed-length vector.<br />
=== Attention over current state ===<br />
The authors propose that in general, within each subtask, only a limited number of blocks are relevant for performing the subtask. If the subtask is to stack A on B, then intuitively, one would suppose that only block A and B are relevant, and perhaps any blocks that may be blocking access to either A or B. This is not enforced during training, but once soft attention is applied to the current state to produce a fixed-length context embedding, the authors believe that the model does indeed learn in this way.<br />
<br />
== Manipulation network ==<br />
Given the context embedding as input, this simple feedforward network decides on the particular action needed, to complete the subtask of stacking one particular 'source' block on top of another 'target' block. The manipulation network uses an MLP network. Since the network in the paper can only takes into account the source and target block it may take subobtimal paths. For example changing [ABC, D] to [C, ABD] can be done in one motion if it was possible to manipulate two blocks at once. The manipulation network is the simplest part of the network and leaves room to expand upon in future work.<br />
<br />
= Experiments = <br />
The proposed model was tested on the block stacking tasks. the experiments were designed at answering the following questions:<br />
* How does training with behavioral cloning compare with DAGGER?<br />
* How does conditioning on the entire demonstration compare to conditioning on the final state?<br />
* How does conditioning on the entire demonstration compare to conditioning on a “snapshot” of the trajectory?<br />
* Can the authors' framework generalize to tasks that it has never seen during training?<br />
For the experiments, 140 training tasks and 43 testing tasks were collected, each with between 2 to 10 blocks and a different, desired final layout. Over 1000 demonstrations for each task were collected using a hard-coded policy rather than a human user. The authors compare 4 different architectures in these experiments:<br />
* Behavioural cloning used to train the proposed model<br />
* DAGGER used to train the proposed model<br />
* The proposed model, trained with DAGGER, but conditioned on the desired final state rather than an entire demonstration<br />
* The proposed model, trained with DAGGER, but conditioned on a 'snapshot' of the environment at the end of each subtask (ie. every time a block is stacked on another block)<br />
<br />
== Performance Evaluation ==<br />
[[File:oneshot3.jpg|1000px]]<br />
<br />
The most confident action at each timestep is chosen in 100 different task configurations, and results are averaged over tasks that had the same number of blocks. The results suggest that the performance of each of the architectures is comparable to that of the hard-coded policy which they aim to imitate. Performance degrades similarly across all architectures and the hard-coded policy as the number of blocks increases. On the harder tasks, conditioning on the entire demonstration led to better performance than conditioning on snapshots or on the final state. The authors believe that this may be due to the lack of information when conditioning only on the final state as well as due to regularization caused by temporal dropout which leads to data augmentation when conditioning on the full demonstration but is omitted when conditioning only on the snapshots or final state. Both DAGGER and behavioral cloning performed comparably well. As mentioned above, noise injection was used in training to improve performance; in practice, additional noise can still be injected but some may already come from other sources.<br />
<br />
== Visualization ==<br />
The authors visualize the attention mechanisms underlying the main policy architecture to have a better understanding about how it operates. There are two kinds of attention that the authors are mainly interested in, one where the policy attends to different time steps in the demonstration, and the other where the policy attends to different blocks in the current state. The figures below show some of the policy attention heatmaps over time.<br />
<br />
[[File:paper6_Visualization.png|800px]]<br />
<br />
= Conclusions =<br />
The proposed model successfully learns to complete new instances of a new task from just a single demonstration. The model was demonstrated to work on a series of block stacking tasks. The authors propose several extensions including enabling few-shot learning when one demonstration is insufficient, using image data as the demonstrations, and attempting many other tasks aside from block stacking.<br />
<br />
= Criticisms =<br />
While the paper shows an incredibly impressive result: the ability to learn a new task from just a single demonstration, there are a few points that need clearing up.<br />
Firstly, the authors use a hard-coded policy in their experiments rather than a human. It is clear that the performance of this policy begins to degrade quickly as the complexity of the task increases. It would be useful to know what this hard-coded policy actually was, and if the proposed model could still have comparable performance if a more successful demonstration, perhaps one by a human user, were performed. Give the current popularity of adversarial examples, it would also be interesting to see the performance when conditioned on an "adversarial" demonstration, that achieves the correct final state, but intentionally performs complex or obfuscated steps to get there.<br />
Second, it would be useful to see the model's performance on a more complex family of tasks than block stacking, since although each block stacking task is slightly different, the differences may turn out be insignificant compared to other tasks that this model should work on if it is to be a general imitation learning architecture; intuitively, the space of all possible moves and configurations is not large for the task. Also it is a bit misleading as there seems to be a need for more demonstrations to first get a reasonable policy that can generalize, leading to generic policy and then use just one demonstration on a new task expecting the policy to generalize. So it seems there is some sort of pre-training involved here. Regardless, this work is a big step forward for imitation learning, permitting a wider range of tasks for which there is little training data and no reward function available, to still be successfully solved.<br />
<br />
= Illustrative Example: Particle Reaching =<br />
<br />
[[File:f1.png]]<br />
<br />
Figure 1: [Left] Agent, [Middle] Orange square is target, [Right] Green triangle is target [2].<br />
<br />
Another simple yet insightful example of the One-Shot Imitation Learning is the particle reaching problem which provides a relatively simple suite of tasks from which the network needs to solve an arbitrary one. The problem is formulated such that for each task: there is an agent which can move based on a 2D force vector, and n landmarks at varying 2D locations (n varies from task to task) with the goal of moving the agent to the specific landmark reached in the demonstration. This is illustrated in Figure 1. <br />
<br />
[[File:f2.png|450px]]<br />
<br />
Figure 2: Experimental results [2].<br />
<br />
Some insight comes from the use of different network architectures to solve this problem. The three architectures to compare (described below) are plain LSTM, LSTM with attention, and final state with attention. The key insight is that the architectures go from generic to specific, with the best generalization performance achieved with the most specific architecture, final state with attention, as seen in Figure 2. It is important to note that this conclusion does not carry forward to more complicated tasks such as the block stacking task.<br />
*Plain LSTM: 512 hidden units, with the input being the demonstration trajectory (the position of the agent changes over time and approaches one of the targets). Output of the LSTM with the current state (from the task needed to be solved) is the input for a multi-layer perceptron (MLP) for finding the solution.<br />
*LSTM with attention: Output of LSTM is now a set of weights for the different targets during training. These weights and the test state are used in the test task. The, now, 2D output is the input for an MLP as before.<br />
*Final state with attention: Looks only at the final state of the demonstration since it can sufficiently provide the needed detail of which target to reach (trajectory is not required). Similar to previous architecture, produces weights used by MLP.<br />
<br />
= Source =<br />
# Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).<br />
# Duan, Yan, Marcin Andrychowicz, Bradly Stadie, OpenAI Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. "One-shot imitation learning." In Advances in neural information processing systems, pp. 1087-1098. 2017.<br />
# Y. Duan, M. Andrychowicz, B. Stadie, J. Ho, J. Schneider, I. Sutskever, P. Abbeel, and W. Zaremba. One-shot imitation learning. arXiv preprint arXiv:1703.07326, 2017. (Newer revision)<br />
# Finn, Chelsea, Pieter Abbeel, and Sergey Levine. "Model-agnostic meta-learning for fast adaptation of deep networks." arXiv preprint arXiv:1703.03400 (2017).<br />
# Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances in neural information processing systems. 2014.</div>F7xiahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=One-Shot_Imitation_Learning&diff=36472One-Shot Imitation Learning2018-04-21T04:26:26Z<p>F7xia: /* Criticisms */</p>
<hr />
<div>= Introduction =<br />
We are interested in robotic systems that are able to perform a variety of complex useful tasks. Robotic systems can be used for many applications, but to truly be useful for complex applications, they need to overcome 2 challenges: having the intent of the task at hand communicated to them, and being able to perform the manipulations necessary to complete this task. It is preferable to use demonstration to teach the robotic systems rather than natural language, as natural language may often fail to convey the details and intricacies required for the task. However, current work on learning from demonstrations is only successful with large amounts of feature engineering or a large number of demonstrations. The proposed model aims to achieve 'one-shot' imitation learning, ie. learning to complete a new task from just a single demonstration of it without any other supervision. As input, the proposed model takes the observation of the current instance of a task, and a demonstration of successfully solving a different instance of the same task. Strong generalization was achieved by using a soft attention mechanism on both the sequence of actions and states that the demonstration consists of, as well as on the vector of element locations within the environment. The success of this proposed model at completing a series of block stacking tasks can be viewed at http://bit.ly/nips2017-oneshot.<br />
<br />
= Related Work =<br />
While one-shot imitation learning is a novel combination of ideas, each of the components has previously been studied.<br />
* Imitation Learning: <br />
** Behavioural learning uses supervised learning to map from observations to actions (e.g. [https://papers.nips.cc/paper/95-alvinn-an-autonomous-land-vehicle-in-a-neural-network.pdf (Pomerleau 1988)], [https://arxiv.org/pdf/1011.0686.pdf (Ross et. al 2011)])<br />
** Inverse reinforcement learning estimates a reward function that considers demonstrations as optimal behavior (e.g. [http://ai.stanford.edu/~ang/papers/icml00-irl.pdf (Ng et. al 2000)])<br />
* One-Shot Learning: is an object categorization problem in computer vision. Whereas most machine learning based object categorization algorithms require training on hundreds or thousands of images and very large datasets, one-shot learning aims to learn information about object categories from one, or only a few , training images.<br />
** Typically a form of meta-learning<br />
** Previously used for variety of tasks but all domain-specific<br />
** [https://arxiv.org/abs/1703.03400 (Finn et al. 2017)] proposed a generic solution but excluded imitation learning<br />
* Reinforcement Learning:<br />
** Demonstrated to work on variety of tasks and environments, in particular on games and robotic control<br />
** Requires large amount of trials and a user-specified reward function<br />
* Multi-task/Transfer Learning:<br />
** Shown to be particularly effective at computer vision tasks<br />
** Not meant for one-shot learning<br />
* Attention Modelling:<br />
** The proposed model makes use of the attention model from [https://arxiv.org/abs/1409.0473 (Bahdanau et al. 2016)]<br />
** The attention modelling over demonstration is similar in nature to the seq2seq models from the well known [https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf (Sutskever et al. 2014)]<br />
<br />
= One-Shot Imitation Learning =<br />
<br />
[[File:oneshot1.jpg|1000px]]<br />
<br />
The figure above shows the differences between traditional and one-shot imitation learning. In a), the traditional method may require training different policies for performing similar tasks that are similar in nature. For example, stacking blocks to a height of 2 and to a height of 3. In b), the one-shot imitation learning allows the same policy to be used for these tasks given a single demonstration, achieving good performance without any additional system interactions. In c), the policy is trained by using a set of different training tasks, with enough examples so that the learned results can be generalized to other similar tasks. Each task has a set of successful demonstrations. Each iteration of training uses two demonstrations from a task, one is used as the input passing into the algorithm and the other is used at the output, the results from the two are then conditioned to produce the correct action.<br />
<br />
== Problem Formalization ==<br />
The problem is briefly formalized with the authors describing a distribution of tasks, an individual task, a distribution of demonstrations for this task, and a single demonstration respectively as \[T, \: t\sim T, \: D(t), \: d\sim D(t)\]<br />
In addition, an action, an observation, parameters, and a policy are respectively defined as \[a, o, \theta, \pi_\theta(a|o,d)\]<br />
In particular, a demonstration is a sequence of observation and action pairs \[d = [(o_1, a_1),(o_2, a_2), . . . ,(o_H , a_H )]\]<br />
Assuming that <math>H </math>, the length or horizon of a demonstration, and some evaluation function $$R_t(d): R^H \rightarrow R$$ are given, and that succesful demonstrations are available for each task, then the objective is to maximize expectation of the policy performance over \[t\sim T, d\sim D(t)\].<br />
<br />
== Block Stacking Tasks ==<br />
The tasks that the authors focus on is block stacking. A user specifies in what final configuration cubic blocks should be stacked, and the goal is to use a 7-DOF Fetch robotic arm to arrange the blocks in this configuration. The number of blocks, and their desired configuration (ie. number of towers, the height of each tower, and order of blocks within each tower) can be varied and encoded as a string. For example, 'abc def' would signify 2 towers of height 3, with block A on block B on block C in one tower, and block D on block E on block F in a second tower. To add complexity, the initial configuration of the blocks can vary and is encoded as a set of 3-dimensional vectors describing the position of each block relative to the robotic arm.<br />
<br />
== Algorithm ==<br />
To avoid needing to specify a reward function, the authors use behavioral cloning and DAGGER, 2 imitation learning methods that require only demonstrations, for training. In each training step, a list of tasks is sampled, and for each, a demonstration with injected noise along with some observation-action pairs are sampled. Given the current observation and demonstration as input, the policy is trained against the sampled actions by minimizing L2 norm for continuous actions, and cross-entropy for discrete ones. Adamax is used as the optimizer with a learning rate of 0.001.<br />
<br />
= Architecture =<br />
The authors propose a novel architecture for imitation learning, consisting of 3 networks.<br />
<br />
While, in principle, a generic neural network could learn the mapping from demonstration and current observation to appropriate action, the authors propose the following architecture which they claim as one of the main contributions of this paper, and believe it would be useful for complex tasks in the future.<br />
The proposed architecture consists of three modules: the demonstration network, the context network, and the manipulation network.<br />
<br />
[[File:oneshot2.jpg|1000px|center]]<br />
<br />
== Demonstration Network ==<br />
This network takes a demonstration as input and produces an embedding with size linearly proportional to the number of blocks and the size of the demonstration.<br />
=== Temporal Dropout ===<br />
Since a demonstration for block stacking can be very long, the authors randomly discard 95% of the time steps, a process they call 'temporal dropout'. The reduced size of the demonstrations allows multiple trajectories to be explored during testing to calculate an ensemble estimate. Dilated temporal convolutions and neighborhood attention are then repeatedly applied to the downsampled demonstrations. For block stacking project, the demonstrations can span hundreds to thousands of time<br />
steps, and training with such long sequences can be demanding in both time and memory usage. Hence, the author randomly discard a subset of time steps during training, such operation is called "temporal dropout". Denote p as the proportion of time steps that are thrown away (in this case p = 95%).<br />
<br />
=== Neighborhood Attention ===<br />
Since demonstration sizes can vary, a mechanism is needed that is not restricted to fixed-length inputs. While soft attention is one such mechanism, the problem with it is that there may be increasingly large amounts of information lost if soft attention is used to map longer demonstrations to the same fixed length as shorter demonstrations. As a solution, the authors propose having the same number of outputs as inputs, but with attention performed on other inputs relative to the current input.<br />
<br />
A query <math>q</math>, a list of context vectors <math>\{c_j\}</math>, and a list of memory vectors <math>\{m_j\}</math> are given as input to soft attention. Each attention weight is given by the product of a learned weight vector and a nonlinearity applied to the sum of the query and corresponding context vector. Softmaxed weights applied to the corresponding memory vector form the output of the soft attention.<br />
<br />
\[Inputs: q, \{c_j\}, \{m_j\}\]<br />
\[Weights: w_i \leftarrow v^Ttanh(q+c_i)\]<br />
\[Output: \sum_i{m_i\frac{\exp(w_i)}{\sum_j{\exp(w_j)}}}\]<br />
<br />
A list of same-length embeddings, coming from a previous neighbourhood attention layer or a projection from the list of block coordinates, is given as input to neighborhood attention. For each block, two separate linear layers produce a query vector and a context vector, while a memory vector is a list of tuples that describe the position of each block joined with the input embedding for that block. Soft attention is then performed on this query, context vector, and memory vector. The authors claim that the intuition behind this process is to allow each block to provide information about itself relative to the other blocks in the environment. Finally, for each block, a linear transformation is performed on the vector composed by concatenating the input embedding, the result of the soft attention for that block, and the robot's state.<br />
<br />
For an environment with B blocks:<br />
\[State: s\]<br />
\[Block_i: b_i \leftarrow (x_i, y_i, z_i)\]<br />
\[Embeddings: h_1^{in}, ..., h_B^{in}\] <br />
\[Query_i: q_i \leftarrow Linear(h_i^{in})\]<br />
\[Context_i: c_i \leftarrow Linear(h_i^{in})\]<br />
\[Memory_i: m_i \leftarrow (b_i, h_i^{in}) \]<br />
\[Result_i: result_i \leftarrow SoftAttn(q_i, \{c_j\}_{j=1}^B, \{m_k\}_{k=1}^B)\]<br />
\[Output_i: output_i \leftarrow Linear(concat(h_i^{in}, result_i, b_i, s))\]<br />
<br />
== Context network ==<br />
This network takes the current state and the embedding produced by the demonstration network as inputs and outputs a fixed-length "context embedding" which captures only the information relevant for the manipulation network at this particular step.<br />
=== Attention over demonstration ===<br />
The current state is used to compute a query vector which is then used for attending over all the steps of the embedding. Since at each time step there are multiple blocks, the weights for each are summed together to produce a scalar for each time step. Neighbourhood attention is then applied several times, using an LSTM with untied weights, since the information at each time steps needs to be propagated to each block's embedding. <br />
<br />
Performing attention over the demonstration yields a vector whose size is independent of the demonstration size; however, it is still dependent on the number of blocks in the environment, so it is natural to now attend over the state in order to get a fixed-length vector.<br />
=== Attention over current state ===<br />
The authors propose that in general, within each subtask, only a limited number of blocks are relevant for performing the subtask. If the subtask is to stack A on B, then intuitively, one would suppose that only block A and B are relevant, and perhaps any blocks that may be blocking access to either A or B. This is not enforced during training, but once soft attention is applied to the current state to produce a fixed-length context embedding, the authors believe that the model does indeed learn in this way.<br />
<br />
== Manipulation network ==<br />
Given the context embedding as input, this simple feedforward network decides on the particular action needed, to complete the subtask of stacking one particular 'source' block on top of another 'target' block. The manipulation network uses an MLP network. Since the network in the paper can only takes into account the source and target block it may take subobtimal paths. For example changing [ABC, D] to [C, ABD] can be done in one motion if it was possible to manipulate two blocks at once. The manipulation network is the simplest part of the network and leaves room to expand upon in future work.<br />
<br />
= Experiments = <br />
The proposed model was tested on the block stacking tasks. the experiments were designed at answering the following questions:<br />
* How does training with behavioral cloning compare with DAGGER?<br />
* How does conditioning on the entire demonstration compare to conditioning on the final state?<br />
* How does conditioning on the entire demonstration compare to conditioning on a “snapshot” of the trajectory?<br />
* Can the authors' framework generalize to tasks that it has never seen during training?<br />
For the experiments, 140 training tasks and 43 testing tasks were collected, each with between 2 to 10 blocks and a different, desired final layout. Over 1000 demonstrations for each task were collected using a hard-coded policy rather than a human user. The authors compare 4 different architectures in these experiments:<br />
* Behavioural cloning used to train the proposed model<br />
* DAGGER used to train the proposed model<br />
* The proposed model, trained with DAGGER, but conditioned on the desired final state rather than an entire demonstration<br />
* The proposed model, trained with DAGGER, but conditioned on a 'snapshot' of the environment at the end of each subtask (ie. every time a block is stacked on another block)<br />
<br />
== Performance Evaluation ==<br />
[[File:oneshot3.jpg|1000px]]<br />
<br />
The most confident action at each timestep is chosen in 100 different task configurations, and results are averaged over tasks that had the same number of blocks. The results suggest that the performance of each of the architectures is comparable to that of the hard-coded policy which they aim to imitate. Performance degrades similarly across all architectures and the hard-coded policy as the number of blocks increases. On the harder tasks, conditioning on the entire demonstration led to better performance than conditioning on snapshots or on the final state. The authors believe that this may be due to the lack of information when conditioning only on the final state as well as due to regularization caused by temporal dropout which leads to data augmentation when conditioning on the full demonstration but is omitted when conditioning only on the snapshots or final state. Both DAGGER and behavioral cloning performed comparably well. As mentioned above, noise injection was used in training to improve performance; in practice, additional noise can still be injected but some may already come from other sources.<br />
<br />
== Visualization ==<br />
The authors visualize the attention mechanisms underlying the main policy architecture to have a better understanding about how it operates. There are two kinds of attention that the authors are mainly interested in, one where the policy attends to different time steps in the demonstration, and the other where the policy attends to different blocks in the current state. The figures below show some of the policy attention heatmaps over time.<br />
<br />
[[File:paper6_Visualization.png|800px]]<br />
<br />
= Conclusions =<br />
The proposed model successfully learns to complete new instances of a new task from just a single demonstration. The model was demonstrated to work on a series of block stacking tasks. The authors propose several extensions including enabling few-shot learning when one demonstration is insufficient, using image data as the demonstrations, and attempting many other tasks aside from block stacking.<br />
<br />
= Criticisms =<br />
While the paper shows an incredibly impressive result: the ability to learn a new task from just a single demonstration, there are a few points that need clearing up.<br />
Firstly, the authors use a hard-coded policy in their experiments rather than a human. It is clear that the performance of this policy begins to degrade quickly as the complexity of the task increases. It would be useful to know what this hard-coded policy actually was, and if the proposed model could still have comparable performance if a more successful demonstration, perhaps one by a human user, were performed. Give the current popularity of adversarial examples, it would also be interesting to see the performance when conditioned on an "adversarial" demonstration, that achieves the correct final state, but intentionally performs complex or obfuscated steps to get there.<br />
Second, it would be useful to see the model's performance on a more complex family of tasks than block stacking, since although each block stacking task is slightly different, the differences may turn out be insignificant compared to other tasks that this model should work on if it is to be a general imitation learning architecture; intuitively, the space of all possible moves and configurations is not large for the task. Also it is a bit misleading as there seems to be a need for more demonstrations to first get a reasonable policy that can generalize, leading to generic policy and then use just one demonstration on a new task expecting the policy to generalize. So it seems there is some sort of pre-training involved here. Regardless, this work is a big step forward for imitation learning, permitting a wider range of tasks for which there is little training data and no reward function available, to still be successfully solved.<br />
<br />
= Illustrative Example: Particle Reaching =<br />
<br />
[[File:f1.png]]<br />
<br />
Figure 1: [Left] Agent, [Middle] Orange square is target, [Right] Green triangle is target [2].<br />
<br />
Another simple yet insightful example of the One-Shot Imitation Learning is the particle reaching problem which provides a relatively simple suite of tasks from which the network needs to solve an arbitrary one. The problem is formulated such that for each task: there is an agent which can move based on a 2D force vector, and n landmarks at varying 2D locations (n varies from task to task) with the goal of moving the agent to the specific landmark reached in the demonstration. This is illustrated in Figure 1. <br />
<br />
[[File:f2.png|450px]]<br />
<br />
Figure 2: Experimental results [2].<br />
<br />
Some insight comes from the use of different network architectures to solve this problem. The three architectures to compare (described below) are plain LSTM, LSTM with attention, and final state with attention. The key insight is that the architectures go from generic to specific, with the best generalization performance achieved with the most specific architecture, final state with attention, as seen in Figure 2. It is important to note that this conclusion does not carry forward to more complicated tasks such as the block stacking task.<br />
*Plain LSTM: 512 hidden units, with the input being the demonstration trajectory (the position of the agent changes over time and approaches one of the targets). Output of the LSTM with the current state (from the task needed to be solved) is the input for a multi-layer perceptron (MLP) for finding the solution.<br />
*LSTM with attention: Output of LSTM is now a set of weights for the different targets during training. These weights and the test state are used in the test task. The, now, 2D output is the input for an MLP as before.<br />
*Final state with attention: Looks only at the final state of the demonstration since it can sufficiently provide the needed detail of which target to reach (trajectory is not required). Similar to previous architecture, produces weights used by MLP.<br />
<br />
= Source =<br />
# Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).<br />
# Duan, Yan, Marcin Andrychowicz, Bradly Stadie, OpenAI Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. "One-shot imitation learning." In Advances in neural information processing systems, pp. 1087-1098. 2017.<br />
# Y. Duan, M. Andrychowicz, B. Stadie, J. Ho, J. Schneider, I. Sutskever, P. Abbeel, and W. Zaremba. One-shot imitation learning. arXiv preprint arXiv:1703.07326, 2017. (Newer revision)<br />
# Finn, Chelsea, Pieter Abbeel, and Sergey Levine. "Model-agnostic meta-learning for fast adaptation of deep networks." arXiv preprint arXiv:1703.03400 (2017).<br />
# Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances in neural information processing systems. 2014.</div>F7xiahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/IMPROVING_GANS_USING_OPTIMAL_TRANSPORT&diff=36471stat946w18/IMPROVING GANS USING OPTIMAL TRANSPORT2018-04-21T04:23:43Z<p>F7xia: /* Discussion */</p>
<hr />
<div>== Introduction ==<br />
Recently, the problem of how to learn models that generate media such as images, video, audio and text has become very popular and is called Generative Modeling. One of the main benefits of such an approach is that generative models can be trained on unlabeled data that is readily available . Therefore, generative networks have a huge potential in the field of deep learning.<br />
<br />
Generative Adversarial Networks (GANs) are powerful generative models used for unsupervised learning techniques where the 2 agents compete to generate a zero-sum model. A GAN model consists of a generator and a discriminator or critic. The generator is a neural network which is trained to generate data having a distribution matched with the distribution of the real data. The critic is also a neural network, which is trained to separate the generated data from the real data. A loss function that measures the distribution distance between the generated data and the real one is important to train the generator.<br />
<br />
Optimal transport theory, which is another approach to measuring distances between distributions, evaluates the distribution distance between the generated data and the training data based on a metric, which provides another method for generator training. The main advantage of optimal transport theory over the distance measurement in GAN is its closed form solution for having a tractable training process. But the theory might also result in inconsistency in statistical estimation due to the given biased gradients if the mini-batches method is applied (Bellemare et al.,<br />
2017).<br />
<br />
This paper presents a variant GANs named OT-GAN, which incorporates a discriminative metric called 'Mini-batch Energy Distance' into its critic in order to overcome the issue of biased gradients.<br />
<br />
== GANs and Optimal Transport ==<br />
<br />
===Generative Adversarial Nets===<br />
Original GAN was firstly reviewed. The objective function of the GAN: <br />
<br />
[[File:equation1.png|700px]]<br />
<br />
The goal of GANs is to train the generator g and the discriminator d finding a pair of (g,d) to achieve Nash equilibrium(such that either of them cannot reduce their cost without changing the others' parameters). However, it could cause failure of converging since the generator and the discriminator are trained based on gradient descent techniques.<br />
<br />
===Wasserstein Distance (Earth-Mover Distance)===<br />
<br />
In order to solve the problem of convergence failure, Arjovsky et. al. (2017) suggested Wasserstein distance (Earth-Mover distance) based on the optimal transport theory.<br />
<br />
[[File:equation2.png|600px]]<br />
<br />
where <math> \prod (p,g) </math> is the set of all joint distributions <math> \gamma (x,y) </math> with marginals <math> p(x) </math> (real data), <math> g(y) </math> (generated data). <math> c(x,y) </math> is a cost function and the Euclidean distance was used by Arjovsky et. al. in the paper. <br />
<br />
The Wasserstein distance can be considered as moving the minimum amount of points between distribution <math> g(y) </math> and <math> p(x) </math> such that the generator distribution <math> g(y) </math> is similar to the real data distribution <math> p(x) </math>.<br />
<br />
Computing the Wasserstein distance is intractable. The proposed Wasserstein GAN (W-GAN) provides an estimated solution by switching the optimal transport problem into Kantorovich-Rubinstein dual formulation using a set of 1-Lipschitz functions. A neural network can then be used to obtain an estimation.<br />
<br />
[[File:equation3.png|600px]]<br />
<br />
W-GAN helps to solve the unstable training process of original GAN and it can solve the optimal transport problem approximately, but it is still intractable.<br />
<br />
===Sinkhorn Distance===<br />
Genevay et al. (2017) proposed to use the primal formulation of optimal transport instead of the dual formulation to generative modeling. They introduced Sinkhorn distance which is a smoothed generalization of the Wasserstein distance.<br />
[[File: equation4.png|600px]]<br />
<br />
It introduced entropy restriction (<math> \beta </math>) to the joint distribution <math> \prod_{\beta} (p,g) </math>. This distance could be generalized to approximate the mini-batches of data <math> X ,Y</math> with <math> K </math> vectors of <math> x, y</math>. The <math> i, j </math> th entry of the cost matrix <math> C </math> can be interpreted as the cost it needs to transport the <math> x_i </math> in mini-batch X to the <math> y_i </math> in mini-batch <math>Y </math>. The resulting distance will be:<br />
<br />
[[File: equation5.png|550px]]<br />
<br />
where <math> M </math> is a <math> K \times K </math> matrix, each row of <math> M </math> is a joint distribution of <math> \gamma (x,y) </math> with positive entries. The summmation of rows or columns of <math> M </math> is always equal to 1. <br />
<br />
This mini-batch Sinkhorn distance is not only fully tractable but also capable of solving the instability problem of GANs. However, it is not a valid metric over probability distribution when taking the expectation of <math> \mathcal{W}_{c} </math> and the gradients are biased when the mini-batch size is fixed.<br />
<br />
===Energy Distance (Cramer Distance)===<br />
In order to solve the above problem, Bellemare et al. proposed Energy distance:<br />
<br />
[[File: equation6.png|700px]]<br />
<br />
where <math> x, x' </math> and <math> y, y'</math> are independent samples from data distribution <math> p </math> and generator distribution <math> g </math>, respectively. Based on the Energy distance, Cramer GAN is to minimize the ED distance metric when training the generator.<br />
<br />
==Mini-Batch Energy Distance==<br />
Salimans et al. (2016) mentioned that comparing to use distributions over individual images, mini-batch GAN is more powerful when using the distributions over mini-batches <math> g(X), p(X) </math>. The distance measure is generated for mini-batches.<br />
<br />
===Generalized Energy Distance===<br />
The generalized energy distance allowed to use non-Euclidean distance functions d. It is also valid for mini-batches and is considered better than working with individual data batch.<br />
<br />
[[File: equation7.png|670px]]<br />
<br />
Similarly as defined in the Energy distance, <math> X, X' </math> and <math> Y, Y'</math> can be the independent samples from data distribution <math> p </math> and the generator distribution <math> g </math>, respectively. While in Generalized engergy distance, <math> X, X' </math> and <math> Y, Y'</math> can also be valid for mini-batches. The <math> D_{GED}(p,g) </math> is a metric when having <math> d </math> as a metric. Thus, taking the triangle inequality of <math> d </math> into account, <math> D(p,g) \geq 0,</math> and <math> D(p,g)=0 </math> when <math> p=g </math>.<br />
<br />
===Mini-Batch Energy Distance===<br />
As <math> d </math> is free to choose, authors proposed Mini-batch Energy Distance by using entropy-regularized Wasserstein distance as <math> d </math>. <br />
<br />
[[File: equation8.png|650px]]<br />
<br />
where <math> X, X' </math> and <math> Y, Y'</math> are independent sampled mini-batches from the data distribution <math> p </math> and the generator distribution <math> g </math>, respectively. This distance metric combines the energy distance with primal form of optimal transport over mini-batch distributions <math> g(Y) </math> and <math> p(X) </math>. Inside the generalized energy distance, the Sinkhorn distance is a valid metric between each mini-batches. By adding the <math> - \mathcal{W}_c (Y,Y')</math> and <math> \mathcal{W}_c (X,Y)</math> to equation (5) and using energy distance, the objective becomes statistically consistent (meaning the objective converges to the true parameter value for large sample sizes) and mini-batch gradients are unbiased.<br />
<br />
==Optimal Transport GAN (OT-GAN)==<br />
<br />
The mini-batch energy distance which was proposed depends on the transport cost function <math>c(x,y)</math>. One possibility would be to choose c to be some fixed function over vectors, like Euclidean distance, but the authors found this to perform poorly in preliminary experiments. For simple fixed cost functions like Euclidean distance, there exists many bad distributions <math>g</math> in higher dimensions for which the mini-batch energy distance is zero such that it is difficult to tell <math>p</math> and <math>g</math> apart if the sample size is not big enough. To solve this the authors propose learning the cost function adversarially, so that it can adapt to the generator distribution <math>g</math> and thereby become more discriminative. <br />
<br />
In practice, in order to secure the statistical efficiency (i.e. being able to tell <math>p</math> and <math>g</math> apart without requiring an enormous sample size when their distance is close to zero), authors suggested using cosine distance between vectors <math> v_\eta (x) </math> and <math> v_\eta (y) </math> based on the deep neural network that maps the mini-batch data to a learned latent space. Here is the transportation cost:<br />
<br />
[[File: euqation9.png|370px]]<br />
<br />
where the <math> v_\eta </math> is chosen to maximize the resulting minibatch energy distance.<br />
<br />
Unlike the practice when using the original GANs, the generator was trained more often than the critic, which keep the cost function from degeneration. The resulting generator in OT-GAN has a well defined and statistically consistent objective through the training process.<br />
<br />
The algorithm is defined below. The backpropagation is not used in the algorithm since ignoring this gradient flow is justified by the envelope theorem (i.e. when changing the parameters of the objective function, changes in the optimizer do not contribute to a change in the objective function). Stochastic gradient descent is used as the optimization method in algorithm 1 below, although other optimizers are also possible. In fact, Adam was used in experiments. <br />
<br />
[[File: al.png|600px]]<br />
<br />
<br />
[[File: al_figure.png|600px]]<br />
<br />
==Experiments==<br />
<br />
In order to demonstrate the supermum performance of the OT-GAN, authors compared it with the original GAN and other popular models based on four experiments: Dataset recovery; CIFAR-10 test; ImageNet test; and the conditional image synthesis test.<br />
<br />
===Mixture of Gaussian Dataset===<br />
OT-GAN has a statistically consistent objective when it is compared with the original GAN (DC-GAN), such that the generator would not update to a wrong direction even if the signal provided by the cost function to the generator is not good. In order to prove this advantage, authors compared the OT-GAN with the original GAN loss (DAN-S) based on a simple task. The task was set to recover all of the 8 modes from 8 Gaussian mixers in which the means were arranged in a circle. MLP with RLU activation functions were used in this task. The critic was only updated for 15K iterations. The generator distribution was tracked for another 25K iteration. The results showed that the original GAN experiences the model collapse after fixing the discriminator while the OT-GAN recovered all the 8 modes from the mixed Gaussian data.<br />
<br />
[[File: 5_1.png|600px]]<br />
<br />
===CIFAR-10===<br />
<br />
The dataset CIFAR-10 was then used for inspecting the effect of batch-size to the model training process and the image quality. OT-GAN and four other methods were compared using "inception score" as the criteria for comparison. Figure 3 shows the change of inceptions scores (y-axis) by the increased of the iteration number. Scores of four different batch sizes (200, 800, 3200 and 8000) were compared. The results show that a larger batch size, which would more likely cover more modes in the distribution of data, lead to a more stable model showing a larger value in inception score. However, a large batch size would also require a high-performance computational environment. The sample quality across all 5 methods, ran using a batch size of 8000, are compared in Table 1 where the OT_GAN has the best score.<br />
<br />
The OT-GAN was trained using Adam optimizer. The learning rate was set to <math> 0.0003, \beta_1 = 0.5, \beta_2 = 0.999 </math> . The introduced OT-GAN algorithm also includes two additional hyperparameters for the Sinkhorn algorithm. The first hyperparameters indicated the number of iterations to run the algorithm and the second <math> 1 / \lambda </math> the entropy penalty of alignments. The authors found that a value of 500 worked well for both mentioned hyperparameters. The network uses the following architecture:<br />
<br />
[[File: cf10gc.png|600px]]<br />
<br />
[[File: 5_2.png|600px]]<br />
<br />
Figure 4 below shows samples generated by the OT-GAN trained with a batch size of 8000. Figure 5 below shows random samples from a model trained with the same architecture and hyperparameters, but with random matching of samples in place of optimal transport.<br />
<br />
[[File: ot_gan_cifar_10_samples.png|600px]]<br />
<br />
<br />
In order to show the advantage of learning the cost function adversarially, the CIFAR-10 experiment was re-run with the cost as follows:<br />
<br />
[[File: OTGAN_CosineDist.png|250px]]<br />
<br />
When using this fixed cost and keeping the other experiment settings constant, the max inception score dropped from 8.47 with learned to 4.93 with fixed cost functions. The results of the fixed cost are seen in Figure 8 below.<br />
<br />
[[File: OTGAN_fixedDist.png|600px]]<br />
<br />
===ImageNet Dogs===<br />
<br />
In order to investigate the performance of OT-GAN when dealing with the high-quality images, the dog subset of ImageNet (128*128) was used to train the model. Figure 6 shows that OT-GAN produces less nonsensical images and it has a higher inception score compare to the DC-GAN. <br />
<br />
[[File: 5_3.png|600px]]<br />
<br />
<br />
To analyze mode collapse in GANs the authors trained both types of GANs for a large number of epochs. They find the DCGAN shows mode collapse as soon as 900 epochs. They trained the OT-GAN for 13000 epochs and saw no evidence of mode collapse or less diversity in the samples. Samples can be viewed in Figure 9.<br />
<br />
[[File: ModelCollapseImageNetDogs.png|600px]]<br />
<br />
===Conditional Generation of Birds===<br />
<br />
The last experiment was to compare OT-GAN with three popular GAN models for processing the text-to-image generation demonstrating the performance on conditional image synthesis. As can be found from Table 2, OT-GAN received the highest inception score than the scores of the other three models. <br />
<br />
[[File: 5_4.png|600px]]<br />
<br />
The algorithm used to obtain the results above is conditional generation generalized from '''Algorithm 1''' to include conditional information <math>s</math> such as some text description of an image. The modified algorithm is outlined in '''Algorithm 2'''.<br />
<br />
[[File: paper23_alg2.png|600px]]<br />
<br />
==Conclusion==<br />
<br />
In this paper, an OT-GAN method was proposed based on the optimal transport theory. A distance metric that combines the primal form of the optimal transport and the energy distance was given was presented for realizing the OT-GAN. The results showed OT-GAN to be uniquely stable when trained with large mini batches and state of the art results were achieved on some datasets. One of the advantages of OT-GAN over other GAN models is that OT-GAN can stay on the correct track with an unbiased gradient even if the training on critic is stopped or presents a weak cost signal. The performance of the OT-GAN can be maintained when the batch size is increasing, though the computational cost has to be taken into consideration.<br />
<br />
==Critique==<br />
<br />
The paper presents a variant of GANs by defining a new distance metric based on the primal form of optimal transport and the mini-batch energy distance. The stability was demonstrated through the four experiments that comparing OP-GAN with other popular methods. However, limitations in computational efficiency were not discussed much. Furthermore, in section 2, the paper lacks explanation on using mini-batches instead of a vector as input when applying Sinkhorn distance. It is also confusing when explaining the algorithm in section 4 about choosing M for minimizing <math> \mathcal{W}_c </math>. Lastly, it is found that it is lack of parallel comparison with existing GAN variants in this paper. Readers may feel jumping from one algorithm to another without necessary explanations. However, one downside of OT-GAN, as mentioned in the paper, is that it requires large amounts of computation and memory.<br />
<br />
= Discussion =<br />
We have presented OT-GAN, a new variant of GANs where the generator is trained to minimize<br />
a novel distance metric over probability distributions. This metric, which we call mini-batch energy<br />
distance, combines optimal transport in primal form with an energy distance defined in an<br />
adversarially learned feature space, resulting in a highly discriminative distance function with unbiased<br />
mini-batch gradients. OT-GAN was shown to be uniquely stable when trained with large<br />
mini-batches and to achieve state-of-the-art results on several common benchmarks.<br />
One downside of OT-GAN, as currently proposed, is that it requires large amounts of computation<br />
and memory. We achieve the best results when using very large mini-batches, which increases the<br />
time required for each update of the parameters. All experiments in this paper, except for the mixture<br />
of Gaussians toy example, were performed using 8 GPUs and trained for several days. In future work,<br />
we hope to make the method more computationally efficient, as well as to scale up our approach to<br />
multi-machine training to enable generation of even more challenging and high-resolution image<br />
data sets.<br />
A unique property of OT-GAN is that the mini-batch energy distance remains a valid training objective<br />
even when we stop training the critic. Our implementation of OT-GAN updates the generative<br />
model more often than the critic, where GANs typically do this the other way around (see e.g. Gulrajani<br />
et al., 2017). As a result, we learn a relatively stable transport cost function c(x, y), describing<br />
how (dis)similar two images are, as well as an image embedding function vη(x) capturing the geometry<br />
of the training data. Preliminary experiments suggest these learned functions can be used<br />
successfully for unsupervised learning and other applications, which we plan to investigate further<br />
in future work.<br />
<br />
==Reference==<br />
Salimans, Tim, Han Zhang, Alec Radford, and Dimitris Metaxas. "Improving GANs using optimal transport." (2018).</div>F7xiahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Do_Deep_Neural_Networks_Suffer_from_Crowding&diff=36470Do Deep Neural Networks Suffer from Crowding2018-04-21T04:21:44Z<p>F7xia: /* Drawbacks of CNNs */</p>
<hr />
<div>= Introduction =<br />
Since the increase in popularity of Deep Neural Networks (DNNs), there has been increased research in making machines capable of recognizing objects the same way humans do. Humans can recognize objects in ways that are invariant to scale, translation, and clutter. Crowding is visual effect suffered by humans, in which an object that can be recognized in isolation can no longer be recognized when other objects, called flankers, are placed close to it. This paper focuses on studying the impact of crowding on DNNs trained for object recognition by adding clutter to the images and then analyzing which models and settings suffer less from such effects. <br />
<br />
[[File:paper25_fig_crowding_ex.png|center|600px]]<br />
The figure shows a visual example of crowding [3]. Keep your eyes still and look at the dot in the center and try to identify the "A" in the two circles. You should see that it is much easier to make out the "A" in the right than in the left circle. The same "A" exists in both circles, however, the left circle contains flankers which are those line segments.<br />
<br />
Another common example to visualize the same:<br />
[[File:crowding-tigger.jpg|center|600px]]<br />
<br />
<br />
===Drawbacks of CNNs===<br />
CNNs fall short in explaining human perceptual invariance. Firstly, CNNs typically take input at a single uniform resolution. Biological measurements suggest that resolution is not uniform across the human visual field, but rather decays with eccentricity, i.e. distance from the center of focus. The major cause of this issue is the pooling layer in CNN structure. The pooling is an efficient technique but loses important spatial information. Pooling is also not capable to capture the hierarchical structure in the image, which is also crucial to view point problems. Even more importantly, CNNs rely not only on weight-sharing but also on data augmentation to achieve transformation invariance and so obviously a lot of processing is needed for CNNs. <br />
<br />
The paper investigates two types of DNNs for crowding: traditional deep convolutional neural networks (DCNN) and a multi-scale eccentricity-dependent model which is an extension of the DCNNs and inspired by the retina where the receptive field size of the convolutional filters in the model grows with increasing distance from the center of the image, called the eccentricity and is explained below. The authors focus on the dependence of crowding on image factors, such as flanker configuration, target-flanker similarity, target eccentricity and premature pooling in particular. Along with that, there is major emphasis on reducing the training time of the networks since the motive is to have a simple network capable of learning space-invariant features.<br />
<br />
= Models =<br />
The authors describe two kinds of DNN architectures: Deep Convolutional Neural Networks, and eccentricity dependent networks, with varying pooling strategies across space and scale. Of particular note is the pooling operation, as many researchers have suggested that this may be the cause of crowding in human perception.<br />
<br />
== Deep Convolutional Neural Networks ==<br />
The DCNN is a basic architecture with 3 convolutional layers, spatial 3x3 max-pooling with varying strides and a fully connected layer for classification as shown in the below figure. <br />
[[File:DCNN.png|800px|center]]<br />
<br />
The network is fed with images resized to 60x60, with mini-batches of 128 images, 32 feature channels for all convolutional layers, and convolutional filters of size 5x5 and stride 1.<br />
<br />
As highlighted earlier, the effect of pooling is into main consideration and hence three different configurations have been investigated as below: <br />
<br />
# '''No total pooling''' Feature maps sizes decrease only due to boundary effects, as the 3x3 max pooling has stride 1. The square feature maps sizes after each pool layer are 60-54-48-42.<br />
# '''Progressive pooling''' 3x3 pooling with a stride of 2 halves the square size of the feature maps, until we pool over what remains in the final layer, getting rid of any spatial information before the fully connected layer. (60-27-11-1).<br />
# '''At end pooling''' Same as no total pooling, but before the fully connected layer, max-pool over the entire feature map. (60-54-48-1).<br />
<br />
==Eccentricity-dependent Model==<br />
In order to take care of the scale invariance in the input image, the eccentricity dependent DNN is utilized. This was proposed as a model of the human visual cortex by [https://arxiv.org/pdf/1406.1770.pdf, Poggio et al] and later further studied in [2]. The main intuition behind this architecture is that as we increase eccentricity, the receptive fields also increase and hence the model will become invariant to changing input scales. The authors note that the width of each scale is roughly related to the amount of translation invariance for objects at that scale, simply because once the object is outside that window, the filter no longer observes it. Therefore, the authors say that the architecture emphasizes scale invariance over translation invariance, in contrast to traditional DCNNs. From a biological perspective, eye movement can compensate for the limitations of translation invariance, but compensating for scale invariance requires changing distance from the object. In this model, the input image is cropped into varying scales (11 crops increasing by a factor of <math>\sqrt{2}</math> which are then resized to 60x60 pixels) and then fed to the network. Exponentially interpolated crops are used over linearly interpolated crops since they produce fewer boundary effects while maintaining the same behavior qualitatively. The model computes an invariant representation of the input by sampling the inverted pyramid at a discrete set of scales with the same number of filters at each scale. Since the same number of filters are used for each scale, the smaller crops will be sampled at a high resolution while the larger crops will be sampled with a low resolution. These scales are fed into the network as an input channel to the convolutional layers and share the weights across scale and space. Due to the downsampling of the input image, this is equivalent to having receptive fields of varying sizes. Intuitively, this means that the network generalizes learnings across scales and is guaranteed by during back-propagation by averaging the error derivatives over all scale channels, then using the averages to compute weight adjustments. The same set of weight adjustments to the convolutional units across different scale channels is applied.<br />
[[File:EDM.png|2000x450px|center]]<br />
<br />
<br />
The architecture of this model is the same as the previous DCNN model with the only change being the extra filters added for each of the scales, so the number of parameters remains the same as DCNN models. The authors perform spatial pooling, the aforementioned ''At end pooling'' is used here, and scale pooling which helps in reducing the number of scales by taking the maximum value of corresponding locations in the feature maps across multiple scales. It has three configurations: (1) at the beginning, in which all the different scales are pooled together after the first layer, 11-1-1-1-1 (2) progressively, 11-7-5-3-1 and (3) at the end, 11-11-11-11-1, in which all 11 scales are pooled together at the last layer.<br />
<br />
===Contrast Normalization===<br />
Since there are multiple scales of an input image, in some experiments, normalization is performed such that the sum of the pixel intensities in each scale is in the same range [0,1] (this is to prevent smaller crops, which have more non-black pixels, from disproportionately dominating max-pooling across scales). The normalized pixel intensities are then divided by a factor proportional to the crop area [[File:sqrtf.png|60px]] where i=1 is the smallest crop.<br />
<br />
=Experiments=<br />
Targets are the set of objects to be recognized and flankers are the set of objects the model has not been trained to recognize, which act as clutter with respect to these target objects. The target objects are the even MNIST numbers having translational variance (shifted at different locations of the image along the horizontal axis), while flankers are from odd MNIST numbers, not MNIST dataset (contains alphabet letters) and Omniglot dataset (contains characters). Examples of the target and flanker configurations are shown below: <br />
[[File:eximages.png|800px|center]]<br />
<br />
The target and the object are referred to as ''a'' and ''x'' respectively with the below four configurations: <br />
# No flankers. Only the target object. (a in the plots) <br />
# One central flanker closer to the center of the image than the target. (xa) <br />
# One peripheral flanker closer to the boundary of the image that the target. (ax) <br />
# Two flankers spaced equally around the target, being both the same object, see Figure 1 above for an example (xax).<br />
<br />
Training is done using backpropagation with images of size <math>1920 px^2</math> with embedded targets objects and flankers of size of <math>120 px^2</math>. The training and test images are divided as per the usual MNIST configuration. To determine if there is a difference between the peripheral flankers and the central flankers, all the tests are performed in the right half image plane.<br />
<br />
==DNNs trained with Target and Flankers==<br />
This is a constant spacing training setup where identical flankers are placed at a distance of 120 pixels either side of the target(xax) with the target having translational variance. The tests are evaluated on (i) DCNN with at the end pooling, and (ii) eccentricity-dependent model with 11-11-11-11-1 scale pooling, at the end spatial pooling and contrast normalization. The results are reported by different flanker types <math>(xax,ax, xa)</math> at test. <br />
[[File:result1.png|x450px|center]]<br />
<br />
===Observations===<br />
* With the flanker configuration same as the training one, models are better at recognizing objects in clutter rather than isolated objects for all image locations<br />
* If the target-flanker spacing is changed, then models perform worse<br />
* the eccentricity model is much better at recognizing objects in isolation than the DCNN because the multi-scale crops divide the image into discrete regions, letting the model learn from image parts as well as the whole image<br />
* Only the eccentricity-dependent model is robust to different flanker configurations not included in training when the target is centered.<br />
<br />
==DNNs trained with Images with the Target in Isolation==<br />
Here the target objects are in isolation and with translational variance while the test-set is the same set of flanker configurations as used before. The constant spacing and constant eccentricity effect have been evaluated.<br />
<br />
[[File:result2.png|750x400px|center]]<br />
<br />
In addition to the evaluation of DCNNs in constant target eccentricity at 240 pixels, here they are tested with images in which the target is fixed at 720 pixels from the center of the image, as shown in Fig 3. Since the target is already at the edge of the visual field, a flanker cannot be more peripheral in the image than the target. Same results as for the 240 pixels target eccentricity can be extracted. The closer the flanker is to the target, the more accuracy decreases. Also, it can be seen that when the target is close to the image boundary, recognition is poor because of boundary effects eroding away information about the target.<br />
<br />
[[File:paper25_supplemental1.png|800px|center]]<br />
<br />
The authors also test the effect of flankers from different datasets on a DCNN model with at end pooling, with results shown in Fig. 7 below. Omniglot flankers crowd less than MNIST digits, and the authors note that this is because they are visually similar to MNIST digits, but are not actually digits, and thus activate the model's convolutional filters less than MNIST digits. The notMNIST digits however, result it more crowding. This is due to the fact that the different font style results in more high intensity pixels and edges. The intensity distributions for the 3 datasets is shown in the histograms in Fig. 12 below. The correlation between crowding and relative frequency of high intensity pixels can be seen from this figure.<br />
<br />
[[File:crowding_at_end_pooling.png|750px|center]]<br />
<br />
[[File:DCNN dataset histogram.png|750px|center]]<br />
<br />
<br />
===DCNN Observations===<br />
* Accuracy decreases with the increase in the number of flankers.<br />
* Unsurprisingly, CNNs are capable of being invariant to translations.<br />
* In the constant target eccentricity setup, where the target is fixed at the center of the image with varying target-flanker spacing, we observe that as the distance between target and flankers increase, recognition gets better.<br />
* Spatial pooling helps the network in learning invariance.<br />
* Flankers similar to the target object helps in recognition since they activate the convolutional filter more.<br />
* notMNIST data affects leads to more crowding since they have many more edges and white image pixels which activate the convolutional layers more.<br />
<br />
===Eccentric Model===<br />
The set-up is the same as explained earlier. The spacial pooling keeps constant. The effect of pooling across scales are investigated. The three configurations for scale pooling are (i) at the beginning, (ii)progressively and (iii) at the end. <br />
[[File:result3.png|750x400px|center]]<br />
<br />
====Observations====<br />
* The recognition accuracy is dependent on the eccentricity of the target object.<br />
* If the target is placed at the center and no contrast normalization is done, then the recognition accuracy is high since this model concentrates the most on the central region of the image.<br />
* If contrast normalization is done, then all the scales will contribute equal amount and hence the eccentricity dependence is removed.<br />
* Early pooling is harmful since it might take away the useful information very early which might be useful to the network.<br />
<br />
Without contrast normalization, the middle portion of the image can be focused more with high resolution so the target at the center with no normalization performs well in that case. But if normalization is done, then all the segments of the image contribute to the classification and hence the overall accuracy is not that great but the system becomes robust to the changes in eccentricity.<br />
<br />
==Complex Clutter==<br />
Here, the targets are randomly embedded into images of the Places dataset and shifted along horizontally in order to investigate model robustness when the target is not at the image center. Tests are performed on DCNN and the eccentricity model with and without contrast normalization using at end pooling. The results are shown in Figure 9 below. <br />
<br />
[[File:result4.png|750x400px|center]]<br />
<br />
====Observations====<br />
* Only eccentricity model without contrast normalization can recognize the target and only when the target is close to the image center.<br />
* The eccentricity model does not need to be trained on different types of clutter to become robust to those types of clutter, but it needs to fixate on the relevant part of the image to recognize the target. If it can fixate on the relevant part of the image, it can still discriminate it, even at different scales. This implies that the eccentricity model is robust to clutter.<br />
<br />
=Conclusions=<br />
This paper investigates the effect of crowding on a DNN. Using a simple technique of adding clutter in the model didn't improve the performance. We often think that just training the network with data similar to the test data would achieve good results in a general scenario too but that's not the case as we trained the model with flankers and it did not give us the ideal results for the target objects. The following 4 techniques influenced crowding in DNN:<br />
*'''Flanker Configuration''': When models are trained with images of objects in isolation, adding flankers harms recognition. Adding two flankers is the same or worse than adding just one and the smaller the spacing between flanker and target, the more crowding occurs. This is because the pooling operation merges nearby responses, such as the target and flankers if they are close.<br />
*'''The Similarity between target and flanker''': Flankers more similar to targets cause more crowding, because of the selectivity property of the learned DNN filters.<br />
*'''Dependence on target location and contrast normalization''': In DCNNs and eccentricity-dependent models with contrast normalization, recognition accuracy is the same across all eccentricities. In eccentricity-dependent networks without contrast normalization, recognition does not decrease despite the presence of clutter when the target is at the center of the image.<br />
*'''Effect of pooling''': adding pooling leads to better recognition accuracy of the models. Yet, in the eccentricity model, pooling across the scales too early in the hierarchy leads to lower accuracy.<br />
* The Eccentricity Dependent Models can be used for modeling the feedforward path of the primate visual cortex. <br />
* If target locations are proposed, then the system can become even more robust and hence a simple network can become robust to clutter while also reducing the amount of training data and time needed<br />
<br />
=Critique=<br />
This paper only tries to check the impact of flankers on targets as to how crowding can affect recognition but it does not propose anything novel in terms of architecture to take care of such type of crowding. The paper only shows that the eccentricity based model does better (than plain DCNN model) when the target is placed at the center of the image but maybe windowing over the frames the same way that a convolutional model passes a filter over an image, instead of taking crops starting from the middle, might help.<br />
<br />
This paper focuses on image classification. For a stronger argument, their model could be applied to the task of object detection. Perhaps crowding does not have as large of an impact when the objects of interest are localized by a region proposal network. Further, the artificial crowding introduced in the paper may not be random enough for the neural network to learn to classify the object of interest as opposed to the entire cluster of objects. For example, in the case of an even MNIST digit being flanked by two odd MNIST digits, there are only 25 possible combinations of flankers and targets.<br />
<br />
This paper does not provide a convincing argument that the problem of crowding as experienced by humans somehow shares a similar mechanism to the problem of DNN accuracy falling when there is more clutter in the scene. The multi-scale architecture does not appear similar to the distribution of rods and cones in the retina[https://www.ncbi.nlm.nih.gov/books/NBK10848/figure/A763/?report=objectonly]. It might be that the eccentric model does well when the target is centered because it is being sampled by more scales, not because it is similar to a primate visual cortex, and primates are able to recognize an object in clutter when looking directly at it.<br />
<br />
=References=<br />
# Volokitin A, Roig G, Poggio T:"Do Deep Neural Networks Suffer from Crowding?" Conference on Neural Information Processing Systems (NIPS). 2017<br />
# Francis X. Chen, Gemma Roig, Leyla Isik, Xavier Boix and Tomaso Poggio: "Eccentricity Dependent Deep Neural Networks for Modeling Human Vision" Journal of Vision. 17. 808. 10.1167/17.10.808.<br />
# J Harrison, W & W Remington, R & Mattingley, Jason. (2014). Visual crowding is anisotropic along the horizontal meridian during smooth pursuit. Journal of vision. 14. 10.1167/14.1.21. http://willjharrison.com/2014/01/new-paper-visual-crowding-is-anisotropic-along-the-horizontal-meridian-during-smooth-pursuit/</div>F7xiahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Do_Deep_Neural_Networks_Suffer_from_Crowding&diff=36469Do Deep Neural Networks Suffer from Crowding2018-04-21T04:20:50Z<p>F7xia: /* Conclusions */</p>
<hr />
<div>= Introduction =<br />
Since the increase in popularity of Deep Neural Networks (DNNs), there has been increased research in making machines capable of recognizing objects the same way humans do. Humans can recognize objects in ways that are invariant to scale, translation, and clutter. Crowding is visual effect suffered by humans, in which an object that can be recognized in isolation can no longer be recognized when other objects, called flankers, are placed close to it. This paper focuses on studying the impact of crowding on DNNs trained for object recognition by adding clutter to the images and then analyzing which models and settings suffer less from such effects. <br />
<br />
[[File:paper25_fig_crowding_ex.png|center|600px]]<br />
The figure shows a visual example of crowding [3]. Keep your eyes still and look at the dot in the center and try to identify the "A" in the two circles. You should see that it is much easier to make out the "A" in the right than in the left circle. The same "A" exists in both circles, however, the left circle contains flankers which are those line segments.<br />
<br />
Another common example to visualize the same:<br />
[[File:crowding-tigger.jpg|center|600px]]<br />
<br />
<br />
===Drawbacks of CNNs===<br />
CNNs fall short in explaining human perceptual invariance. Firstly, CNNs typically take input at a single uniform resolution. Biological measurements suggest that resolution is not uniform across the human visual field, but rather decays with eccentricity, i.e. distance from the center of focus. The major cause of this issue is the pooling layer in CNN structure. The pooling is an efficient technique but loses important spatial information. Pooling is also not capable to capture the hierarchical structure in image, which is also crucial to view point problems. Even more importantly, CNNs rely not only on weight-sharing but also on data augmentation to achieve transformation invariance and so obviously a lot of processing is needed for CNNs. <br />
<br />
The paper investigates two types of DNNs for crowding: traditional deep convolutional neural networks (DCNN) and a multi-scale eccentricity-dependent model which is an extension of the DCNNs and inspired by the retina where the receptive field size of the convolutional filters in the model grows with increasing distance from the center of the image, called the eccentricity and is explained below. The authors focus on the dependence of crowding on image factors, such as flanker configuration, target-flanker similarity, target eccentricity and premature pooling in particular. Along with that, there is major emphasis on reducing the training time of the networks since the motive is to have a simple network capable of learning space-invariant features.<br />
<br />
= Models =<br />
The authors describe two kinds of DNN architectures: Deep Convolutional Neural Networks, and eccentricity dependent networks, with varying pooling strategies across space and scale. Of particular note is the pooling operation, as many researchers have suggested that this may be the cause of crowding in human perception.<br />
<br />
== Deep Convolutional Neural Networks ==<br />
The DCNN is a basic architecture with 3 convolutional layers, spatial 3x3 max-pooling with varying strides and a fully connected layer for classification as shown in the below figure. <br />
[[File:DCNN.png|800px|center]]<br />
<br />
The network is fed with images resized to 60x60, with mini-batches of 128 images, 32 feature channels for all convolutional layers, and convolutional filters of size 5x5 and stride 1.<br />
<br />
As highlighted earlier, the effect of pooling is into main consideration and hence three different configurations have been investigated as below: <br />
<br />
# '''No total pooling''' Feature maps sizes decrease only due to boundary effects, as the 3x3 max pooling has stride 1. The square feature maps sizes after each pool layer are 60-54-48-42.<br />
# '''Progressive pooling''' 3x3 pooling with a stride of 2 halves the square size of the feature maps, until we pool over what remains in the final layer, getting rid of any spatial information before the fully connected layer. (60-27-11-1).<br />
# '''At end pooling''' Same as no total pooling, but before the fully connected layer, max-pool over the entire feature map. (60-54-48-1).<br />
<br />
==Eccentricity-dependent Model==<br />
In order to take care of the scale invariance in the input image, the eccentricity dependent DNN is utilized. This was proposed as a model of the human visual cortex by [https://arxiv.org/pdf/1406.1770.pdf, Poggio et al] and later further studied in [2]. The main intuition behind this architecture is that as we increase eccentricity, the receptive fields also increase and hence the model will become invariant to changing input scales. The authors note that the width of each scale is roughly related to the amount of translation invariance for objects at that scale, simply because once the object is outside that window, the filter no longer observes it. Therefore, the authors say that the architecture emphasizes scale invariance over translation invariance, in contrast to traditional DCNNs. From a biological perspective, eye movement can compensate for the limitations of translation invariance, but compensating for scale invariance requires changing distance from the object. In this model, the input image is cropped into varying scales (11 crops increasing by a factor of <math>\sqrt{2}</math> which are then resized to 60x60 pixels) and then fed to the network. Exponentially interpolated crops are used over linearly interpolated crops since they produce fewer boundary effects while maintaining the same behavior qualitatively. The model computes an invariant representation of the input by sampling the inverted pyramid at a discrete set of scales with the same number of filters at each scale. Since the same number of filters are used for each scale, the smaller crops will be sampled at a high resolution while the larger crops will be sampled with a low resolution. These scales are fed into the network as an input channel to the convolutional layers and share the weights across scale and space. Due to the downsampling of the input image, this is equivalent to having receptive fields of varying sizes. Intuitively, this means that the network generalizes learnings across scales and is guaranteed by during back-propagation by averaging the error derivatives over all scale channels, then using the averages to compute weight adjustments. The same set of weight adjustments to the convolutional units across different scale channels is applied.<br />
[[File:EDM.png|2000x450px|center]]<br />
<br />
<br />
The architecture of this model is the same as the previous DCNN model with the only change being the extra filters added for each of the scales, so the number of parameters remains the same as DCNN models. The authors perform spatial pooling, the aforementioned ''At end pooling'' is used here, and scale pooling which helps in reducing the number of scales by taking the maximum value of corresponding locations in the feature maps across multiple scales. It has three configurations: (1) at the beginning, in which all the different scales are pooled together after the first layer, 11-1-1-1-1 (2) progressively, 11-7-5-3-1 and (3) at the end, 11-11-11-11-1, in which all 11 scales are pooled together at the last layer.<br />
<br />
===Contrast Normalization===<br />
Since there are multiple scales of an input image, in some experiments, normalization is performed such that the sum of the pixel intensities in each scale is in the same range [0,1] (this is to prevent smaller crops, which have more non-black pixels, from disproportionately dominating max-pooling across scales). The normalized pixel intensities are then divided by a factor proportional to the crop area [[File:sqrtf.png|60px]] where i=1 is the smallest crop.<br />
<br />
=Experiments=<br />
Targets are the set of objects to be recognized and flankers are the set of objects the model has not been trained to recognize, which act as clutter with respect to these target objects. The target objects are the even MNIST numbers having translational variance (shifted at different locations of the image along the horizontal axis), while flankers are from odd MNIST numbers, not MNIST dataset (contains alphabet letters) and Omniglot dataset (contains characters). Examples of the target and flanker configurations are shown below: <br />
[[File:eximages.png|800px|center]]<br />
<br />
The target and the object are referred to as ''a'' and ''x'' respectively with the below four configurations: <br />
# No flankers. Only the target object. (a in the plots) <br />
# One central flanker closer to the center of the image than the target. (xa) <br />
# One peripheral flanker closer to the boundary of the image that the target. (ax) <br />
# Two flankers spaced equally around the target, being both the same object, see Figure 1 above for an example (xax).<br />
<br />
Training is done using backpropagation with images of size <math>1920 px^2</math> with embedded targets objects and flankers of size of <math>120 px^2</math>. The training and test images are divided as per the usual MNIST configuration. To determine if there is a difference between the peripheral flankers and the central flankers, all the tests are performed in the right half image plane.<br />
<br />
==DNNs trained with Target and Flankers==<br />
This is a constant spacing training setup where identical flankers are placed at a distance of 120 pixels either side of the target(xax) with the target having translational variance. The tests are evaluated on (i) DCNN with at the end pooling, and (ii) eccentricity-dependent model with 11-11-11-11-1 scale pooling, at the end spatial pooling and contrast normalization. The results are reported by different flanker types <math>(xax,ax, xa)</math> at test. <br />
[[File:result1.png|x450px|center]]<br />
<br />
===Observations===<br />
* With the flanker configuration same as the training one, models are better at recognizing objects in clutter rather than isolated objects for all image locations<br />
* If the target-flanker spacing is changed, then models perform worse<br />
* the eccentricity model is much better at recognizing objects in isolation than the DCNN because the multi-scale crops divide the image into discrete regions, letting the model learn from image parts as well as the whole image<br />
* Only the eccentricity-dependent model is robust to different flanker configurations not included in training when the target is centered.<br />
<br />
==DNNs trained with Images with the Target in Isolation==<br />
Here the target objects are in isolation and with translational variance while the test-set is the same set of flanker configurations as used before. The constant spacing and constant eccentricity effect have been evaluated.<br />
<br />
[[File:result2.png|750x400px|center]]<br />
<br />
In addition to the evaluation of DCNNs in constant target eccentricity at 240 pixels, here they are tested with images in which the target is fixed at 720 pixels from the center of the image, as shown in Fig 3. Since the target is already at the edge of the visual field, a flanker cannot be more peripheral in the image than the target. Same results as for the 240 pixels target eccentricity can be extracted. The closer the flanker is to the target, the more accuracy decreases. Also, it can be seen that when the target is close to the image boundary, recognition is poor because of boundary effects eroding away information about the target.<br />
<br />
[[File:paper25_supplemental1.png|800px|center]]<br />
<br />
The authors also test the effect of flankers from different datasets on a DCNN model with at end pooling, with results shown in Fig. 7 below. Omniglot flankers crowd less than MNIST digits, and the authors note that this is because they are visually similar to MNIST digits, but are not actually digits, and thus activate the model's convolutional filters less than MNIST digits. The notMNIST digits however, result it more crowding. This is due to the fact that the different font style results in more high intensity pixels and edges. The intensity distributions for the 3 datasets is shown in the histograms in Fig. 12 below. The correlation between crowding and relative frequency of high intensity pixels can be seen from this figure.<br />
<br />
[[File:crowding_at_end_pooling.png|750px|center]]<br />
<br />
[[File:DCNN dataset histogram.png|750px|center]]<br />
<br />
<br />
===DCNN Observations===<br />
* Accuracy decreases with the increase in the number of flankers.<br />
* Unsurprisingly, CNNs are capable of being invariant to translations.<br />
* In the constant target eccentricity setup, where the target is fixed at the center of the image with varying target-flanker spacing, we observe that as the distance between target and flankers increase, recognition gets better.<br />
* Spatial pooling helps the network in learning invariance.<br />
* Flankers similar to the target object helps in recognition since they activate the convolutional filter more.<br />
* notMNIST data affects leads to more crowding since they have many more edges and white image pixels which activate the convolutional layers more.<br />
<br />
===Eccentric Model===<br />
The set-up is the same as explained earlier. The spacial pooling keeps constant. The effect of pooling across scales are investigated. The three configurations for scale pooling are (i) at the beginning, (ii)progressively and (iii) at the end. <br />
[[File:result3.png|750x400px|center]]<br />
<br />
====Observations====<br />
* The recognition accuracy is dependent on the eccentricity of the target object.<br />
* If the target is placed at the center and no contrast normalization is done, then the recognition accuracy is high since this model concentrates the most on the central region of the image.<br />
* If contrast normalization is done, then all the scales will contribute equal amount and hence the eccentricity dependence is removed.<br />
* Early pooling is harmful since it might take away the useful information very early which might be useful to the network.<br />
<br />
Without contrast normalization, the middle portion of the image can be focused more with high resolution so the target at the center with no normalization performs well in that case. But if normalization is done, then all the segments of the image contribute to the classification and hence the overall accuracy is not that great but the system becomes robust to the changes in eccentricity.<br />
<br />
==Complex Clutter==<br />
Here, the targets are randomly embedded into images of the Places dataset and shifted along horizontally in order to investigate model robustness when the target is not at the image center. Tests are performed on DCNN and the eccentricity model with and without contrast normalization using at end pooling. The results are shown in Figure 9 below. <br />
<br />
[[File:result4.png|750x400px|center]]<br />
<br />
====Observations====<br />
* Only eccentricity model without contrast normalization can recognize the target and only when the target is close to the image center.<br />
* The eccentricity model does not need to be trained on different types of clutter to become robust to those types of clutter, but it needs to fixate on the relevant part of the image to recognize the target. If it can fixate on the relevant part of the image, it can still discriminate it, even at different scales. This implies that the eccentricity model is robust to clutter.<br />
<br />
=Conclusions=<br />
This paper investigates the effect of crowding on a DNN. Using a simple technique of adding clutter in the model didn't improve the performance. We often think that just training the network with data similar to the test data would achieve good results in a general scenario too but that's not the case as we trained the model with flankers and it did not give us the ideal results for the target objects. The following 4 techniques influenced crowding in DNN:<br />
*'''Flanker Configuration''': When models are trained with images of objects in isolation, adding flankers harms recognition. Adding two flankers is the same or worse than adding just one and the smaller the spacing between flanker and target, the more crowding occurs. This is because the pooling operation merges nearby responses, such as the target and flankers if they are close.<br />
*'''The Similarity between target and flanker''': Flankers more similar to targets cause more crowding, because of the selectivity property of the learned DNN filters.<br />
*'''Dependence on target location and contrast normalization''': In DCNNs and eccentricity-dependent models with contrast normalization, recognition accuracy is the same across all eccentricities. In eccentricity-dependent networks without contrast normalization, recognition does not decrease despite the presence of clutter when the target is at the center of the image.<br />
*'''Effect of pooling''': adding pooling leads to better recognition accuracy of the models. Yet, in the eccentricity model, pooling across the scales too early in the hierarchy leads to lower accuracy.<br />
* The Eccentricity Dependent Models can be used for modeling the feedforward path of the primate visual cortex. <br />
* If target locations are proposed, then the system can become even more robust and hence a simple network can become robust to clutter while also reducing the amount of training data and time needed<br />
<br />
=Critique=<br />
This paper only tries to check the impact of flankers on targets as to how crowding can affect recognition but it does not propose anything novel in terms of architecture to take care of such type of crowding. The paper only shows that the eccentricity based model does better (than plain DCNN model) when the target is placed at the center of the image but maybe windowing over the frames the same way that a convolutional model passes a filter over an image, instead of taking crops starting from the middle, might help.<br />
<br />
This paper focuses on image classification. For a stronger argument, their model could be applied to the task of object detection. Perhaps crowding does not have as large of an impact when the objects of interest are localized by a region proposal network. Further, the artificial crowding introduced in the paper may not be random enough for the neural network to learn to classify the object of interest as opposed to the entire cluster of objects. For example, in the case of an even MNIST digit being flanked by two odd MNIST digits, there are only 25 possible combinations of flankers and targets.<br />
<br />
This paper does not provide a convincing argument that the problem of crowding as experienced by humans somehow shares a similar mechanism to the problem of DNN accuracy falling when there is more clutter in the scene. The multi-scale architecture does not appear similar to the distribution of rods and cones in the retina[https://www.ncbi.nlm.nih.gov/books/NBK10848/figure/A763/?report=objectonly]. It might be that the eccentric model does well when the target is centered because it is being sampled by more scales, not because it is similar to a primate visual cortex, and primates are able to recognize an object in clutter when looking directly at it.<br />
<br />
=References=<br />
# Volokitin A, Roig G, Poggio T:"Do Deep Neural Networks Suffer from Crowding?" Conference on Neural Information Processing Systems (NIPS). 2017<br />
# Francis X. Chen, Gemma Roig, Leyla Isik, Xavier Boix and Tomaso Poggio: "Eccentricity Dependent Deep Neural Networks for Modeling Human Vision" Journal of Vision. 17. 808. 10.1167/17.10.808.<br />
# J Harrison, W & W Remington, R & Mattingley, Jason. (2014). Visual crowding is anisotropic along the horizontal meridian during smooth pursuit. Journal of vision. 14. 10.1167/14.1.21. http://willjharrison.com/2014/01/new-paper-visual-crowding-is-anisotropic-along-the-horizontal-meridian-during-smooth-pursuit/</div>F7xiahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Do_Deep_Neural_Networks_Suffer_from_Crowding&diff=36468Do Deep Neural Networks Suffer from Crowding2018-04-21T04:20:19Z<p>F7xia: /* Critique */</p>
<hr />
<div>= Introduction =<br />
Since the increase in popularity of Deep Neural Networks (DNNs), there has been increased research in making machines capable of recognizing objects the same way humans do. Humans can recognize objects in ways that are invariant to scale, translation, and clutter. Crowding is visual effect suffered by humans, in which an object that can be recognized in isolation can no longer be recognized when other objects, called flankers, are placed close to it. This paper focuses on studying the impact of crowding on DNNs trained for object recognition by adding clutter to the images and then analyzing which models and settings suffer less from such effects. <br />
<br />
[[File:paper25_fig_crowding_ex.png|center|600px]]<br />
The figure shows a visual example of crowding [3]. Keep your eyes still and look at the dot in the center and try to identify the "A" in the two circles. You should see that it is much easier to make out the "A" in the right than in the left circle. The same "A" exists in both circles, however, the left circle contains flankers which are those line segments.<br />
<br />
Another common example to visualize the same:<br />
[[File:crowding-tigger.jpg|center|600px]]<br />
<br />
<br />
===Drawbacks of CNNs===<br />
CNNs fall short in explaining human perceptual invariance. Firstly, CNNs typically take input at a single uniform resolution. Biological measurements suggest that resolution is not uniform across the human visual field, but rather decays with eccentricity, i.e. distance from the center of focus. The major cause of this issue is the pooling layer in CNN structure. The pooling is an efficient technique but loses important spatial information. Pooling is also not capable to capture the hierarchical structure in image, which is also crucial to view point problems. Even more importantly, CNNs rely not only on weight-sharing but also on data augmentation to achieve transformation invariance and so obviously a lot of processing is needed for CNNs. <br />
<br />
The paper investigates two types of DNNs for crowding: traditional deep convolutional neural networks (DCNN) and a multi-scale eccentricity-dependent model which is an extension of the DCNNs and inspired by the retina where the receptive field size of the convolutional filters in the model grows with increasing distance from the center of the image, called the eccentricity and is explained below. The authors focus on the dependence of crowding on image factors, such as flanker configuration, target-flanker similarity, target eccentricity and premature pooling in particular. Along with that, there is major emphasis on reducing the training time of the networks since the motive is to have a simple network capable of learning space-invariant features.<br />
<br />
= Models =<br />
The authors describe two kinds of DNN architectures: Deep Convolutional Neural Networks, and eccentricity dependent networks, with varying pooling strategies across space and scale. Of particular note is the pooling operation, as many researchers have suggested that this may be the cause of crowding in human perception.<br />
<br />
== Deep Convolutional Neural Networks ==<br />
The DCNN is a basic architecture with 3 convolutional layers, spatial 3x3 max-pooling with varying strides and a fully connected layer for classification as shown in the below figure. <br />
[[File:DCNN.png|800px|center]]<br />
<br />
The network is fed with images resized to 60x60, with mini-batches of 128 images, 32 feature channels for all convolutional layers, and convolutional filters of size 5x5 and stride 1.<br />
<br />
As highlighted earlier, the effect of pooling is into main consideration and hence three different configurations have been investigated as below: <br />
<br />
# '''No total pooling''' Feature maps sizes decrease only due to boundary effects, as the 3x3 max pooling has stride 1. The square feature maps sizes after each pool layer are 60-54-48-42.<br />
# '''Progressive pooling''' 3x3 pooling with a stride of 2 halves the square size of the feature maps, until we pool over what remains in the final layer, getting rid of any spatial information before the fully connected layer. (60-27-11-1).<br />
# '''At end pooling''' Same as no total pooling, but before the fully connected layer, max-pool over the entire feature map. (60-54-48-1).<br />
<br />
==Eccentricity-dependent Model==<br />
In order to take care of the scale invariance in the input image, the eccentricity dependent DNN is utilized. This was proposed as a model of the human visual cortex by [https://arxiv.org/pdf/1406.1770.pdf, Poggio et al] and later further studied in [2]. The main intuition behind this architecture is that as we increase eccentricity, the receptive fields also increase and hence the model will become invariant to changing input scales. The authors note that the width of each scale is roughly related to the amount of translation invariance for objects at that scale, simply because once the object is outside that window, the filter no longer observes it. Therefore, the authors say that the architecture emphasizes scale invariance over translation invariance, in contrast to traditional DCNNs. From a biological perspective, eye movement can compensate for the limitations of translation invariance, but compensating for scale invariance requires changing distance from the object. In this model, the input image is cropped into varying scales (11 crops increasing by a factor of <math>\sqrt{2}</math> which are then resized to 60x60 pixels) and then fed to the network. Exponentially interpolated crops are used over linearly interpolated crops since they produce fewer boundary effects while maintaining the same behavior qualitatively. The model computes an invariant representation of the input by sampling the inverted pyramid at a discrete set of scales with the same number of filters at each scale. Since the same number of filters are used for each scale, the smaller crops will be sampled at a high resolution while the larger crops will be sampled with a low resolution. These scales are fed into the network as an input channel to the convolutional layers and share the weights across scale and space. Due to the downsampling of the input image, this is equivalent to having receptive fields of varying sizes. Intuitively, this means that the network generalizes learnings across scales and is guaranteed by during back-propagation by averaging the error derivatives over all scale channels, then using the averages to compute weight adjustments. The same set of weight adjustments to the convolutional units across different scale channels is applied.<br />
[[File:EDM.png|2000x450px|center]]<br />
<br />
<br />
The architecture of this model is the same as the previous DCNN model with the only change being the extra filters added for each of the scales, so the number of parameters remains the same as DCNN models. The authors perform spatial pooling, the aforementioned ''At end pooling'' is used here, and scale pooling which helps in reducing the number of scales by taking the maximum value of corresponding locations in the feature maps across multiple scales. It has three configurations: (1) at the beginning, in which all the different scales are pooled together after the first layer, 11-1-1-1-1 (2) progressively, 11-7-5-3-1 and (3) at the end, 11-11-11-11-1, in which all 11 scales are pooled together at the last layer.<br />
<br />
===Contrast Normalization===<br />
Since there are multiple scales of an input image, in some experiments, normalization is performed such that the sum of the pixel intensities in each scale is in the same range [0,1] (this is to prevent smaller crops, which have more non-black pixels, from disproportionately dominating max-pooling across scales). The normalized pixel intensities are then divided by a factor proportional to the crop area [[File:sqrtf.png|60px]] where i=1 is the smallest crop.<br />
<br />
=Experiments=<br />
Targets are the set of objects to be recognized and flankers are the set of objects the model has not been trained to recognize, which act as clutter with respect to these target objects. The target objects are the even MNIST numbers having translational variance (shifted at different locations of the image along the horizontal axis), while flankers are from odd MNIST numbers, not MNIST dataset (contains alphabet letters) and Omniglot dataset (contains characters). Examples of the target and flanker configurations are shown below: <br />
[[File:eximages.png|800px|center]]<br />
<br />
The target and the object are referred to as ''a'' and ''x'' respectively with the below four configurations: <br />
# No flankers. Only the target object. (a in the plots) <br />
# One central flanker closer to the center of the image than the target. (xa) <br />
# One peripheral flanker closer to the boundary of the image that the target. (ax) <br />
# Two flankers spaced equally around the target, being both the same object, see Figure 1 above for an example (xax).<br />
<br />
Training is done using backpropagation with images of size <math>1920 px^2</math> with embedded targets objects and flankers of size of <math>120 px^2</math>. The training and test images are divided as per the usual MNIST configuration. To determine if there is a difference between the peripheral flankers and the central flankers, all the tests are performed in the right half image plane.<br />
<br />
==DNNs trained with Target and Flankers==<br />
This is a constant spacing training setup where identical flankers are placed at a distance of 120 pixels either side of the target(xax) with the target having translational variance. The tests are evaluated on (i) DCNN with at the end pooling, and (ii) eccentricity-dependent model with 11-11-11-11-1 scale pooling, at the end spatial pooling and contrast normalization. The results are reported by different flanker types <math>(xax,ax, xa)</math> at test. <br />
[[File:result1.png|x450px|center]]<br />
<br />
===Observations===<br />
* With the flanker configuration same as the training one, models are better at recognizing objects in clutter rather than isolated objects for all image locations<br />
* If the target-flanker spacing is changed, then models perform worse<br />
* the eccentricity model is much better at recognizing objects in isolation than the DCNN because the multi-scale crops divide the image into discrete regions, letting the model learn from image parts as well as the whole image<br />
* Only the eccentricity-dependent model is robust to different flanker configurations not included in training when the target is centered.<br />
<br />
==DNNs trained with Images with the Target in Isolation==<br />
Here the target objects are in isolation and with translational variance while the test-set is the same set of flanker configurations as used before. The constant spacing and constant eccentricity effect have been evaluated.<br />
<br />
[[File:result2.png|750x400px|center]]<br />
<br />
In addition to the evaluation of DCNNs in constant target eccentricity at 240 pixels, here they are tested with images in which the target is fixed at 720 pixels from the center of the image, as shown in Fig 3. Since the target is already at the edge of the visual field, a flanker cannot be more peripheral in the image than the target. Same results as for the 240 pixels target eccentricity can be extracted. The closer the flanker is to the target, the more accuracy decreases. Also, it can be seen that when the target is close to the image boundary, recognition is poor because of boundary effects eroding away information about the target.<br />
<br />
[[File:paper25_supplemental1.png|800px|center]]<br />
<br />
The authors also test the effect of flankers from different datasets on a DCNN model with at end pooling, with results shown in Fig. 7 below. Omniglot flankers crowd less than MNIST digits, and the authors note that this is because they are visually similar to MNIST digits, but are not actually digits, and thus activate the model's convolutional filters less than MNIST digits. The notMNIST digits however, result it more crowding. This is due to the fact that the different font style results in more high intensity pixels and edges. The intensity distributions for the 3 datasets is shown in the histograms in Fig. 12 below. The correlation between crowding and relative frequency of high intensity pixels can be seen from this figure.<br />
<br />
[[File:crowding_at_end_pooling.png|750px|center]]<br />
<br />
[[File:DCNN dataset histogram.png|750px|center]]<br />
<br />
<br />
===DCNN Observations===<br />
* Accuracy decreases with the increase in the number of flankers.<br />
* Unsurprisingly, CNNs are capable of being invariant to translations.<br />
* In the constant target eccentricity setup, where the target is fixed at the center of the image with varying target-flanker spacing, we observe that as the distance between target and flankers increase, recognition gets better.<br />
* Spatial pooling helps the network in learning invariance.<br />
* Flankers similar to the target object helps in recognition since they activate the convolutional filter more.<br />
* notMNIST data affects leads to more crowding since they have many more edges and white image pixels which activate the convolutional layers more.<br />
<br />
===Eccentric Model===<br />
The set-up is the same as explained earlier. The spacial pooling keeps constant. The effect of pooling across scales are investigated. The three configurations for scale pooling are (i) at the beginning, (ii)progressively and (iii) at the end. <br />
[[File:result3.png|750x400px|center]]<br />
<br />
====Observations====<br />
* The recognition accuracy is dependent on the eccentricity of the target object.<br />
* If the target is placed at the center and no contrast normalization is done, then the recognition accuracy is high since this model concentrates the most on the central region of the image.<br />
* If contrast normalization is done, then all the scales will contribute equal amount and hence the eccentricity dependence is removed.<br />
* Early pooling is harmful since it might take away the useful information very early which might be useful to the network.<br />
<br />
Without contrast normalization, the middle portion of the image can be focused more with high resolution so the target at the center with no normalization performs well in that case. But if normalization is done, then all the segments of the image contribute to the classification and hence the overall accuracy is not that great but the system becomes robust to the changes in eccentricity.<br />
<br />
==Complex Clutter==<br />
Here, the targets are randomly embedded into images of the Places dataset and shifted along horizontally in order to investigate model robustness when the target is not at the image center. Tests are performed on DCNN and the eccentricity model with and without contrast normalization using at end pooling. The results are shown in Figure 9 below. <br />
<br />
[[File:result4.png|750x400px|center]]<br />
<br />
====Observations====<br />
* Only eccentricity model without contrast normalization can recognize the target and only when the target is close to the image center.<br />
* The eccentricity model does not need to be trained on different types of clutter to become robust to those types of clutter, but it needs to fixate on the relevant part of the image to recognize the target. If it can fixate on the relevant part of the image, it can still discriminate it, even at different scales. This implies that the eccentricity model is robust to clutter.<br />
<br />
=Conclusions=<br />
This paper investigates the effect of crowding on a DNN. Using a simple technique of adding clutter in the model didn't improve the performance. We often think that just training the network with data similar to the test data would achieve good results in a general scenario too but that's not the case as we trained the model with flankers and it did not give us the ideal results for the target objects. The following 4 techniques influenced crowding in DNN:<br />
*'''Flanker Configuration''': When models are trained with images of objects in isolation, adding flankers harms recognition. Adding two flankers is the same or worse than adding just one and the smaller the spacing between flanker and target, the more crowding occurs. This is because the pooling operation merges nearby responses, such as the target and flankers if they are close.<br />
*'''Similarity between target and flanker''': Flankers more similar to targets cause more crowding, because of the selectivity property of the learned DNN filters.<br />
*'''Dependence on target location and contrast normalization''': In DCNNs and eccentricity-dependent models with contrast normalization, recognition accuracy is the same across all eccentricities. In eccentricity-dependent networks without contrast normalization, recognition does not decrease despite the presence of clutter when the target is at the center of the image.<br />
*'''Effect of pooling''': adding pooling leads to better recognition accuracy of the models. Yet, in the eccentricity model, pooling across the scales too early in the hierarchy leads to lower accuracy.<br />
* The Eccentricity Dependent Models can be used for modeling the feedforward path of the primate visual cortex. <br />
* If target locations are proposed, then the system can become even more robust and hence a simple network can become robust to clutter while also reducing the amount of training data and time needed<br />
<br />
=Critique=<br />
This paper only tries to check the impact of flankers on targets as to how crowding can affect recognition but it does not propose anything novel in terms of architecture to take care of such type of crowding. The paper only shows that the eccentricity based model does better (than plain DCNN model) when the target is placed at the center of the image but maybe windowing over the frames the same way that a convolutional model passes a filter over an image, instead of taking crops starting from the middle, might help.<br />
<br />
This paper focuses on image classification. For a stronger argument, their model could be applied to the task of object detection. Perhaps crowding does not have as large of an impact when the objects of interest are localized by a region proposal network. Further, the artificial crowding introduced in the paper may not be random enough for the neural network to learn to classify the object of interest as opposed to the entire cluster of objects. For example, in the case of an even MNIST digit being flanked by two odd MNIST digits, there are only 25 possible combinations of flankers and targets.<br />
<br />
This paper does not provide a convincing argument that the problem of crowding as experienced by humans somehow shares a similar mechanism to the problem of DNN accuracy falling when there is more clutter in the scene. The multi-scale architecture does not appear similar to the distribution of rods and cones in the retina[https://www.ncbi.nlm.nih.gov/books/NBK10848/figure/A763/?report=objectonly]. It might be that the eccentric model does well when the target is centered because it is being sampled by more scales, not because it is similar to a primate visual cortex, and primates are able to recognize an object in clutter when looking directly at it.<br />
<br />
=References=<br />
# Volokitin A, Roig G, Poggio T:"Do Deep Neural Networks Suffer from Crowding?" Conference on Neural Information Processing Systems (NIPS). 2017<br />
# Francis X. Chen, Gemma Roig, Leyla Isik, Xavier Boix and Tomaso Poggio: "Eccentricity Dependent Deep Neural Networks for Modeling Human Vision" Journal of Vision. 17. 808. 10.1167/17.10.808.<br />
# J Harrison, W & W Remington, R & Mattingley, Jason. (2014). Visual crowding is anisotropic along the horizontal meridian during smooth pursuit. Journal of vision. 14. 10.1167/14.1.21. http://willjharrison.com/2014/01/new-paper-visual-crowding-is-anisotropic-along-the-horizontal-meridian-during-smooth-pursuit/</div>F7xiahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Rethinking_the_Smaller-Norm-Less-Informative_Assumption_in_Channel_Pruning_of_Convolutional_Layers&diff=36466stat946w18/Rethinking the Smaller-Norm-Less-Informative Assumption in Channel Pruning of Convolutional Layers2018-04-21T04:18:58Z<p>F7xia: /* Results */</p>
<hr />
<div>== Introduction ==<br />
<br />
With the recent and ongoing surge in low-power, intelligent agents (such as wearables, smartphones, and IoT devices), there exists a growing need for machine learning models to work well in resource-constrained environments. Deep learning models have achieved state-of-the-art on a broad range of tasks; however, they are difficult to deploy in their original forms. For example, AlexNet (Krizhevsky et al., 2012), a model for image classification, contains 61 million parameters and requires 1.5 billion floating point operations (FLOPs) in one inference pass. A more accurate model, ResNet-50 (He et al., 2016), has 25 million parameters but requires 4.08 FLOPs. A high-end desktop GPU such as a Titan Xp is capable of [https://www.nvidia.com/en-us/titan/titan-xp/ (12 TFLOPS (tera-FLOPs per second))], while the Adreno 540 GPU used in a Samsung Galaxy S8 is only capable of [https://gflops.surge.sh (567 GFLOPS)] which is less than 5% of the Titan Xp. Clearly, it would be difficult to deploy and run these models on low-power devices.<br />
<br />
In general, model compression can be accomplished using four main non-mutually exclusive methods (Cheng et al., 2017): weight pruning, quantization, matrix transformations, and weight tying. By non-mutually exclusive, we mean that these methods can be used not only separately but also in combination for compressing a single model; the use of one method does not exclude any of the other methods from being viable. <br />
<br />
Ye et al. (2018) explores pruning entire channels in a convolutional neural network (CNN). Past work has mostly focused on norm[based or error-based heuristics to prune channels; instead, Ye et al. (2018) show that their approach is easily reproducible and has favorable qualities from an optimization standpoint. In other words, they argue that the norm-based assumption is not as informative or theoretically justified as their approach, and provide strong empirical evidence of these findings.<br />
<br />
== Motivation ==<br />
<br />
Some previous works on pruning channel filters (Li et al., 2016; Molchanov et al., 2016) have focused on using the L1 norm to determine the importance of a channel. Ye et al. (2018) show that, in the deep linear convolution case, penalizing the per-layer norm is coarse-grained; they argue that one cannot assign different coefficients to L1 penalties associated with different layers without risking the loss function being susceptible to trivial re-parameterizations. As an example, consider the following deep linear convolutional neural network with modified LASSO loss:<br />
<br />
$$\min \mathbb{E}_D \lVert W_{2n} * \dots * W_1 x - y\rVert^2 + \lambda \sum_{i=1}^n \lVert W_{2i} \rVert_1$$<br />
<br />
where W are the weights and * is convolution. Here we have chosen the coefficient 0 for the L1 penalty associated with odd-numbered layers and the coefficient 1 for the L1 penalty associated with even-numbered layers. This loss is susceptible to trivial re-parameterizations: without affecting the least-squares loss, we can always reduce the LASSO loss by halving the weights of all even-numbered layers and doubling the weights of all odd-numbered layers.<br />
<br />
Furthermore, batch normalization (Ioffe, 2015) is incompatible with this method of weight regularization. Consider batch normalization at the <math>l</math>-th layer.<br />
<br />
<center><math>x^{l+1} = max\{\gamma \cdot BN_{\mu,\sigma,\epsilon}(W^l * x^l) + \beta, 0\}</math></center><br />
<br />
Due to the batch normalization, any uniform scaling of <math>W^l</math> which would change <math>l_1</math> and <math>l_2</math> norms, but has no have no effect on <math>x^{l+1}</math>. Thus, when trying to minimize weight norms of multiple layers, it is unclear how to properly choose penalties for each layer. Therefore, penalizing the norm of a filter in a deep convolutional network is hard to justify from a theoretical perspective.<br />
<br />
In contrast with these existing approaches, the authors focus on enforcing sparsity of a tiny set of parameters in CNN — scale parameter <math>\gamma</math> in all batch normalization. Not only placing sparse constraints on <math>\gamma</math> is simpler and easier to monitor, but more importantly, they put forward two reasons:<br />
<br />
1. Every <math>\gamma</math> always multiplies a normalized random variable, thus the channel importance becomes comparable across different layers by measuring the magnitude values of <math>\gamma</math>;<br />
<br />
2. The reparameterization effect across different layers is avoided if its subsequent convolution layer is also batch-normalized. In other words, the impacts from the scale changes of <math>\gamma</math> parameter are independent across different layers.<br />
<br />
Thus, although not providing a complete theoretical guarantee on loss, Ye et al. (2018) develop a pruning technique that claims to be more justified than norm-based pruning is.<br />
<br />
== Method ==<br />
<br />
At a high level, Ye et al. (2018) propose that, instead of discovering sparsity via penalizing the per-filter or per-channel norm, penalize the batch normalization scale parameters ''gamma'' instead. The reasoning is that by having fewer parameters to constrain and working with normalized values, sparsity is easier to enforce, monitor, and learn. Having sparse batch normalization terms has the effect of pruning '''entire''' channels: if ''gamma'' is zero, then the output at that layer becomes constant (the bias term), and thus the preceding channels can be pruned.<br />
<br />
=== Summary ===<br />
<br />
The basic algorithm can be summarized as follows:<br />
<br />
1. Penalize the L1-norm of the batch normalization scaling parameters in the loss<br />
<br />
2. Train until loss plateaus<br />
<br />
3. Remove channels that correspond to a downstream zero in batch normalization<br />
<br />
4. Fine-tune the pruned model using regular learning<br />
<br />
=== Details ===<br />
<br />
There still exist a few problems that this summary has not addressed so far. Sub-gradient descent is known to have inverse square root convergence rate on subdifferentials (Gordon et al., 2012), so the sparsity gradient descent update may be suboptimal. Furthermore, the sparse penalty needs to be normalized with respect to previous channel sizes, since the penalty should be roughly equally distributed across all convolution layers.<br />
<br />
==== Slow Convergence ====<br />
To address the issue of slow convergence, Ye et al. (2018) use an iterative shrinking-thresholding algorithm (ISTA) (Beck & Teboulle, 2009) to update the batch normalization scale parameter. The intuition for ISTA is that the structure of the optimization objective can be taken advantage of. Consider: $$L(x) = f(x) + g(x).$$<br />
<br />
Let ''f'' be the model loss and ''g'' be the non-differentiable penalty (LASSO). ISTA is able to use the structure of the loss and converge in O(1/n), instead of O(1/sqrt(n)) when using subgradient descent, which assumes no structure about the loss. Even though ISTA is used in convex settings, Ye et. al (2018) argue that it still performs better than gradient descent.<br />
<br />
==== Penalty Normalization ====<br />
<br />
In the paper, Ye et al. (2018) normalize the per-layer sparse penalty with respect to the global input size, the current layer kernel areas, the previous layer kernel areas, and the local input feature map area.<br />
<br />
[[File:Screenshot_from_2018-02-28_17-06-41.png]] (Ye et al., 2018)<br />
<br />
To control the global penalty, a hyperparamter ''rho'' is multiplied with all the per-layer ''lambda'' in the final loss.<br />
<br />
=== Steps ===<br />
<br />
The final algorithm can be summarized as follows:<br />
<br />
1. Compute the per-layer normalized sparse penalty constant <math>\lambda</math><br />
<br />
2. Compute the global LASSO loss with global scaling constant <math>\rho</math><br />
<br />
3. Until convergence, train scaling parameters using ISTA and non-scaling parameters using regular gradient descent.<br />
<br />
4. Remove channels that correspond to a downstream zero in batch normalization<br />
<br />
5. Fine-tune the pruned model using regular learning<br />
<br />
== Results ==<br />
<br />
The authors show state-of-the-art performance, compared with other channel-pruning approaches. It is important to note that it would be unfair to compare against general pruning approaches; channel pruning specifically removes channels without introducing '''intra-kernel sparsity''', whereas other pruning approaches introduce irregular kernel sparsity and hence computational inefficiencies.<br />
<br />
=== CIFAR-10 Experiment ===<br />
<br />
Model A is trained with a sparse penalty of <math>\rho = 0.0002</math> for 30 thousand steps, and then increased to <math>\rho = 0.001</math>. Model B is trained by taking Model A and increasing the sparse penalty up to 0.002. Similarly Model C is a continuation of Model B with a penalty of 0.008. <br />
<br />
[[File:Screenshot_from_2018-02-28_17-24-25.png]]<br />
<br />
For the convNet, reducing the number of parameters in the base model increased the accuracy in model A. This suggests that the base model is over-parameterized. Otherwise, there would be a trade-off of accuracy and model efficiency.<br />
<br />
=== ILSVRC2012 Experiment ===<br />
<br />
The authors note that while ResNet-101 takes hundreds of epochs to train, pruning only takes 5-10, with fine-tuning adding another 2, giving an empirical example how long pruning might take in practice. Both models were trained with an aggressive sparsity penalty of 0.1.<br />
<br />
[[File:Screenshot_from_2018-02-28_17-24-36.png]]<br />
<br />
=== Image Foreground-Background Segmentation Experiment ===<br />
<br />
The authors note that it is common practice to take a network with pre-trained on a large task and fine-tune it to apply it to a different, smaller task. One might expect there might be some extra channels that while useful for the large task, can be omitted for the simpler task. This experiment replicated that use-case by taking a NN originally trained on multiple datasets and applying the proposed pruning method. The authors note that the pruned network actually improves over the original network in all but the most challenging test dataset, which is in line with the initial expectation. The model was trained with a sparsity penalty of 0.5 and the results are shown in table below<br />
<br />
[[File:paper8_Segmentation.png|700px]]<br />
<br />
The neural network used in this experiment is composed of two branches:<br />
* An inception branch that locates the foreground objects<br />
* A DenseNet branch to regress the edges<br />
<br />
It was found that the pruning primarily affected the inception branch as shown in Figure 1 below. This likely explains the poor performance on more challenging datasets as a result of a higher requirement on foreground objects, which has been impacted by the pruning of the inception branch.<br />
<br />
[[File:pruned_inception.png|600px]]<br />
<br />
== Conclusion ==<br />
<br />
Pruning large neural architectures to fit on low-power devices is an important task. For a real quantitative measure of efficiency, it would be interesting to conduct actual power measurements on the pruned models versus baselines; reduction in FLOPs doesn't necessarily correspond with vastly reduced power since memory accesses dominate energy consumption (Han et al., 2015). However, the reduction in the number of FLOPs and parameters is encouraging, so moderate power savings should be expected.<br />
<br />
It would also be interesting to combine multiple approaches, or "throw the whole kitchen sink" at this task. Han et al. (2015) sparked much recent interest by successfully combining weight pruning, quantization, and Huffman coding without loss in accuracy. However, their approach introduced irregular sparsity in the convolutional layers, so a direct comparison cannot be made.<br />
<br />
In conclusion, this novel, theoretically-motivated interpretation of channel pruning was successfully applied to several important tasks.<br />
<br />
== Implementation == <br />
A PyTorch implementation is available here: https://github.com/jack-willturner/batchnorm-pruning<br />
<br />
<br />
== References ==<br />
<br />
* Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097-1105).<br />
* He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).<br />
* Cheng, Y., Wang, D., Zhou, P., & Zhang, T. (2017). A Survey of Model Compression and Acceleration for Deep Neural Networks. arXiv preprint arXiv:1710.09282.<br />
* Ye, J., Lu, X., Lin, Z., & Wang, J. Z. (2018). Rethinking the Smaller-Norm-Less-Informative Assumption in Channel Pruning of Convolution Layers. arXiv preprint arXiv:1802.00124.<br />
* Li, H., Kadav, A., Durdanovic, I., Samet, H., & Graf, H. P. (2016). Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710.<br />
* Molchanov, P., Tyree, S., Karras, T., Aila, T., & Kautz, J. (2016). Pruning convolutional neural networks for resource efficient inference.<br />
* Ioffe, S., & Szegedy, C. (2015, June). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning (pp. 448-456).<br />
* Gordon, G., & Tibshirani, R. (2012). Subgradient method. https://www.cs.cmu.edu/~ggordon/10725-F12/slides/06-sg-method.pdf<br />
* Beck, A., & Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1), 183-202.<br />
* Han, S., Mao, H., & Dally, W. J. (2015). Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149</div>F7xiahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Rethinking_the_Smaller-Norm-Less-Informative_Assumption_in_Channel_Pruning_of_Convolutional_Layers&diff=36465stat946w18/Rethinking the Smaller-Norm-Less-Informative Assumption in Channel Pruning of Convolutional Layers2018-04-21T04:18:49Z<p>F7xia: /* Results */</p>
<hr />
<div>== Introduction ==<br />
<br />
With the recent and ongoing surge in low-power, intelligent agents (such as wearables, smartphones, and IoT devices), there exists a growing need for machine learning models to work well in resource-constrained environments. Deep learning models have achieved state-of-the-art on a broad range of tasks; however, they are difficult to deploy in their original forms. For example, AlexNet (Krizhevsky et al., 2012), a model for image classification, contains 61 million parameters and requires 1.5 billion floating point operations (FLOPs) in one inference pass. A more accurate model, ResNet-50 (He et al., 2016), has 25 million parameters but requires 4.08 FLOPs. A high-end desktop GPU such as a Titan Xp is capable of [https://www.nvidia.com/en-us/titan/titan-xp/ (12 TFLOPS (tera-FLOPs per second))], while the Adreno 540 GPU used in a Samsung Galaxy S8 is only capable of [https://gflops.surge.sh (567 GFLOPS)] which is less than 5% of the Titan Xp. Clearly, it would be difficult to deploy and run these models on low-power devices.<br />
<br />
In general, model compression can be accomplished using four main non-mutually exclusive methods (Cheng et al., 2017): weight pruning, quantization, matrix transformations, and weight tying. By non-mutually exclusive, we mean that these methods can be used not only separately but also in combination for compressing a single model; the use of one method does not exclude any of the other methods from being viable. <br />
<br />
Ye et al. (2018) explores pruning entire channels in a convolutional neural network (CNN). Past work has mostly focused on norm[based or error-based heuristics to prune channels; instead, Ye et al. (2018) show that their approach is easily reproducible and has favorable qualities from an optimization standpoint. In other words, they argue that the norm-based assumption is not as informative or theoretically justified as their approach, and provide strong empirical evidence of these findings.<br />
<br />
== Motivation ==<br />
<br />
Some previous works on pruning channel filters (Li et al., 2016; Molchanov et al., 2016) have focused on using the L1 norm to determine the importance of a channel. Ye et al. (2018) show that, in the deep linear convolution case, penalizing the per-layer norm is coarse-grained; they argue that one cannot assign different coefficients to L1 penalties associated with different layers without risking the loss function being susceptible to trivial re-parameterizations. As an example, consider the following deep linear convolutional neural network with modified LASSO loss:<br />
<br />
$$\min \mathbb{E}_D \lVert W_{2n} * \dots * W_1 x - y\rVert^2 + \lambda \sum_{i=1}^n \lVert W_{2i} \rVert_1$$<br />
<br />
where W are the weights and * is convolution. Here we have chosen the coefficient 0 for the L1 penalty associated with odd-numbered layers and the coefficient 1 for the L1 penalty associated with even-numbered layers. This loss is susceptible to trivial re-parameterizations: without affecting the least-squares loss, we can always reduce the LASSO loss by halving the weights of all even-numbered layers and doubling the weights of all odd-numbered layers.<br />
<br />
Furthermore, batch normalization (Ioffe, 2015) is incompatible with this method of weight regularization. Consider batch normalization at the <math>l</math>-th layer.<br />
<br />
<center><math>x^{l+1} = max\{\gamma \cdot BN_{\mu,\sigma,\epsilon}(W^l * x^l) + \beta, 0\}</math></center><br />
<br />
Due to the batch normalization, any uniform scaling of <math>W^l</math> which would change <math>l_1</math> and <math>l_2</math> norms, but has no have no effect on <math>x^{l+1}</math>. Thus, when trying to minimize weight norms of multiple layers, it is unclear how to properly choose penalties for each layer. Therefore, penalizing the norm of a filter in a deep convolutional network is hard to justify from a theoretical perspective.<br />
<br />
In contrast with these existing approaches, the authors focus on enforcing sparsity of a tiny set of parameters in CNN — scale parameter <math>\gamma</math> in all batch normalization. Not only placing sparse constraints on <math>\gamma</math> is simpler and easier to monitor, but more importantly, they put forward two reasons:<br />
<br />
1. Every <math>\gamma</math> always multiplies a normalized random variable, thus the channel importance becomes comparable across different layers by measuring the magnitude values of <math>\gamma</math>;<br />
<br />
2. The reparameterization effect across different layers is avoided if its subsequent convolution layer is also batch-normalized. In other words, the impacts from the scale changes of <math>\gamma</math> parameter are independent across different layers.<br />
<br />
Thus, although not providing a complete theoretical guarantee on loss, Ye et al. (2018) develop a pruning technique that claims to be more justified than norm-based pruning is.<br />
<br />
== Method ==<br />
<br />
At a high level, Ye et al. (2018) propose that, instead of discovering sparsity via penalizing the per-filter or per-channel norm, penalize the batch normalization scale parameters ''gamma'' instead. The reasoning is that by having fewer parameters to constrain and working with normalized values, sparsity is easier to enforce, monitor, and learn. Having sparse batch normalization terms has the effect of pruning '''entire''' channels: if ''gamma'' is zero, then the output at that layer becomes constant (the bias term), and thus the preceding channels can be pruned.<br />
<br />
=== Summary ===<br />
<br />
The basic algorithm can be summarized as follows:<br />
<br />
1. Penalize the L1-norm of the batch normalization scaling parameters in the loss<br />
<br />
2. Train until loss plateaus<br />
<br />
3. Remove channels that correspond to a downstream zero in batch normalization<br />
<br />
4. Fine-tune the pruned model using regular learning<br />
<br />
=== Details ===<br />
<br />
There still exist a few problems that this summary has not addressed so far. Sub-gradient descent is known to have inverse square root convergence rate on subdifferentials (Gordon et al., 2012), so the sparsity gradient descent update may be suboptimal. Furthermore, the sparse penalty needs to be normalized with respect to previous channel sizes, since the penalty should be roughly equally distributed across all convolution layers.<br />
<br />
==== Slow Convergence ====<br />
To address the issue of slow convergence, Ye et al. (2018) use an iterative shrinking-thresholding algorithm (ISTA) (Beck & Teboulle, 2009) to update the batch normalization scale parameter. The intuition for ISTA is that the structure of the optimization objective can be taken advantage of. Consider: $$L(x) = f(x) + g(x).$$<br />
<br />
Let ''f'' be the model loss and ''g'' be the non-differentiable penalty (LASSO). ISTA is able to use the structure of the loss and converge in O(1/n), instead of O(1/sqrt(n)) when using subgradient descent, which assumes no structure about the loss. Even though ISTA is used in convex settings, Ye et. al (2018) argue that it still performs better than gradient descent.<br />
<br />
==== Penalty Normalization ====<br />
<br />
In the paper, Ye et al. (2018) normalize the per-layer sparse penalty with respect to the global input size, the current layer kernel areas, the previous layer kernel areas, and the local input feature map area.<br />
<br />
[[File:Screenshot_from_2018-02-28_17-06-41.png]] (Ye et al., 2018)<br />
<br />
To control the global penalty, a hyperparamter ''rho'' is multiplied with all the per-layer ''lambda'' in the final loss.<br />
<br />
=== Steps ===<br />
<br />
The final algorithm can be summarized as follows:<br />
<br />
1. Compute the per-layer normalized sparse penalty constant <math>\lambda</math><br />
<br />
2. Compute the global LASSO loss with global scaling constant <math>\rho</math><br />
<br />
3. Until convergence, train scaling parameters using ISTA and non-scaling parameters using regular gradient descent.<br />
<br />
4. Remove channels that correspond to a downstream zero in batch normalization<br />
<br />
5. Fine-tune the pruned model using regular learning<br />
<br />
== Results ==<br />
<br />
<br />
<br />
=== CIFAR-10 Experiment ===<br />
<br />
Model A is trained with a sparse penalty of <math>\rho = 0.0002</math> for 30 thousand steps, and then increased to <math>\rho = 0.001</math>. Model B is trained by taking Model A and increasing the sparse penalty up to 0.002. Similarly Model C is a continuation of Model B with a penalty of 0.008. <br />
<br />
[[File:Screenshot_from_2018-02-28_17-24-25.png]]<br />
<br />
For the convNet, reducing the number of parameters in the base model increased the accuracy in model A. This suggests that the base model is over-parameterized. Otherwise, there would be a trade-off of accuracy and model efficiency.<br />
<br />
=== ILSVRC2012 Experiment ===<br />
<br />
The authors note that while ResNet-101 takes hundreds of epochs to train, pruning only takes 5-10, with fine-tuning adding another 2, giving an empirical example how long pruning might take in practice. Both models were trained with an aggressive sparsity penalty of 0.1.<br />
<br />
[[File:Screenshot_from_2018-02-28_17-24-36.png]]<br />
<br />
=== Image Foreground-Background Segmentation Experiment ===<br />
<br />
The authors note that it is common practice to take a network with pre-trained on a large task and fine-tune it to apply it to a different, smaller task. One might expect there might be some extra channels that while useful for the large task, can be omitted for the simpler task. This experiment replicated that use-case by taking a NN originally trained on multiple datasets and applying the proposed pruning method. The authors note that the pruned network actually improves over the original network in all but the most challenging test dataset, which is in line with the initial expectation. The model was trained with a sparsity penalty of 0.5 and the results are shown in table below<br />
<br />
[[File:paper8_Segmentation.png|700px]]<br />
<br />
The neural network used in this experiment is composed of two branches:<br />
* An inception branch that locates the foreground objects<br />
* A DenseNet branch to regress the edges<br />
<br />
It was found that the pruning primarily affected the inception branch as shown in Figure 1 below. This likely explains the poor performance on more challenging datasets as a result of a higher requirement on foreground objects, which has been impacted by the pruning of the inception branch.<br />
<br />
[[File:pruned_inception.png|600px]]<br />
<br />
== Conclusion ==<br />
<br />
Pruning large neural architectures to fit on low-power devices is an important task. For a real quantitative measure of efficiency, it would be interesting to conduct actual power measurements on the pruned models versus baselines; reduction in FLOPs doesn't necessarily correspond with vastly reduced power since memory accesses dominate energy consumption (Han et al., 2015). However, the reduction in the number of FLOPs and parameters is encouraging, so moderate power savings should be expected.<br />
<br />
It would also be interesting to combine multiple approaches, or "throw the whole kitchen sink" at this task. Han et al. (2015) sparked much recent interest by successfully combining weight pruning, quantization, and Huffman coding without loss in accuracy. However, their approach introduced irregular sparsity in the convolutional layers, so a direct comparison cannot be made.<br />
<br />
In conclusion, this novel, theoretically-motivated interpretation of channel pruning was successfully applied to several important tasks.<br />
<br />
== Implementation == <br />
A PyTorch implementation is available here: https://github.com/jack-willturner/batchnorm-pruning<br />
<br />
<br />
== References ==<br />
<br />
* Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097-1105).<br />
* He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).<br />
* Cheng, Y., Wang, D., Zhou, P., & Zhang, T. (2017). A Survey of Model Compression and Acceleration for Deep Neural Networks. arXiv preprint arXiv:1710.09282.<br />
* Ye, J., Lu, X., Lin, Z., & Wang, J. Z. (2018). Rethinking the Smaller-Norm-Less-Informative Assumption in Channel Pruning of Convolution Layers. arXiv preprint arXiv:1802.00124.<br />
* Li, H., Kadav, A., Durdanovic, I., Samet, H., & Graf, H. P. (2016). Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710.<br />
* Molchanov, P., Tyree, S., Karras, T., Aila, T., & Kautz, J. (2016). Pruning convolutional neural networks for resource efficient inference.<br />
* Ioffe, S., & Szegedy, C. (2015, June). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning (pp. 448-456).<br />
* Gordon, G., & Tibshirani, R. (2012). Subgradient method. https://www.cs.cmu.edu/~ggordon/10725-F12/slides/06-sg-method.pdf<br />
* Beck, A., & Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1), 183-202.<br />
* Han, S., Mao, H., & Dally, W. J. (2015). Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149</div>F7xiahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/MaskRNN:_Instance_Level_Video_Object_Segmentation&diff=36463stat946w18/MaskRNN: Instance Level Video Object Segmentation2018-04-21T04:15:25Z<p>F7xia: /* MaskRNN: Binary Segmentation Network */</p>
<hr />
<div>== Introduction ==<br />
Deep Learning has produced state of the art results in many computer vision tasks like image classification, object localization, object detection, object segmentation, semantic segmentation and instance level video object segmentation. Image classification classify the image based on the prominent objects. Object localization is the task of finding objects’ location in the frame. Object Segmentation task involves providing a pixel map which represents the pixel wise location of the objects in the image. Semantic segmentation task attempts at segmenting the image into meaningful parts. Instance level video object segmentation is the task of consistent object segmentation in video sequences. Deforming shapes, fast movements, and occlusion from multiple objects, are just some of the significant challenges in instance level video object segmentation.<br />
<br />
There are 2 different types of video object segmentation, Unsupervised and Semi-supervised. <br />
* In unsupervised video object segmentation, the task is to find the salient objects and track the main objects in the video. <br />
* In a semi-supervised setting, the ground truth mask of the salient objects is provided for the first frame. The task is thus simplified to only track the objects required. <br />
<br />
In this paper, the authors look at an unsupervised video object segmentation technique.<br />
<br />
== Background Papers ==<br />
Video object segmentation has been performed using spatio-temporal graphs [[https://pdfs.semanticscholar.org/7221/c3470fa89879aab3ef270570ced15cde28de.pdf 5], [http://ieeexplore.ieee.org/abstract/document/5539893/ 6], [http://openaccess.thecvf.com/content_iccv_2013/papers/Li_Video_Segmentation_by_2013_ICCV_paper.pdf 7], [https://link.springer.com/content/pdf/10.1007/s11263-011-0512-5.pdf 8]] and deep learning. The graph based methods construct 3D spatio-temporal graphs in order to model the inter and the intra-frame relationship of pixels or superpixels in a video. Hence they are computationally slower than deep learning methods and are unable to run at real-time. There are 2 main deep learning techniques for semi-supervised video object segmentation: One Shot Video Object Segmentation (OSVOS) and Learning Video Object Segmentation from Static Images (MaskTrack). Following is a brief description of the new techniques introduced by these papers for semi-supervised video object segmentation task.<br />
<br />
=== OSVOS (One-Shot Video Object Segmentation) ===<br />
<br />
[[File:OSVOS.jpg | 1000px]]<br />
<br />
This paper introduces the technique of using a frame-by-frame object segmentation without any temporal information from the previous frames of the video. The paper uses a VGG-16 network with pre-trained weights from image classification task. This network is then converted into a fully-connected network (FCN) by removing the fully connected dense layers at the end and adding convolution layers to generate a segment mask of the input. This network is then trained on the DAVIS 2016 dataset.<br />
<br />
During testing, the trained VGG-16 FCN is fine-tuned using the first frame of the video using the ground truth. Because this is a semi-supervised case, the segmented mask (ground truth) for the first frame is available. The first frame data is augmented by zooming/rotating/flipping the first frame and the associated segment mask.<br />
<br />
=== MaskTrack (Learning Video Object Segmentation from Static Images) ===<br />
<br />
[[File:MaskTrack.jpg | 500px]]<br />
<br />
MaskTrack takes the output of the previous frame to improve its predictions and to generate the segmentation mask for the next frame. Thus the input to the network is 4 channel wide (3 RGB channels from the frame at time <math>t</math> plus one binary segmentation mask from frame <math>t-1</math>). The output of the network is the binary segmentation mask for frame at time <math>t</math>. Using the binary segmentation mask (referred to as guided object segmentation in the paper), the network is able to use some temporal information from the previous frame to improve its segmentation mask prediction for the next frame.<br />
<br />
The model of the MaskTrack network is similar to a modular VGG-16 and is referred to as MaskTrack ConvNet in the paper. The network is trained offline on saliency segmentation datasets: ECSSD, MSRA 10K, SOD and PASCAL-S. The input mask for the binary segmentation mask channel is generated via non-rigid deformation and affine transformation of the ground truth segmentation mask. Similar data-augmentation techniques are also used during online training. Just like OSVOS, MaskTrack uses the first frame as ground truth (with augmented images) to fine-tune the network to improve prediction score for the particular video sequence.<br />
<br />
A parallel ConvNet network is used to generate a predicted segment mask based on the optical flow magnitude. The optical flow between 2 frames is calculated using the EpicFlow algorithm. The output of the two networks is combined using an averaging operation to generate the final predicted segmented mask.<br />
<br />
Table 1 below gives a summary comparison of the different state of the art algorithms. The noteworthy information included in this table is that the technique presented in this paper is the only one which takes into account long-term temporal information. This is accomplished with a recurrent neural net. Furthermore, the bounding box is also estimated instead of just a segmentation mask. The authors claim that this allows the incorporation of a location prior from the tracked object.<br />
<br />
[[File:Paper19-SegmentationComp.png]]<br />
<br />
== Dataset ==<br />
The three major datasets used in this paper are DAVIS-2016, DAVIS-2017 and Segtrack v2. DAVIS-2016 dataset provides video sequences with only one segment mask for all salient objects. DAVIS-2017 improves the ground truth data by providing segmentation mask for each salient object as a separate color segment mask. Segtrack v2 also provides multiple segmentation mask for all salient objects in the video sequence. These datasets try to recreate real-life scenarios like occlusions, low resolution videos, background clutter, motion blur, fast motion etc.<br />
<br />
== MaskRNN: Introduction ==<br />
Most techniques mentioned above don’t work directly on instance level segmentation of the objects through the video sequence. The above approaches focus on image segmentation on each frame and using additional information (mask propagation and optical flow) from the preceding frame perform predictions for the current frame. To address the instance level segmentation problem, MaskRNN proposes a framework where the salient objects are tracked and segmented by capturing the temporal information in the video sequence using a recurrent neural network.<br />
<br />
== MaskRNN: Overview ==<br />
In a video sequence <math>I = \{I_1, I_2, …, I_T\}</math>, the sequence of <math>T</math> frames are given as input to the network, where the video sequence contains <math>N</math> salient objects. The ground truth for the first frame <math>y_1^*</math> is also provided for <math>N</math> salient objects.<br />
In this paper, the problem is formulated as a time dependency problem and using a recurrent neural network, the prediction of the previous frame influences the prediction of the next frame. The approach also computes the optical flow between frames (optical flow is the apparent motion of objects between two consecutive frames in the form of a 2D vector field representing the displacement in brightness patterns for each pixel, apparent because it depends on the relative motion between the observer and the scene) and uses that as the input to the neural network. The optical flow is also used to align the output of the predicted mask. “The warped prediction, the optical flow itself, and the appearance of the current frame are then used as input for <math>N</math> deep nets, one for each of the <math>N</math> objects.”[1 - MaskRNN] Each deep net is a made of an object localization network and a binary segmentation network. The binary segmentation network is used to generate the segmentation mask for an object. The object localization network is used to alleviate outliers from the predictions. The final prediction of the segmentation mask is generated by merging the predictions of the 2 networks. For <math>N</math> objects, there are N deep nets which predict the mask for each salient object. The predictions are then merged into a single prediction using an <math>\text{argmax}</math> operation at test time.<br />
<br />
== MaskRNN: Multiple Instance Level Segmentation ==<br />
<br />
[[File:2ObjectSeg.jpg | 850px]]<br />
<br />
Image segmentation requires producing a pixel level segmentation mask and this can become a multi-class problem. Instead, using the approach from [2- Mask R-CNN] this approach is converted into a multiple binary segmentation problem. A separate segmentation mask is predicted separately for each salient object and thus we get a binary segmentation problem. The binary segments are combined using an <math>\text{argmax}</math> operation where each pixel is assigned to the object containing the largest predicted probability.<br />
<br />
=== MaskRNN: Binary Segmentation Network ===<br />
<br />
[[File:MaskRNNDeepNet.jpg | 850px]]<br />
<br />
The above picture shows a single deep net employed for predicting the segment mask for one salient object in the video frame. The network consists of 2 networks: binary segmentation network and object localization network. The binary segmentation network is split into two streams: appearance and flow stream. The input of the appearance stream is the RGB frame at time t and the wrapped prediction of the binary segmentation mask from time <math>t-1</math>. The wrapping function uses the optical flow between frame <math>t-1</math> and frame <math>t</math> to generate a new binary segmentation mask for frame <math>t</math>. The input to the flow stream is the concatenation of the optical flow magnitude between frames <math>t-1</math> to <math>t</math> and frames <math>t</math> to <math>t+1</math> and the wrapped prediction of the segmentation mask from frame <math>t-1</math>. The magnitude of the optical flow is replicated into an RBG format before feeding it to the flow stream. The network architecture closely resembles a VGG-16 network without the pooling or fully connected layers at the end. The fully connected layers are replaced with convolutional and bilinear interpolation upsampling layers which are then linearly combined to form a feature representation that is the same size as the input image. This feature representation is then used to generate a binary segment mask. This technique is borrowed from the Fully Convolutional Network mentioned above. The output of the flow stream and the appearance stream is linearly combined and sigmoid function is applied to the result to generate a binary mask for ith object. All parts of the network are fully differentiable and thus it can be fully trained in every pass.<br />
<br />
=== MaskRNN: Object Localization Network: ===<br />
Using a similar technique to the Fast-RCNN method of object localization, where the region of interest (RoI) pooling of the features of the region proposals (i.e. the bounding box proposals here) is performed and passed through fully connected layers to perform regression, the Object localization network generates a bounding box of the salient object in the frame. This bounding box is enlarged by a factor of 1.25 and combined with the output of binary segmentation mask. Only the segment mask available in the bounding box is used for prediction and the pixels outside of the bounding box are marked as zero. MaskRNN uses the convolutional feature output of the appearance stream as the input to the RoI-pooling layer to generate the predicted bounding box. A pixel is classified as foreground if it is both predicted to be in the foreground by the binary segmentation net and within the enlarged estimated bounding box from the object localization net.<br />
<br />
=== Training and Finetuning ===<br />
For training the network depicted in Figure 1, backpropagation through time is used in order to preserve the recurrence relationship connecting the frames of the video sequence. Predictive performance is further improved by following the algorithm for semi-supervised setting for video object segmentation with fine-tuning achieved by using the first frame segmentation mask of the ground truth. In this way, the network is further optimized using the ground truth data.<br />
<br />
== MaskRNN: Implementation Details ==<br />
=== Offline Training ===<br />
The deep net is first trained offline on a set of static images. The ground truth is randomly perturbed locally to generate the imperfect mask from frame <math>t-1</math>. Two different networks are trained offline separately for DAVIS-2016 and DAVIS-2017 datasets for a fair evaluation of both datasets. After both the object localization net and binary segmentation networks have trained, the temporal information in the network is used to further improve the segmented prediction results. Because of GPU memory constraints, the RNN is only able to backpropagate the gradients back 7 frames and learn long-term temporal information. <br />
<br />
For optical flow, a pre-trained flowNet2.0 is used to compute the optical flow between frames. (A flowNet (Dosovitskiy 2015) is a deep neural network trained to predict optical flow. The simplest form of flowNet has an architecture consisting of two parts. The first part accepts the two images between which the optical flow is to be computed as input, as applies a sequence of convolution and max-pooling operations, as in a standard convolutional neural network. In the second part, repeated up-convolution operations are applied, increasing the dimensions of the feature-maps. Besides the output of the previous upconvolution, each upconvolution is also fed as input the output of the corresponding down-convolution from the first part of the network. Thus part of the architecture resembles that of a U-net (Ronneberger, 2015). The output of the network is the predicted optical flow. ) <br />
<br />
=== Online Finetuning ===<br />
The deep nets (without the RNN) are then fine-tuned during test time by online training the networks on the ground truth of the first frame and some augmentations of the first frame data. The learning rate is set to <math>10^{-5}</math> for online training for 200 iterations and the learning rate is gradually decayed over time. Data augmentation techniques similar to those in offline training, namely random resizing, rotating, cropping and flipping is applied. Also, it should be noted that the RNN is ''not'' employed during online finetuning since only a single frame of training data is available.<br />
<br />
== MaskRNN: Experimental Results ==<br />
=== Evaluation Metrics ===<br />
There are 3 different techniques for performance analysis for Video Object Segmentation techniques:<br />
<br />
1. Region Similarity (Jaccard Index): Region similarity or Intersection-over-union is used to capture precision of the area covered by the prediction segmentation mask compared to the ground truth segmentation mask. It calculates the average across all frames of the dataset. This is particularly challenging for small sized foreground objects.<br />
<br />
\begin{equation}<br />
IoU = \frac{|M \cap G|}{|M| + |G| - |M \cap G|} <br />
\label{equation:Jaccard}<br />
\end{equation}<br />
<br />
2. Contour Accuracy (F-score): This metric measures the accuracy in the boundary of the predicted segment mask and the ground truth segment mask, by calculating the the precision and the recall of the two sets of points on the contours of the ground truth segment and the output segment via a bipartite graph matching. It is a measure of accurate delineation of the foreground objects. <br />
<br />
[[File:Fscore.jpg | 200px|center]]<br />
<br />
3. Temporal Stability : This estimates the degree of deformation needed to transform the segmentation masks from one frame to the next and is measured by the dissimilarity of the set of points on the contours of the segmentation between two adjacent frames.<br />
<br />
Region similarity measures the true segmented area in the prediction, while Contour Accuracy measures the accuracy of the contours/segmented mask boundary.<br />
<br />
=== Ablation Study ===<br />
<br />
The ablation study summarized how the different components contributed to the algorithm evaluated on DAVIS-2016 and DAVIS-2017 datasets.<br />
<br />
[[File:MaskRNNTable2.jpg | 700px|center]]<br />
<br />
The above table presents the contribution of each component of the network to the final prediction score. Online fine-tuning improves the performance by a large margin, as the network becomes adjusted to the appearance of the specific object being tracked. Addition of RNN/Localization Net and FStream all seem to positively affect the performance of the deep net. The FStream provides information on motion boundaries which help in videos with cluttered backgrounds, the RNN provides more consistent segmentation masks over time. The localization net has a more ambiguous effect on the network; adding the bounding box regression loss decreases the performance of the segmentation net but applying the bounding box to restrict the segmentation mask improves the results over those achieved by only using the segmentation net. In other words, the localization net should only be used in conjunction with the segmentation net while the segmentation net can be used by itself.<br />
<br />
=== Quantitative Evaluation ===<br />
<br />
The authors use DAVIS-2016, DAVIS-2017 and Segtrack v2 to compare the performance of the proposed approach to other methods based on foreground-background video object segmentation and multiple instance-level video object segmentation.<br />
<br />
[[File:MaskRNNTable3.jpg | 700px]]<br />
<br />
The above table shows the results for contour accuracy mean and region similarity. The MaskRNN method seems to outperform all previously proposed methods. The performance gain is significant by employing a Recurrent Neural Network for learning recurrence relationship and using a object localization network to improve prediction results.<br />
<br />
The following table shows the improvements in the state of the art achieved by MaskRNN on the DAVIS-2017 and the SegTrack v2 dataset.<br />
<br />
[[File:MaskRNNTable4.jpg | 700px]]<br />
<br />
=== Qualitative Evaluation ===<br />
The authors showed example qualitative results from the DAVIS and Segtrack datasets. <br />
<br />
Below are some success cases of object segmentation under complex motion, cluttered background, and/or multiple object occlusion.<br />
<br />
[[File:maskrnn_example.png | 700px]]<br />
<br />
Below are a few failure cases. The authors explain two reasons for failure: a) when similar objects of interest are contained in the frame (left two images), and b) when there are large variations in scale and viewpoint (right two images).<br />
<br />
[[File:maskrnn_example_fail.png | 700px]]<br />
<br />
== Conclusion ==<br />
In this paper a novel approach to instance level video object segmentation task is presented which performs better than current state of the art. The long-term recurrence relationship is learnt using an RNN. The object localization network is added to improve accuracy of the system. Due to the recurrent component and the combination of segmentation and localization nets, the approach takes advantage of the long-term temporal information and the location prior to improve the results. Using online fine-tuning the network is adjusted to predict better for the current video sequence.<br />
<br />
== Critique ==<br />
The paper provides a technique to track multiple objects in a video. The novelty is to add back-propagation through time to improve the tracking accuracy and using a localization network to remove any outliers in the segmented binary mask. However, the network architecture it too large and isn't able to run in real-time. There are N deep-Nets for N objects and each deep-Net contains 2 parallel VGG-16 convolutional networks.<br />
<br />
== Implementation ==<br />
<br />
The implementation of this paper was produced as part of the NIPS Paper Implementation Challenge. This implementation can be found at the following open source project: https://github.com/philferriere/tfvos.<br />
<br />
== References ==<br />
# Dosovitskiy, Alexey, et al. "Flownet: Learning optical flow with convolutional networks." Proceedings of the IEEE International Conference on Computer Vision. 2015.<br />
# Hu, Y., Huang, J., & Schwing, A. "MaskRNN: Instance level video object segmentation". Conference on Neural Information Processing Systems (NIPS). 2017<br />
# Ferriere, P. (n.d.). Semi-Supervised Video Object Segmentation (VOS) with Tensorflow. Retrieved March 20, 2018, from https://github.com/philferriere/tfvos<br />
# Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. "U-net: Convolutional networks for biomedical image segmentation." International Conference on Medical image computing and computer-assisted intervention. Springer, Cham, 2015.<br />
# Lee, Yong Jae, Jaechul Kim, and Kristen Grauman. "Key-segments for video object segmentation." Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011.<br />
# Grundmann, Matthias, et al. "Efficient hierarchical graph-based video segmentation." Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010.<br />
# Li, Fuxin, et al. "Video segmentation by tracking many figure-ground segments." Computer Vision (ICCV), 2013 IEEE International Conference on. IEEE, 2013.<br />
# Tsai, David, et al. "Motion coherent tracking using multi-label MRF optimization." International journal of computer vision 100.2 (2012): 190-202.</div>F7xiahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/MaskRNN:_Instance_Level_Video_Object_Segmentation&diff=36462stat946w18/MaskRNN: Instance Level Video Object Segmentation2018-04-21T04:13:33Z<p>F7xia: /* Ablation Study */</p>
<hr />
<div>== Introduction ==<br />
Deep Learning has produced state of the art results in many computer vision tasks like image classification, object localization, object detection, object segmentation, semantic segmentation and instance level video object segmentation. Image classification classify the image based on the prominent objects. Object localization is the task of finding objects’ location in the frame. Object Segmentation task involves providing a pixel map which represents the pixel wise location of the objects in the image. Semantic segmentation task attempts at segmenting the image into meaningful parts. Instance level video object segmentation is the task of consistent object segmentation in video sequences. Deforming shapes, fast movements, and occlusion from multiple objects, are just some of the significant challenges in instance level video object segmentation.<br />
<br />
There are 2 different types of video object segmentation, Unsupervised and Semi-supervised. <br />
* In unsupervised video object segmentation, the task is to find the salient objects and track the main objects in the video. <br />
* In a semi-supervised setting, the ground truth mask of the salient objects is provided for the first frame. The task is thus simplified to only track the objects required. <br />
<br />
In this paper, the authors look at an unsupervised video object segmentation technique.<br />
<br />
== Background Papers ==<br />
Video object segmentation has been performed using spatio-temporal graphs [[https://pdfs.semanticscholar.org/7221/c3470fa89879aab3ef270570ced15cde28de.pdf 5], [http://ieeexplore.ieee.org/abstract/document/5539893/ 6], [http://openaccess.thecvf.com/content_iccv_2013/papers/Li_Video_Segmentation_by_2013_ICCV_paper.pdf 7], [https://link.springer.com/content/pdf/10.1007/s11263-011-0512-5.pdf 8]] and deep learning. The graph based methods construct 3D spatio-temporal graphs in order to model the inter and the intra-frame relationship of pixels or superpixels in a video. Hence they are computationally slower than deep learning methods and are unable to run at real-time. There are 2 main deep learning techniques for semi-supervised video object segmentation: One Shot Video Object Segmentation (OSVOS) and Learning Video Object Segmentation from Static Images (MaskTrack). Following is a brief description of the new techniques introduced by these papers for semi-supervised video object segmentation task.<br />
<br />
=== OSVOS (One-Shot Video Object Segmentation) ===<br />
<br />
[[File:OSVOS.jpg | 1000px]]<br />
<br />
This paper introduces the technique of using a frame-by-frame object segmentation without any temporal information from the previous frames of the video. The paper uses a VGG-16 network with pre-trained weights from image classification task. This network is then converted into a fully-connected network (FCN) by removing the fully connected dense layers at the end and adding convolution layers to generate a segment mask of the input. This network is then trained on the DAVIS 2016 dataset.<br />
<br />
During testing, the trained VGG-16 FCN is fine-tuned using the first frame of the video using the ground truth. Because this is a semi-supervised case, the segmented mask (ground truth) for the first frame is available. The first frame data is augmented by zooming/rotating/flipping the first frame and the associated segment mask.<br />
<br />
=== MaskTrack (Learning Video Object Segmentation from Static Images) ===<br />
<br />
[[File:MaskTrack.jpg | 500px]]<br />
<br />
MaskTrack takes the output of the previous frame to improve its predictions and to generate the segmentation mask for the next frame. Thus the input to the network is 4 channel wide (3 RGB channels from the frame at time <math>t</math> plus one binary segmentation mask from frame <math>t-1</math>). The output of the network is the binary segmentation mask for frame at time <math>t</math>. Using the binary segmentation mask (referred to as guided object segmentation in the paper), the network is able to use some temporal information from the previous frame to improve its segmentation mask prediction for the next frame.<br />
<br />
The model of the MaskTrack network is similar to a modular VGG-16 and is referred to as MaskTrack ConvNet in the paper. The network is trained offline on saliency segmentation datasets: ECSSD, MSRA 10K, SOD and PASCAL-S. The input mask for the binary segmentation mask channel is generated via non-rigid deformation and affine transformation of the ground truth segmentation mask. Similar data-augmentation techniques are also used during online training. Just like OSVOS, MaskTrack uses the first frame as ground truth (with augmented images) to fine-tune the network to improve prediction score for the particular video sequence.<br />
<br />
A parallel ConvNet network is used to generate a predicted segment mask based on the optical flow magnitude. The optical flow between 2 frames is calculated using the EpicFlow algorithm. The output of the two networks is combined using an averaging operation to generate the final predicted segmented mask.<br />
<br />
Table 1 below gives a summary comparison of the different state of the art algorithms. The noteworthy information included in this table is that the technique presented in this paper is the only one which takes into account long-term temporal information. This is accomplished with a recurrent neural net. Furthermore, the bounding box is also estimated instead of just a segmentation mask. The authors claim that this allows the incorporation of a location prior from the tracked object.<br />
<br />
[[File:Paper19-SegmentationComp.png]]<br />
<br />
== Dataset ==<br />
The three major datasets used in this paper are DAVIS-2016, DAVIS-2017 and Segtrack v2. DAVIS-2016 dataset provides video sequences with only one segment mask for all salient objects. DAVIS-2017 improves the ground truth data by providing segmentation mask for each salient object as a separate color segment mask. Segtrack v2 also provides multiple segmentation mask for all salient objects in the video sequence. These datasets try to recreate real-life scenarios like occlusions, low resolution videos, background clutter, motion blur, fast motion etc.<br />
<br />
== MaskRNN: Introduction ==<br />
Most techniques mentioned above don’t work directly on instance level segmentation of the objects through the video sequence. The above approaches focus on image segmentation on each frame and using additional information (mask propagation and optical flow) from the preceding frame perform predictions for the current frame. To address the instance level segmentation problem, MaskRNN proposes a framework where the salient objects are tracked and segmented by capturing the temporal information in the video sequence using a recurrent neural network.<br />
<br />
== MaskRNN: Overview ==<br />
In a video sequence <math>I = \{I_1, I_2, …, I_T\}</math>, the sequence of <math>T</math> frames are given as input to the network, where the video sequence contains <math>N</math> salient objects. The ground truth for the first frame <math>y_1^*</math> is also provided for <math>N</math> salient objects.<br />
In this paper, the problem is formulated as a time dependency problem and using a recurrent neural network, the prediction of the previous frame influences the prediction of the next frame. The approach also computes the optical flow between frames (optical flow is the apparent motion of objects between two consecutive frames in the form of a 2D vector field representing the displacement in brightness patterns for each pixel, apparent because it depends on the relative motion between the observer and the scene) and uses that as the input to the neural network. The optical flow is also used to align the output of the predicted mask. “The warped prediction, the optical flow itself, and the appearance of the current frame are then used as input for <math>N</math> deep nets, one for each of the <math>N</math> objects.”[1 - MaskRNN] Each deep net is a made of an object localization network and a binary segmentation network. The binary segmentation network is used to generate the segmentation mask for an object. The object localization network is used to alleviate outliers from the predictions. The final prediction of the segmentation mask is generated by merging the predictions of the 2 networks. For <math>N</math> objects, there are N deep nets which predict the mask for each salient object. The predictions are then merged into a single prediction using an <math>\text{argmax}</math> operation at test time.<br />
<br />
== MaskRNN: Multiple Instance Level Segmentation ==<br />
<br />
[[File:2ObjectSeg.jpg | 850px]]<br />
<br />
Image segmentation requires producing a pixel level segmentation mask and this can become a multi-class problem. Instead, using the approach from [2- Mask R-CNN] this approach is converted into a multiple binary segmentation problem. A separate segmentation mask is predicted separately for each salient object and thus we get a binary segmentation problem. The binary segments are combined using an <math>\text{argmax}</math> operation where each pixel is assigned to the object containing the largest predicted probability.<br />
<br />
=== MaskRNN: Binary Segmentation Network ===<br />
<br />
[[File:MaskRNNDeepNet.jpg | 850px]]<br />
<br />
The above picture shows a single deep net employed for predicting the segment mask for one salient object in the video frame. The network consists of 2 networks: binary segmentation network and object localization network. The binary segmentation network is split into two streams: appearance and flow stream. The input of the appearance stream is the RGB frame at time t and the wrapped prediction of the binary segmentation mask from time <math>t-1</math>. The wrapping function uses the optical flow between frame <math>t-1</math> and frame <math>t</math> to generate a new binary segmentation mask for frame <math>t</math>. The input to the flow stream is the concatenation of the optical flow magnitude between frames <math>t-1</math> to <math>t</math> and frames <math>t</math> to <math>t+1</math> and the wrapped prediction of the segmentation mask from frame <math>t-1</math>. The magnitude of the optical flow is replicated into an RBG format before feeding it to the flow stream. The network architecture closely resembles a VGG-16 network without the pooling or fully connected layers at the end. The fully connected layers are replaced with convolutional and bilinear interpolation upsampling layers which are then linearly combined to form a feature representation that is the same size of the input image. This feature representation is then used to generate a binary segment mask. This technique is borrowed from the Fully Convolutional Network mentioned above. The output of the flow stream and the appearance stream is linearly combined and sigmoid function is applied to the result to generate binary mask for ith object. All parts of the network are fully differentiable and thus it can be fully trained in every pass.<br />
<br />
=== MaskRNN: Object Localization Network: ===<br />
Using a similar technique to the Fast-RCNN method of object localization, where the region of interest (RoI) pooling of the features of the region proposals (i.e. the bounding box proposals here) is performed and passed through fully connected layers to perform regression, the Object localization network generates a bounding box of the salient object in the frame. This bounding box is enlarged by a factor of 1.25 and combined with the output of binary segmentation mask. Only the segment mask available in the bounding box is used for prediction and the pixels outside of the bounding box are marked as zero. MaskRNN uses the convolutional feature output of the appearance stream as the input to the RoI-pooling layer to generate the predicted bounding box. A pixel is classified as foreground if it is both predicted to be in the foreground by the binary segmentation net and within the enlarged estimated bounding box from the object localization net.<br />
<br />
=== Training and Finetuning ===<br />
For training the network depicted in Figure 1, backpropagation through time is used in order to preserve the recurrence relationship connecting the frames of the video sequence. Predictive performance is further improved by following the algorithm for semi-supervised setting for video object segmentation with fine-tuning achieved by using the first frame segmentation mask of the ground truth. In this way, the network is further optimized using the ground truth data.<br />
<br />
== MaskRNN: Implementation Details ==<br />
=== Offline Training ===<br />
The deep net is first trained offline on a set of static images. The ground truth is randomly perturbed locally to generate the imperfect mask from frame <math>t-1</math>. Two different networks are trained offline separately for DAVIS-2016 and DAVIS-2017 datasets for a fair evaluation of both datasets. After both the object localization net and binary segmentation networks have trained, the temporal information in the network is used to further improve the segmented prediction results. Because of GPU memory constraints, the RNN is only able to backpropagate the gradients back 7 frames and learn long-term temporal information. <br />
<br />
For optical flow, a pre-trained flowNet2.0 is used to compute the optical flow between frames. (A flowNet (Dosovitskiy 2015) is a deep neural network trained to predict optical flow. The simplest form of flowNet has an architecture consisting of two parts. The first part accepts the two images between which the optical flow is to be computed as input, as applies a sequence of convolution and max-pooling operations, as in a standard convolutional neural network. In the second part, repeated up-convolution operations are applied, increasing the dimensions of the feature-maps. Besides the output of the previous upconvolution, each upconvolution is also fed as input the output of the corresponding down-convolution from the first part of the network. Thus part of the architecture resembles that of a U-net (Ronneberger, 2015). The output of the network is the predicted optical flow. ) <br />
<br />
=== Online Finetuning ===<br />
The deep nets (without the RNN) are then fine-tuned during test time by online training the networks on the ground truth of the first frame and some augmentations of the first frame data. The learning rate is set to <math>10^{-5}</math> for online training for 200 iterations and the learning rate is gradually decayed over time. Data augmentation techniques similar to those in offline training, namely random resizing, rotating, cropping and flipping is applied. Also, it should be noted that the RNN is ''not'' employed during online finetuning since only a single frame of training data is available.<br />
<br />
== MaskRNN: Experimental Results ==<br />
=== Evaluation Metrics ===<br />
There are 3 different techniques for performance analysis for Video Object Segmentation techniques:<br />
<br />
1. Region Similarity (Jaccard Index): Region similarity or Intersection-over-union is used to capture precision of the area covered by the prediction segmentation mask compared to the ground truth segmentation mask. It calculates the average across all frames of the dataset. This is particularly challenging for small sized foreground objects.<br />
<br />
\begin{equation}<br />
IoU = \frac{|M \cap G|}{|M| + |G| - |M \cap G|} <br />
\label{equation:Jaccard}<br />
\end{equation}<br />
<br />
2. Contour Accuracy (F-score): This metric measures the accuracy in the boundary of the predicted segment mask and the ground truth segment mask, by calculating the the precision and the recall of the two sets of points on the contours of the ground truth segment and the output segment via a bipartite graph matching. It is a measure of accurate delineation of the foreground objects. <br />
<br />
[[File:Fscore.jpg | 200px|center]]<br />
<br />
3. Temporal Stability : This estimates the degree of deformation needed to transform the segmentation masks from one frame to the next and is measured by the dissimilarity of the set of points on the contours of the segmentation between two adjacent frames.<br />
<br />
Region similarity measures the true segmented area in the prediction, while Contour Accuracy measures the accuracy of the contours/segmented mask boundary.<br />
<br />
=== Ablation Study ===<br />
<br />
The ablation study summarized how the different components contributed to the algorithm evaluated on DAVIS-2016 and DAVIS-2017 datasets.<br />
<br />
[[File:MaskRNNTable2.jpg | 700px|center]]<br />
<br />
The above table presents the contribution of each component of the network to the final prediction score. Online fine-tuning improves the performance by a large margin, as the network becomes adjusted to the appearance of the specific object being tracked. Addition of RNN/Localization Net and FStream all seem to positively affect the performance of the deep net. The FStream provides information on motion boundaries which help in videos with cluttered backgrounds, the RNN provides more consistent segmentation masks over time. The localization net has a more ambiguous effect on the network; adding the bounding box regression loss decreases the performance of the segmentation net but applying the bounding box to restrict the segmentation mask improves the results over those achieved by only using the segmentation net. In other words, the localization net should only be used in conjunction with the segmentation net while the segmentation net can be used by itself.<br />
<br />
=== Quantitative Evaluation ===<br />
<br />
The authors use DAVIS-2016, DAVIS-2017 and Segtrack v2 to compare the performance of the proposed approach to other methods based on foreground-background video object segmentation and multiple instance-level video object segmentation.<br />
<br />
[[File:MaskRNNTable3.jpg | 700px]]<br />
<br />
The above table shows the results for contour accuracy mean and region similarity. The MaskRNN method seems to outperform all previously proposed methods. The performance gain is significant by employing a Recurrent Neural Network for learning recurrence relationship and using a object localization network to improve prediction results.<br />
<br />
The following table shows the improvements in the state of the art achieved by MaskRNN on the DAVIS-2017 and the SegTrack v2 dataset.<br />
<br />
[[File:MaskRNNTable4.jpg | 700px]]<br />
<br />
=== Qualitative Evaluation ===<br />
The authors showed example qualitative results from the DAVIS and Segtrack datasets. <br />
<br />
Below are some success cases of object segmentation under complex motion, cluttered background, and/or multiple object occlusion.<br />
<br />
[[File:maskrnn_example.png | 700px]]<br />
<br />
Below are a few failure cases. The authors explain two reasons for failure: a) when similar objects of interest are contained in the frame (left two images), and b) when there are large variations in scale and viewpoint (right two images).<br />
<br />
[[File:maskrnn_example_fail.png | 700px]]<br />
<br />
== Conclusion ==<br />
In this paper a novel approach to instance level video object segmentation task is presented which performs better than current state of the art. The long-term recurrence relationship is learnt using an RNN. The object localization network is added to improve accuracy of the system. Due to the recurrent component and the combination of segmentation and localization nets, the approach takes advantage of the long-term temporal information and the location prior to improve the results. Using online fine-tuning the network is adjusted to predict better for the current video sequence.<br />
<br />
== Critique ==<br />
The paper provides a technique to track multiple objects in a video. The novelty is to add back-propagation through time to improve the tracking accuracy and using a localization network to remove any outliers in the segmented binary mask. However, the network architecture it too large and isn't able to run in real-time. There are N deep-Nets for N objects and each deep-Net contains 2 parallel VGG-16 convolutional networks.<br />
<br />
== Implementation ==<br />
<br />
The implementation of this paper was produced as part of the NIPS Paper Implementation Challenge. This implementation can be found at the following open source project: https://github.com/philferriere/tfvos.<br />
<br />
== References ==<br />
# Dosovitskiy, Alexey, et al. "Flownet: Learning optical flow with convolutional networks." Proceedings of the IEEE International Conference on Computer Vision. 2015.<br />
# Hu, Y., Huang, J., & Schwing, A. "MaskRNN: Instance level video object segmentation". Conference on Neural Information Processing Systems (NIPS). 2017<br />
# Ferriere, P. (n.d.). Semi-Supervised Video Object Segmentation (VOS) with Tensorflow. Retrieved March 20, 2018, from https://github.com/philferriere/tfvos<br />
# Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. "U-net: Convolutional networks for biomedical image segmentation." International Conference on Medical image computing and computer-assisted intervention. Springer, Cham, 2015.<br />
# Lee, Yong Jae, Jaechul Kim, and Kristen Grauman. "Key-segments for video object segmentation." Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011.<br />
# Grundmann, Matthias, et al. "Efficient hierarchical graph-based video segmentation." Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010.<br />
# Li, Fuxin, et al. "Video segmentation by tracking many figure-ground segments." Computer Vision (ICCV), 2013 IEEE International Conference on. IEEE, 2013.<br />
# Tsai, David, et al. "Motion coherent tracking using multi-label MRF optimization." International journal of computer vision 100.2 (2012): 190-202.</div>F7xiahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Continuous_Adaptation_via_Meta-Learning_in_Nonstationary_and_Competitive_Environments&diff=36461Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments2018-04-21T04:12:11Z<p>F7xia: /* Model Agnostic Meta-Learning */</p>
<hr />
<div>= Introduction =<br />
<br />
Typically, the basic goal of machine learning is to train a model to perform a task. In meta-learning, the goal is to train a model to perform the task of training a model to perform a task. Hence, in this case, the term "meta-Learning" has the exact meaning one would expect; the word "meta" has the precise function of introducing a layer of abstraction.<br />
<br />
The meta-learning task can be made more concrete by a simple example. Consider the CIFAR-100 classification task that we used for our data competition. We can alter this task from being a 100-class classification problem to a collection of 100 binary classification problems. The goal of meta-learning here is to design and train a single binary classifier for each class that will perform well on a randomly sampled task given a limited amount of training data for that specific task. In other words, we would like to train a model to perform the following procedure:<br />
<br />
# A task is sampled. The task is "Is X a dog?".<br />
# A small set of labeled training data is provided to the model. Each label is a boolean representing whether or not the corresponding image is a picture of a dog.<br />
# The model uses the training data to adjust itself to the specific task of checking whether or not an image is a picture of a dog.<br />
<br />
This example also highlights the intuition that the skill of sight is distinct and separable from the skill of knowing what a dog looks like.<br />
<br />
In this paper, a probabilistic framework for meta-learning is derived, then applied to tasks involving simulated robotic spiders. This framework generalizes the typical machine learning setup using Markov Decision Processes. This paper focuses on a multi-agent non-stationary environment which requires reinforcement learning (RL) agents to do a continuous adaptation in such an environment. Non-stationarity breaks the standard assumptions and requires agents to continuously adapt, both at training and execution time, in order to earn more rewards, hence the approach is to break this into a sequence of stationary tasks and present it as a multi-task learning problem.<br />
<br />
[[File:paper19_fig1.png|600px|frame|none|alt=Alt text| '''Figure 1'''. a) Illustrates a probabilistic model for Model Agnostic Meta-Learning (MAML) in a multi-task RL setting, where the tasks <math>T</math>, policies <math>\pi</math>, and trajectories <math>\tau</math> are all random variables with dependencies encoded in the edges of a given graph. b) The proposed extension to MAML suitable for continuous adaptation to a task changing dynamically due to non-stationarity of the environment. The distribution of tasks is represented by a Markov chain, whereby policies from a previous step are used to construct a new policy for the current step. c) The computation graph for the meta-update from <math>\phi_i</math> to <math>\phi_{i+1}</math>. Boxes represent replicas of the policy graphs with the specified parameters. The model is optimized using truncated backpropagation through time starting from <math>L_{T_{i+1}}</math>.]]<br />
<br />
= Background =<br />
== Markov Decision Process (MDP) ==<br />
A MDP is defined by the tuple <math>(S,A,P,r,\gamma)</math>, where S is a set of states, A is a set of actions, P is the transition probability distribution, r is the reward function, and <math>\gamma</math> is the discount factor. More information ([https://www.cs.cmu.edu/~katef/DeepRLControlCourse/lectures/lecture2_mdps.pdf here]).<br />
<br />
<br />
= Model Agnostic Meta-Learning =<br />
<br />
An initial framework for meta-learning is given in "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks" (Finn et al, 2017):<br />
<br />
"In our approach, the parameters of<br />
the model are explicitly trained such that a small<br />
number of gradient steps with a small amount<br />
of training data from a new task will produce<br />
good generalization performance on that task" (Finn et al, 2017).<br />
<br />
[[File:MAML.png | 500px]]<br />
<br />
Where <math> \theta </math> denotes the meta learning model parameter, and <math> \theta_i' </math> denotes the model parameter after adapting to Task <math> T_i </math><br />
<br />
In this training algorithm, the parameter vector <math>\theta</math> belonging to the model <math>f_{\theta}</math> is trained such that the meta-objective function <math>\mathcal{L} (\theta) = \sum_{\tau_i \sim P(\tau)} \mathcal{L}_{\tau_i} (f_{\theta_i' }) </math> is minimized. The sum in the objective function is over a sampled batch of training tasks. <math>\mathcal{L}_{\tau_i} (f_{\theta_i'})</math> is the training loss function corresponding to the <math>i^{th}</math> task in the batch evaluated at the model <math>f_{\theta_i'}</math>. The parameter vector <math>\theta_i'</math> is obtained by updating the general parameter <math>\theta</math> using the loss function <math>\mathcal{L}_{\tau_i}</math> and set of K training examples specific to the <math>i^{th}</math> task. Note that in alternate versions of this algorithm, additional testing sets are sampled from <math>\tau_i</math> and used to update <math>\theta</math> using testing loss functions instead of training loss functions.<br />
<br />
One of the important difference between this algorithm and more typical fine-tuning methods is that <math>\theta</math> is explicitly trained to be easily adjusted to perform well on different tasks rather than perform well on any specific tasks then fine tuned as the environment changes. (Sutton et al., 2007). In essence, the model is trained so that gradient steps are highly productive at adapting the model parameters to a new enviroment.<br />
<br />
= Probabilistic Framework for Meta-Learning =<br />
<br />
This paper puts the meta-learning problem into a Markov Decision Process (MDP) framework common to RL, see Figure 1a. Instead of training examples <math>\{(x, y)\}</math>, we have trajectories <math>\tau = (x_0, a_1, x_1, R_1, a_2,x_2,R_2, ... a_H, x_H, R_H)</math>. A trajectory is sequence of states/observations <math>x_t</math>, actions <math>a_t</math> and rewards <math>R_t</math> that is sampled from a task <math> T </math> according to a policy <math>\pi_{\theta}</math>. Included with said task is a method for assigning loss values to trajectories <math>L_T(\tau)</math> which is typically the negative cumulative reward. A policy is a deterministic function that takes in a state and returns an action. Our goal here is to train a policy <math>\pi_{\theta}</math> with parameter vector <math>\theta</math>. This is analougous to training a function <math>f_{\theta}</math> that assigns labels <math>y</math> to feature vectors <math>x</math>. More precisely we have the following definitions:<br />
<br />
* <math>\tau = (x_0, a_1, x_1, R_1, x_2, ... a_H, x_H, R_H)</math> trajectories.<br />
* <math>T :=(L_T, P_T(x), P_T(x_t | x_{t-1}, a_{t-1}), H )</math> (A Task)<br />
* <math>D(T)</math> : A distribution over tasks.<br />
* <math>L_T</math>: A loss function for the task T that assigns numeric loss values to trajectories.<br />
* <math>P_T(x), P_T(x_t | x_{t-1}, a_{t-1})</math>: Probability measures specifying the markovian dynamics of the observations <math>x_t</math><br />
* <math>H</math>: The horizon of the MDP. This is a fixed natural number specifying the lengths of the tasks trajectories.<br />
<br />
The paper goes further to define a Markov dynamic for sequences of tasks as shown in Figure 1b. Thus the policy that we would like to meta learn <math>\pi_{\theta}</math>, after being exposed to a sample of K trajectories <math>\tau_\theta^{1:K}</math> from the task <math>T_i</math>, should produce a new policy <math>\pi_{\phi}</math> that will perform well on the next task <math>T_{i+1}</math>. Thus we seek to minimize the following expectation:<br />
<br />
<math>\mathrm{E}_{P(T_0), P(T_{i+1} | T_i)}\bigg(\sum_{i=1}^{l} \mathcal{L}_{T_i, T_{i+1}}(\theta)\bigg)</math>, <br />
<br />
where <math>\mathcal{L}_{T_i, T_{i + 1}}(\theta) := \mathrm{E}_{\tau_{i, \theta}^{1:K} } \bigg( \mathrm{E}_{\tau_{i+1, \phi}}\Big( L_{T_{i+1}}(\tau_{i+1, \phi} | \tau_{i, \theta}^{1:K}, \theta) \Big) \bigg) </math> and <math>l</math> is the number of tasks.<br />
<br />
The meta-policy <math>\pi_{\theta}</math> is trained and then adapted at test time using the following procedures. The computational graph is given in Figure 1c.<br />
<br />
[[File:MAML2.png | 800px]]<br />
<br />
The mathematics of calculating loss gradients is omitted.<br />
<br />
= Training Spiders to Run with Dynamic Handicaps (Robotic Locomotion in Non-Stationary Environments) =<br />
<br />
The authors used the MuJoCo physics simulator to create a simulated environment where robotic spiders with 6 legs are faced with the task of running due east as quickly as possible. The robotic spider observes the location and velocity of its body, and the angles and velocities of its legs. It interacts with the environment by exerting torque on the joints of its legs. Each leg has two joints, the joint closer to the body rotates horizontally while the joint farther from the body rotates vertically. The environment is made non-stationary by gradually paralyzing two legs of the spider across training and testing episodes. This allows the agent to adapt to new environments in each episode.<br />
Putting this example into the above probabilistic framework yields:<br />
<br />
* <math>T_i</math>: The task of walking east with the torques of two legs scaled by <math> (i-1)/6 </math><br />
* <math>\{T_i\}_{i=1}^{7}</math>: A sequence of tasks with the same two legs handicapped in each task. Note there are 15 different ways to choose such legs resulting in 15 sequences of tasks. 12 are used for training and 3 for testing.<br />
* A Markov Descision process composed of<br />
** Observations <math> x_t </math> containing information about the state of the spider.<br />
** Actions <math> a_t </math> containing information about the torques to apply to the spiders legs.<br />
** Rewards <math> R_t </math> corresponding to the speed at which the spider is moving east.<br />
<br />
Three differently structured policy neural networks are trained in this set up using both meta-learning and three different previously developed adaption methods.<br />
<br />
At testing time, the spiders following meta learned policies initially perform worse than the spiders using non-adaptive policies. However, by the third episode (<math> i=3 </math>), the meta-learners perform on par. And by the sixth episode, when the selected legs are mostly immobile, the meta-learners significantly out perform. These results can be seen in the graphs below.<br />
<br />
[[File:locomotion_results.png | 800px]]<br />
<br />
= Training Spiders to Fight Each Other (Adversarial Meta-Learning) =<br />
<br />
The authors created an adversarial environment called RoboSumo where pairs of agents with 4 (named Ants), 6 (named Bugs),or 8 legs (named spiders) sumo wrestle. The agents observe the location and velocity of their bodies and the bodies of their opponent, the angles and velocities of their legs, and the forces being exerted on them by their opponent (equivalent of tactile sense). The game is organized into episodes and rounds. Episodes are single wrestling matches with 500 timesteps and win/lose/draw outcomes. Agents win by pushing their opponent out of the ring or making their opponent's body touch the ground. An episode results in a draw when neither of these things happen after 500 timesteps. Rounds are batches of episodes. Rounds have possible outcomes win, lose, and draw that are decided based on majority of episodes won. K rounds will be fought. Both agents may update their policies between rounds. The agent that wins the majority of rounds is deemed the winner of the game.<br />
<br />
== Setup ==<br />
Similar to the Robotic locomotion example, this game can be phrased in terms of the RL MDP framework.<br />
<br />
* <math>T_i</math>: The task of fighting a round.<br />
* <math>\{T_i\}_{i=1}^{K}</math>: A sequence of rounds against the same opponent. Note that the opponent may update their policy between rounds but the anatomy of both wrestlers will be constant across rounds.<br />
* A Markov Descision process composed of<br />
** A horizon <math>H = 500*n</math> where <math>n</math> is the number of episodes per round.<br />
** Observations <math> x_t </math> containing information about the state of the agent and its opponent.<br />
** Actions <math> a_t </math> containing information about the torques to apply to the agents legs.<br />
** Rewards <math> R_t </math> rewards given to the agent based on its wrestling performance. <math>R_{500*n} = </math> +2000 if win episode, -2000 if lose, and -1000 if draw. A discount factor of <math> \gamma = 0.995 </math> is applied to the rewards.<br />
<br />
Note that the above reward set up is quite sparse. Therefore in order to encourage fast training, rewards are introduced at every time step for the following.<br />
* For staying close to the center of the ring.<br />
* For exerting force on the opponents body.<br />
* For moving towards the opponent.<br />
* For the distance of the opponent to the center of the ring.<br />
<br />
In addition to the sparse win/lose rewards, the following dense rewards are also introduced in the early training stages to encourage faster learning:<br />
* Quickly push the opponent outside - penalty proportional to the distance of there opponent from the center of the ring.<br />
* Moving towards the opponent - reward proportional to the velocity component towards the opponent.<br />
* Hit the opponent - reward proportional to square root of the total forces exerted on the opponent.<br />
* Control penalty - penalty denoted by <math> l_2 </math> on actions which lead to jittery/unnatural movements.<br />
<br />
<br />
This makes sense intuitively as these are reasonable goals for agents to explore when they are learning to wrestle.<br />
<br />
== Training ==<br />
The same combinations of policy networks and adaptation methods that were used in the locomotion example are trained and tested here. A family of non-adaptive policies are first trained via self-play and saved at all stages. Self-play simply means the two agents in the training environment use the same policy. All policy versions are saved so that agents of various skill levels can be sampled when training meta-learners. The weights of the different insects were calibrated such that the test win rate between two insects of differing anatomy, who have been trained for the same number of epochs via self-play, is close to 50%.<br />
<br />
[[File:weight_cal.png | 800px]]<br />
<br />
We can see in the above figure that the weight of the spider had to be increased by almost four times in order for the agents to be evenly matched.<br />
<br />
[[File:robosumo_results.png | 800px]]<br />
<br />
The above figure shows testing results for various adaptation strategies. The agent and opponent both start with the self-trained policies. The opponent uses all of its testing experience to continue training. The agent uses only the last 75 episodes to adapt its policy network. This shows that metal learners need only a limited amount of experience in order to hold their own against a constantly improving opponent.<br />
<br />
== Evaluating Adaptation Strategies ==<br />
To compare the proposed strategy against existing strategies, the authors leveraged the Robosumo task to have the different adaptation strategies compete against each other. The scoring metric used is TrueSkill, a metric very similar to ELO. Each strategy starts out with a fixed amount of points. When two different strategies compete the winner gains points and the loser loses points. The number of points gained and lost depend on the different in TrueSkill. A lower ranked opponent defeating a high ranked opponent will gain much more points than a high ranked opponent defeating a lower ranked one. It is assumed that agents skill levels are initially normally distributed with mean 25, and variance 25/3. 1000 matches are generated with each match consisting of 100-round iterated adaptation games. TrueSkill scores are updated after every match. The distribution of TrueSkill after playing 1000 matches are shown below:<br />
<br />
[[File:trueskill_results.PNG | 800px]]<br />
<br />
In summary recurrent policies are dominant, and the proposed meta-update performs better than or at par with the other strategies.<br />
<br />
The authors also demonstrated the same result by starting with 1000 agents, uniformly distributed in terms of adaptation strategies, and randomly matching them against each other for several generations. The loser of each match is removed and replaced with a duplicate of the winner. After several generations, the total population consists mostly of the proposed strategy. This is shown in the figure below.<br />
<br />
[[File:natural_selection.png | 800px]]<br />
<br />
= Future Work =<br />
The authors noted that the meta-learning adaptation rule they proposed is similar to backpropagation through time with a unit time lag, so a potential area for future research would be to introduce fully-recurrent meta-updates based on the full interaction history with the environment. Secondly, the algorithm proposed involves computing second-order derivatives at training time (see Figure 1c), which resulted in much slower training processes compared to baseline models during experiments, so they suggested finding a method to utilize information from the loss function without explicit backpropagation to speed up computations. The authors also mention that their approach likely will not work well with sparse rewards. This is because the meta-updates, which use policy gradients, are very dependent on the reward signal. They mention that this is an issue they would like to address in the future. A potential solution they have outlined for this is to introduce auxiliary dense rewards which could enable meta-learning.<br />
<br />
= Sources =<br />
# Chelsea Finn, Pieter Abbeel, Sergey Levine. "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks." arXiv preprint arXiv:1703.03400v3 (2017).<br />
# Richard S Sutton, Anna Koop, and David Silver. On the role of tracking in stationary environments. In Proceedings of the 24th international conference on Machine learning, pp. 871–878. ACM, 2007.</div>F7xiahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Continuous_Adaptation_via_Meta-Learning_in_Nonstationary_and_Competitive_Environments&diff=36460Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments2018-04-21T04:11:48Z<p>F7xia: /* Model Agnostic Meta-Learning */</p>
<hr />
<div>= Introduction =<br />
<br />
Typically, the basic goal of machine learning is to train a model to perform a task. In meta-learning, the goal is to train a model to perform the task of training a model to perform a task. Hence, in this case, the term "meta-Learning" has the exact meaning one would expect; the word "meta" has the precise function of introducing a layer of abstraction.<br />
<br />
The meta-learning task can be made more concrete by a simple example. Consider the CIFAR-100 classification task that we used for our data competition. We can alter this task from being a 100-class classification problem to a collection of 100 binary classification problems. The goal of meta-learning here is to design and train a single binary classifier for each class that will perform well on a randomly sampled task given a limited amount of training data for that specific task. In other words, we would like to train a model to perform the following procedure:<br />
<br />
# A task is sampled. The task is "Is X a dog?".<br />
# A small set of labeled training data is provided to the model. Each label is a boolean representing whether or not the corresponding image is a picture of a dog.<br />
# The model uses the training data to adjust itself to the specific task of checking whether or not an image is a picture of a dog.<br />
<br />
This example also highlights the intuition that the skill of sight is distinct and separable from the skill of knowing what a dog looks like.<br />
<br />
In this paper, a probabilistic framework for meta-learning is derived, then applied to tasks involving simulated robotic spiders. This framework generalizes the typical machine learning setup using Markov Decision Processes. This paper focuses on a multi-agent non-stationary environment which requires reinforcement learning (RL) agents to do a continuous adaptation in such an environment. Non-stationarity breaks the standard assumptions and requires agents to continuously adapt, both at training and execution time, in order to earn more rewards, hence the approach is to break this into a sequence of stationary tasks and present it as a multi-task learning problem.<br />
<br />
[[File:paper19_fig1.png|600px|frame|none|alt=Alt text| '''Figure 1'''. a) Illustrates a probabilistic model for Model Agnostic Meta-Learning (MAML) in a multi-task RL setting, where the tasks <math>T</math>, policies <math>\pi</math>, and trajectories <math>\tau</math> are all random variables with dependencies encoded in the edges of a given graph. b) The proposed extension to MAML suitable for continuous adaptation to a task changing dynamically due to non-stationarity of the environment. The distribution of tasks is represented by a Markov chain, whereby policies from a previous step are used to construct a new policy for the current step. c) The computation graph for the meta-update from <math>\phi_i</math> to <math>\phi_{i+1}</math>. Boxes represent replicas of the policy graphs with the specified parameters. The model is optimized using truncated backpropagation through time starting from <math>L_{T_{i+1}}</math>.]]<br />
<br />
= Background =<br />
== Markov Decision Process (MDP) ==<br />
A MDP is defined by the tuple <math>(S,A,P,r,\gamma)</math>, where S is a set of states, A is a set of actions, P is the transition probability distribution, r is the reward function, and <math>\gamma</math> is the discount factor. More information ([https://www.cs.cmu.edu/~katef/DeepRLControlCourse/lectures/lecture2_mdps.pdf here]).<br />
<br />
<br />
= Model Agnostic Meta-Learning =<br />
<br />
An initial framework for meta-learning is given in "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks" (Finn et al, 2017):<br />
<br />
"In our approach, the parameters of<br />
the model are explicitly trained such that a small<br />
number of gradient steps with a small amount<br />
of training data from a new task will produce<br />
good generalization performance on that task" (Finn et al, 2017).<br />
<br />
[[File:MAML.png | 500px]]<br />
<br />
Where <math> \theta </math> denotes the meta learning model parameter, and <math> \theta_i' </math> denotes the model parameter after adapting to Task <math> \mathCal{T}_i </math><br />
<br />
In this training algorithm, the parameter vector <math>\theta</math> belonging to the model <math>f_{\theta}</math> is trained such that the meta-objective function <math>\mathcal{L} (\theta) = \sum_{\tau_i \sim P(\tau)} \mathcal{L}_{\tau_i} (f_{\theta_i' }) </math> is minimized. The sum in the objective function is over a sampled batch of training tasks. <math>\mathcal{L}_{\tau_i} (f_{\theta_i'})</math> is the training loss function corresponding to the <math>i^{th}</math> task in the batch evaluated at the model <math>f_{\theta_i'}</math>. The parameter vector <math>\theta_i'</math> is obtained by updating the general parameter <math>\theta</math> using the loss function <math>\mathcal{L}_{\tau_i}</math> and set of K training examples specific to the <math>i^{th}</math> task. Note that in alternate versions of this algorithm, additional testing sets are sampled from <math>\tau_i</math> and used to update <math>\theta</math> using testing loss functions instead of training loss functions.<br />
<br />
One of the important difference between this algorithm and more typical fine-tuning methods is that <math>\theta</math> is explicitly trained to be easily adjusted to perform well on different tasks rather than perform well on any specific tasks then fine tuned as the environment changes. (Sutton et al., 2007). In essence, the model is trained so that gradient steps are highly productive at adapting the model parameters to a new enviroment.<br />
<br />
= Probabilistic Framework for Meta-Learning =<br />
<br />
This paper puts the meta-learning problem into a Markov Decision Process (MDP) framework common to RL, see Figure 1a. Instead of training examples <math>\{(x, y)\}</math>, we have trajectories <math>\tau = (x_0, a_1, x_1, R_1, a_2,x_2,R_2, ... a_H, x_H, R_H)</math>. A trajectory is sequence of states/observations <math>x_t</math>, actions <math>a_t</math> and rewards <math>R_t</math> that is sampled from a task <math> T </math> according to a policy <math>\pi_{\theta}</math>. Included with said task is a method for assigning loss values to trajectories <math>L_T(\tau)</math> which is typically the negative cumulative reward. A policy is a deterministic function that takes in a state and returns an action. Our goal here is to train a policy <math>\pi_{\theta}</math> with parameter vector <math>\theta</math>. This is analougous to training a function <math>f_{\theta}</math> that assigns labels <math>y</math> to feature vectors <math>x</math>. More precisely we have the following definitions:<br />
<br />
* <math>\tau = (x_0, a_1, x_1, R_1, x_2, ... a_H, x_H, R_H)</math> trajectories.<br />
* <math>T :=(L_T, P_T(x), P_T(x_t | x_{t-1}, a_{t-1}), H )</math> (A Task)<br />
* <math>D(T)</math> : A distribution over tasks.<br />
* <math>L_T</math>: A loss function for the task T that assigns numeric loss values to trajectories.<br />
* <math>P_T(x), P_T(x_t | x_{t-1}, a_{t-1})</math>: Probability measures specifying the markovian dynamics of the observations <math>x_t</math><br />
* <math>H</math>: The horizon of the MDP. This is a fixed natural number specifying the lengths of the tasks trajectories.<br />
<br />
The paper goes further to define a Markov dynamic for sequences of tasks as shown in Figure 1b. Thus the policy that we would like to meta learn <math>\pi_{\theta}</math>, after being exposed to a sample of K trajectories <math>\tau_\theta^{1:K}</math> from the task <math>T_i</math>, should produce a new policy <math>\pi_{\phi}</math> that will perform well on the next task <math>T_{i+1}</math>. Thus we seek to minimize the following expectation:<br />
<br />
<math>\mathrm{E}_{P(T_0), P(T_{i+1} | T_i)}\bigg(\sum_{i=1}^{l} \mathcal{L}_{T_i, T_{i+1}}(\theta)\bigg)</math>, <br />
<br />
where <math>\mathcal{L}_{T_i, T_{i + 1}}(\theta) := \mathrm{E}_{\tau_{i, \theta}^{1:K} } \bigg( \mathrm{E}_{\tau_{i+1, \phi}}\Big( L_{T_{i+1}}(\tau_{i+1, \phi} | \tau_{i, \theta}^{1:K}, \theta) \Big) \bigg) </math> and <math>l</math> is the number of tasks.<br />
<br />
The meta-policy <math>\pi_{\theta}</math> is trained and then adapted at test time using the following procedures. The computational graph is given in Figure 1c.<br />
<br />
[[File:MAML2.png | 800px]]<br />
<br />
The mathematics of calculating loss gradients is omitted.<br />
<br />
= Training Spiders to Run with Dynamic Handicaps (Robotic Locomotion in Non-Stationary Environments) =<br />
<br />
The authors used the MuJoCo physics simulator to create a simulated environment where robotic spiders with 6 legs are faced with the task of running due east as quickly as possible. The robotic spider observes the location and velocity of its body, and the angles and velocities of its legs. It interacts with the environment by exerting torque on the joints of its legs. Each leg has two joints, the joint closer to the body rotates horizontally while the joint farther from the body rotates vertically. The environment is made non-stationary by gradually paralyzing two legs of the spider across training and testing episodes. This allows the agent to adapt to new environments in each episode.<br />
Putting this example into the above probabilistic framework yields:<br />
<br />
* <math>T_i</math>: The task of walking east with the torques of two legs scaled by <math> (i-1)/6 </math><br />
* <math>\{T_i\}_{i=1}^{7}</math>: A sequence of tasks with the same two legs handicapped in each task. Note there are 15 different ways to choose such legs resulting in 15 sequences of tasks. 12 are used for training and 3 for testing.<br />
* A Markov Descision process composed of<br />
** Observations <math> x_t </math> containing information about the state of the spider.<br />
** Actions <math> a_t </math> containing information about the torques to apply to the spiders legs.<br />
** Rewards <math> R_t </math> corresponding to the speed at which the spider is moving east.<br />
<br />
Three differently structured policy neural networks are trained in this set up using both meta-learning and three different previously developed adaption methods.<br />
<br />
At testing time, the spiders following meta learned policies initially perform worse than the spiders using non-adaptive policies. However, by the third episode (<math> i=3 </math>), the meta-learners perform on par. And by the sixth episode, when the selected legs are mostly immobile, the meta-learners significantly out perform. These results can be seen in the graphs below.<br />
<br />
[[File:locomotion_results.png | 800px]]<br />
<br />
= Training Spiders to Fight Each Other (Adversarial Meta-Learning) =<br />
<br />
The authors created an adversarial environment called RoboSumo where pairs of agents with 4 (named Ants), 6 (named Bugs),or 8 legs (named spiders) sumo wrestle. The agents observe the location and velocity of their bodies and the bodies of their opponent, the angles and velocities of their legs, and the forces being exerted on them by their opponent (equivalent of tactile sense). The game is organized into episodes and rounds. Episodes are single wrestling matches with 500 timesteps and win/lose/draw outcomes. Agents win by pushing their opponent out of the ring or making their opponent's body touch the ground. An episode results in a draw when neither of these things happen after 500 timesteps. Rounds are batches of episodes. Rounds have possible outcomes win, lose, and draw that are decided based on majority of episodes won. K rounds will be fought. Both agents may update their policies between rounds. The agent that wins the majority of rounds is deemed the winner of the game.<br />
<br />
== Setup ==<br />
Similar to the Robotic locomotion example, this game can be phrased in terms of the RL MDP framework.<br />
<br />
* <math>T_i</math>: The task of fighting a round.<br />
* <math>\{T_i\}_{i=1}^{K}</math>: A sequence of rounds against the same opponent. Note that the opponent may update their policy between rounds but the anatomy of both wrestlers will be constant across rounds.<br />
* A Markov Descision process composed of<br />
** A horizon <math>H = 500*n</math> where <math>n</math> is the number of episodes per round.<br />
** Observations <math> x_t </math> containing information about the state of the agent and its opponent.<br />
** Actions <math> a_t </math> containing information about the torques to apply to the agents legs.<br />
** Rewards <math> R_t </math> rewards given to the agent based on its wrestling performance. <math>R_{500*n} = </math> +2000 if win episode, -2000 if lose, and -1000 if draw. A discount factor of <math> \gamma = 0.995 </math> is applied to the rewards.<br />
<br />
Note that the above reward set up is quite sparse. Therefore in order to encourage fast training, rewards are introduced at every time step for the following.<br />
* For staying close to the center of the ring.<br />
* For exerting force on the opponents body.<br />
* For moving towards the opponent.<br />
* For the distance of the opponent to the center of the ring.<br />
<br />
In addition to the sparse win/lose rewards, the following dense rewards are also introduced in the early training stages to encourage faster learning:<br />
* Quickly push the opponent outside - penalty proportional to the distance of there opponent from the center of the ring.<br />
* Moving towards the opponent - reward proportional to the velocity component towards the opponent.<br />
* Hit the opponent - reward proportional to square root of the total forces exerted on the opponent.<br />
* Control penalty - penalty denoted by <math> l_2 </math> on actions which lead to jittery/unnatural movements.<br />
<br />
<br />
This makes sense intuitively as these are reasonable goals for agents to explore when they are learning to wrestle.<br />
<br />
== Training ==<br />
The same combinations of policy networks and adaptation methods that were used in the locomotion example are trained and tested here. A family of non-adaptive policies are first trained via self-play and saved at all stages. Self-play simply means the two agents in the training environment use the same policy. All policy versions are saved so that agents of various skill levels can be sampled when training meta-learners. The weights of the different insects were calibrated such that the test win rate between two insects of differing anatomy, who have been trained for the same number of epochs via self-play, is close to 50%.<br />
<br />
[[File:weight_cal.png | 800px]]<br />
<br />
We can see in the above figure that the weight of the spider had to be increased by almost four times in order for the agents to be evenly matched.<br />
<br />
[[File:robosumo_results.png | 800px]]<br />
<br />
The above figure shows testing results for various adaptation strategies. The agent and opponent both start with the self-trained policies. The opponent uses all of its testing experience to continue training. The agent uses only the last 75 episodes to adapt its policy network. This shows that metal learners need only a limited amount of experience in order to hold their own against a constantly improving opponent.<br />
<br />
== Evaluating Adaptation Strategies ==<br />
To compare the proposed strategy against existing strategies, the authors leveraged the Robosumo task to have the different adaptation strategies compete against each other. The scoring metric used is TrueSkill, a metric very similar to ELO. Each strategy starts out with a fixed amount of points. When two different strategies compete the winner gains points and the loser loses points. The number of points gained and lost depend on the different in TrueSkill. A lower ranked opponent defeating a high ranked opponent will gain much more points than a high ranked opponent defeating a lower ranked one. It is assumed that agents skill levels are initially normally distributed with mean 25, and variance 25/3. 1000 matches are generated with each match consisting of 100-round iterated adaptation games. TrueSkill scores are updated after every match. The distribution of TrueSkill after playing 1000 matches are shown below:<br />
<br />
[[File:trueskill_results.PNG | 800px]]<br />
<br />
In summary recurrent policies are dominant, and the proposed meta-update performs better than or at par with the other strategies.<br />
<br />
The authors also demonstrated the same result by starting with 1000 agents, uniformly distributed in terms of adaptation strategies, and randomly matching them against each other for several generations. The loser of each match is removed and replaced with a duplicate of the winner. After several generations, the total population consists mostly of the proposed strategy. This is shown in the figure below.<br />
<br />
[[File:natural_selection.png | 800px]]<br />
<br />
= Future Work =<br />
The authors noted that the meta-learning adaptation rule they proposed is similar to backpropagation through time with a unit time lag, so a potential area for future research would be to introduce fully-recurrent meta-updates based on the full interaction history with the environment. Secondly, the algorithm proposed involves computing second-order derivatives at training time (see Figure 1c), which resulted in much slower training processes compared to baseline models during experiments, so they suggested finding a method to utilize information from the loss function without explicit backpropagation to speed up computations. The authors also mention that their approach likely will not work well with sparse rewards. This is because the meta-updates, which use policy gradients, are very dependent on the reward signal. They mention that this is an issue they would like to address in the future. A potential solution they have outlined for this is to introduce auxiliary dense rewards which could enable meta-learning.<br />
<br />
= Sources =<br />
# Chelsea Finn, Pieter Abbeel, Sergey Levine. "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks." arXiv preprint arXiv:1703.03400v3 (2017).<br />
# Richard S Sutton, Anna Koop, and David Silver. On the role of tracking in stationary environments. In Proceedings of the 24th international conference on Machine learning, pp. 871–878. ACM, 2007.</div>F7xiahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Continuous_Adaptation_via_Meta-Learning_in_Nonstationary_and_Competitive_Environments&diff=36459Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments2018-04-21T04:10:38Z<p>F7xia: /* Model Agnostic Meta-Learning */</p>
<hr />
<div>= Introduction =<br />
<br />
Typically, the basic goal of machine learning is to train a model to perform a task. In meta-learning, the goal is to train a model to perform the task of training a model to perform a task. Hence, in this case, the term "meta-Learning" has the exact meaning one would expect; the word "meta" has the precise function of introducing a layer of abstraction.<br />
<br />
The meta-learning task can be made more concrete by a simple example. Consider the CIFAR-100 classification task that we used for our data competition. We can alter this task from being a 100-class classification problem to a collection of 100 binary classification problems. The goal of meta-learning here is to design and train a single binary classifier for each class that will perform well on a randomly sampled task given a limited amount of training data for that specific task. In other words, we would like to train a model to perform the following procedure:<br />
<br />
# A task is sampled. The task is "Is X a dog?".<br />
# A small set of labeled training data is provided to the model. Each label is a boolean representing whether or not the corresponding image is a picture of a dog.<br />
# The model uses the training data to adjust itself to the specific task of checking whether or not an image is a picture of a dog.<br />
<br />
This example also highlights the intuition that the skill of sight is distinct and separable from the skill of knowing what a dog looks like.<br />
<br />
In this paper, a probabilistic framework for meta-learning is derived, then applied to tasks involving simulated robotic spiders. This framework generalizes the typical machine learning setup using Markov Decision Processes. This paper focuses on a multi-agent non-stationary environment which requires reinforcement learning (RL) agents to do a continuous adaptation in such an environment. Non-stationarity breaks the standard assumptions and requires agents to continuously adapt, both at training and execution time, in order to earn more rewards, hence the approach is to break this into a sequence of stationary tasks and present it as a multi-task learning problem.<br />
<br />
[[File:paper19_fig1.png|600px|frame|none|alt=Alt text| '''Figure 1'''. a) Illustrates a probabilistic model for Model Agnostic Meta-Learning (MAML) in a multi-task RL setting, where the tasks <math>T</math>, policies <math>\pi</math>, and trajectories <math>\tau</math> are all random variables with dependencies encoded in the edges of a given graph. b) The proposed extension to MAML suitable for continuous adaptation to a task changing dynamically due to non-stationarity of the environment. The distribution of tasks is represented by a Markov chain, whereby policies from a previous step are used to construct a new policy for the current step. c) The computation graph for the meta-update from <math>\phi_i</math> to <math>\phi_{i+1}</math>. Boxes represent replicas of the policy graphs with the specified parameters. The model is optimized using truncated backpropagation through time starting from <math>L_{T_{i+1}}</math>.]]<br />
<br />
= Background =<br />
== Markov Decision Process (MDP) ==<br />
A MDP is defined by the tuple <math>(S,A,P,r,\gamma)</math>, where S is a set of states, A is a set of actions, P is the transition probability distribution, r is the reward function, and <math>\gamma</math> is the discount factor. More information ([https://www.cs.cmu.edu/~katef/DeepRLControlCourse/lectures/lecture2_mdps.pdf here]).<br />
<br />
<br />
= Model Agnostic Meta-Learning =<br />
<br />
An initial framework for meta-learning is given in "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks" (Finn et al, 2017):<br />
<br />
"In our approach, the parameters of<br />
the model are explicitly trained such that a small<br />
number of gradient steps with a small amount<br />
of training data from a new task will produce<br />
good generalization performance on that task" (Finn et al, 2017).<br />
<br />
[[File:MAML.png | 500px]]<br />
<br />
Where <math> \theta </math> denotes the meta learning model parameter, and <math> \theta_i </math> denotes the model parameter after adapting to Task <math>\mathCal{T}_i</math><br />
<br />
In this training algorithm, the parameter vector <math>\theta</math> belonging to the model <math>f_{\theta}</math> is trained such that the meta-objective function <math>\mathcal{L} (\theta) = \sum_{\tau_i \sim P(\tau)} \mathcal{L}_{\tau_i} (f_{\theta_i' }) </math> is minimized. The sum in the objective function is over a sampled batch of training tasks. <math>\mathcal{L}_{\tau_i} (f_{\theta_i'})</math> is the training loss function corresponding to the <math>i^{th}</math> task in the batch evaluated at the model <math>f_{\theta_i'}</math>. The parameter vector <math>\theta_i'</math> is obtained by updating the general parameter <math>\theta</math> using the loss function <math>\mathcal{L}_{\tau_i}</math> and set of K training examples specific to the <math>i^{th}</math> task. Note that in alternate versions of this algorithm, additional testing sets are sampled from <math>\tau_i</math> and used to update <math>\theta</math> using testing loss functions instead of training loss functions.<br />
<br />
One of the important difference between this algorithm and more typical fine-tuning methods is that <math>\theta</math> is explicitly trained to be easily adjusted to perform well on different tasks rather than perform well on any specific tasks then fine tuned as the environment changes. (Sutton et al., 2007). In essence, the model is trained so that gradient steps are highly productive at adapting the model parameters to a new enviroment.<br />
<br />
= Probabilistic Framework for Meta-Learning =<br />
<br />
This paper puts the meta-learning problem into a Markov Decision Process (MDP) framework common to RL, see Figure 1a. Instead of training examples <math>\{(x, y)\}</math>, we have trajectories <math>\tau = (x_0, a_1, x_1, R_1, a_2,x_2,R_2, ... a_H, x_H, R_H)</math>. A trajectory is sequence of states/observations <math>x_t</math>, actions <math>a_t</math> and rewards <math>R_t</math> that is sampled from a task <math> T </math> according to a policy <math>\pi_{\theta}</math>. Included with said task is a method for assigning loss values to trajectories <math>L_T(\tau)</math> which is typically the negative cumulative reward. A policy is a deterministic function that takes in a state and returns an action. Our goal here is to train a policy <math>\pi_{\theta}</math> with parameter vector <math>\theta</math>. This is analougous to training a function <math>f_{\theta}</math> that assigns labels <math>y</math> to feature vectors <math>x</math>. More precisely we have the following definitions:<br />
<br />
* <math>\tau = (x_0, a_1, x_1, R_1, x_2, ... a_H, x_H, R_H)</math> trajectories.<br />
* <math>T :=(L_T, P_T(x), P_T(x_t | x_{t-1}, a_{t-1}), H )</math> (A Task)<br />
* <math>D(T)</math> : A distribution over tasks.<br />
* <math>L_T</math>: A loss function for the task T that assigns numeric loss values to trajectories.<br />
* <math>P_T(x), P_T(x_t | x_{t-1}, a_{t-1})</math>: Probability measures specifying the markovian dynamics of the observations <math>x_t</math><br />
* <math>H</math>: The horizon of the MDP. This is a fixed natural number specifying the lengths of the tasks trajectories.<br />
<br />
The paper goes further to define a Markov dynamic for sequences of tasks as shown in Figure 1b. Thus the policy that we would like to meta learn <math>\pi_{\theta}</math>, after being exposed to a sample of K trajectories <math>\tau_\theta^{1:K}</math> from the task <math>T_i</math>, should produce a new policy <math>\pi_{\phi}</math> that will perform well on the next task <math>T_{i+1}</math>. Thus we seek to minimize the following expectation:<br />
<br />
<math>\mathrm{E}_{P(T_0), P(T_{i+1} | T_i)}\bigg(\sum_{i=1}^{l} \mathcal{L}_{T_i, T_{i+1}}(\theta)\bigg)</math>, <br />
<br />
where <math>\mathcal{L}_{T_i, T_{i + 1}}(\theta) := \mathrm{E}_{\tau_{i, \theta}^{1:K} } \bigg( \mathrm{E}_{\tau_{i+1, \phi}}\Big( L_{T_{i+1}}(\tau_{i+1, \phi} | \tau_{i, \theta}^{1:K}, \theta) \Big) \bigg) </math> and <math>l</math> is the number of tasks.<br />
<br />
The meta-policy <math>\pi_{\theta}</math> is trained and then adapted at test time using the following procedures. The computational graph is given in Figure 1c.<br />
<br />
[[File:MAML2.png | 800px]]<br />
<br />
The mathematics of calculating loss gradients is omitted.<br />
<br />
= Training Spiders to Run with Dynamic Handicaps (Robotic Locomotion in Non-Stationary Environments) =<br />
<br />
The authors used the MuJoCo physics simulator to create a simulated environment where robotic spiders with 6 legs are faced with the task of running due east as quickly as possible. The robotic spider observes the location and velocity of its body, and the angles and velocities of its legs. It interacts with the environment by exerting torque on the joints of its legs. Each leg has two joints, the joint closer to the body rotates horizontally while the joint farther from the body rotates vertically. The environment is made non-stationary by gradually paralyzing two legs of the spider across training and testing episodes. This allows the agent to adapt to new environments in each episode.<br />
Putting this example into the above probabilistic framework yields:<br />
<br />
* <math>T_i</math>: The task of walking east with the torques of two legs scaled by <math> (i-1)/6 </math><br />
* <math>\{T_i\}_{i=1}^{7}</math>: A sequence of tasks with the same two legs handicapped in each task. Note there are 15 different ways to choose such legs resulting in 15 sequences of tasks. 12 are used for training and 3 for testing.<br />
* A Markov Descision process composed of<br />
** Observations <math> x_t </math> containing information about the state of the spider.<br />
** Actions <math> a_t </math> containing information about the torques to apply to the spiders legs.<br />
** Rewards <math> R_t </math> corresponding to the speed at which the spider is moving east.<br />
<br />
Three differently structured policy neural networks are trained in this set up using both meta-learning and three different previously developed adaption methods.<br />
<br />
At testing time, the spiders following meta learned policies initially perform worse than the spiders using non-adaptive policies. However, by the third episode (<math> i=3 </math>), the meta-learners perform on par. And by the sixth episode, when the selected legs are mostly immobile, the meta-learners significantly out perform. These results can be seen in the graphs below.<br />
<br />
[[File:locomotion_results.png | 800px]]<br />
<br />
= Training Spiders to Fight Each Other (Adversarial Meta-Learning) =<br />
<br />
The authors created an adversarial environment called RoboSumo where pairs of agents with 4 (named Ants), 6 (named Bugs),or 8 legs (named spiders) sumo wrestle. The agents observe the location and velocity of their bodies and the bodies of their opponent, the angles and velocities of their legs, and the forces being exerted on them by their opponent (equivalent of tactile sense). The game is organized into episodes and rounds. Episodes are single wrestling matches with 500 timesteps and win/lose/draw outcomes. Agents win by pushing their opponent out of the ring or making their opponent's body touch the ground. An episode results in a draw when neither of these things happen after 500 timesteps. Rounds are batches of episodes. Rounds have possible outcomes win, lose, and draw that are decided based on majority of episodes won. K rounds will be fought. Both agents may update their policies between rounds. The agent that wins the majority of rounds is deemed the winner of the game.<br />
<br />
== Setup ==<br />
Similar to the Robotic locomotion example, this game can be phrased in terms of the RL MDP framework.<br />
<br />
* <math>T_i</math>: The task of fighting a round.<br />
* <math>\{T_i\}_{i=1}^{K}</math>: A sequence of rounds against the same opponent. Note that the opponent may update their policy between rounds but the anatomy of both wrestlers will be constant across rounds.<br />
* A Markov Descision process composed of<br />
** A horizon <math>H = 500*n</math> where <math>n</math> is the number of episodes per round.<br />
** Observations <math> x_t </math> containing information about the state of the agent and its opponent.<br />
** Actions <math> a_t </math> containing information about the torques to apply to the agents legs.<br />
** Rewards <math> R_t </math> rewards given to the agent based on its wrestling performance. <math>R_{500*n} = </math> +2000 if win episode, -2000 if lose, and -1000 if draw. A discount factor of <math> \gamma = 0.995 </math> is applied to the rewards.<br />
<br />
Note that the above reward set up is quite sparse. Therefore in order to encourage fast training, rewards are introduced at every time step for the following.<br />
* For staying close to the center of the ring.<br />
* For exerting force on the opponents body.<br />
* For moving towards the opponent.<br />
* For the distance of the opponent to the center of the ring.<br />
<br />
In addition to the sparse win/lose rewards, the following dense rewards are also introduced in the early training stages to encourage faster learning:<br />
* Quickly push the opponent outside - penalty proportional to the distance of there opponent from the center of the ring.<br />
* Moving towards the opponent - reward proportional to the velocity component towards the opponent.<br />
* Hit the opponent - reward proportional to square root of the total forces exerted on the opponent.<br />
* Control penalty - penalty denoted by <math> l_2 </math> on actions which lead to jittery/unnatural movements.<br />
<br />
<br />
This makes sense intuitively as these are reasonable goals for agents to explore when they are learning to wrestle.<br />
<br />
== Training ==<br />
The same combinations of policy networks and adaptation methods that were used in the locomotion example are trained and tested here. A family of non-adaptive policies are first trained via self-play and saved at all stages. Self-play simply means the two agents in the training environment use the same policy. All policy versions are saved so that agents of various skill levels can be sampled when training meta-learners. The weights of the different insects were calibrated such that the test win rate between two insects of differing anatomy, who have been trained for the same number of epochs via self-play, is close to 50%.<br />
<br />
[[File:weight_cal.png | 800px]]<br />
<br />
We can see in the above figure that the weight of the spider had to be increased by almost four times in order for the agents to be evenly matched.<br />
<br />
[[File:robosumo_results.png | 800px]]<br />
<br />
The above figure shows testing results for various adaptation strategies. The agent and opponent both start with the self-trained policies. The opponent uses all of its testing experience to continue training. The agent uses only the last 75 episodes to adapt its policy network. This shows that metal learners need only a limited amount of experience in order to hold their own against a constantly improving opponent.<br />
<br />
== Evaluating Adaptation Strategies ==<br />
To compare the proposed strategy against existing strategies, the authors leveraged the Robosumo task to have the different adaptation strategies compete against each other. The scoring metric used is TrueSkill, a metric very similar to ELO. Each strategy starts out with a fixed amount of points. When two different strategies compete the winner gains points and the loser loses points. The number of points gained and lost depend on the different in TrueSkill. A lower ranked opponent defeating a high ranked opponent will gain much more points than a high ranked opponent defeating a lower ranked one. It is assumed that agents skill levels are initially normally distributed with mean 25, and variance 25/3. 1000 matches are generated with each match consisting of 100-round iterated adaptation games. TrueSkill scores are updated after every match. The distribution of TrueSkill after playing 1000 matches are shown below:<br />
<br />
[[File:trueskill_results.PNG | 800px]]<br />
<br />
In summary recurrent policies are dominant, and the proposed meta-update performs better than or at par with the other strategies.<br />
<br />
The authors also demonstrated the same result by starting with 1000 agents, uniformly distributed in terms of adaptation strategies, and randomly matching them against each other for several generations. The loser of each match is removed and replaced with a duplicate of the winner. After several generations, the total population consists mostly of the proposed strategy. This is shown in the figure below.<br />
<br />
[[File:natural_selection.png | 800px]]<br />
<br />
= Future Work =<br />
The authors noted that the meta-learning adaptation rule they proposed is similar to backpropagation through time with a unit time lag, so a potential area for future research would be to introduce fully-recurrent meta-updates based on the full interaction history with the environment. Secondly, the algorithm proposed involves computing second-order derivatives at training time (see Figure 1c), which resulted in much slower training processes compared to baseline models during experiments, so they suggested finding a method to utilize information from the loss function without explicit backpropagation to speed up computations. The authors also mention that their approach likely will not work well with sparse rewards. This is because the meta-updates, which use policy gradients, are very dependent on the reward signal. They mention that this is an issue they would like to address in the future. A potential solution they have outlined for this is to introduce auxiliary dense rewards which could enable meta-learning.<br />
<br />
= Sources =<br />
# Chelsea Finn, Pieter Abbeel, Sergey Levine. "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks." arXiv preprint arXiv:1703.03400v3 (2017).<br />
# Richard S Sutton, Anna Koop, and David Silver. On the role of tracking in stationary environments. In Proceedings of the 24th international conference on Machine learning, pp. 871–878. ACM, 2007.</div>F7xiahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Continuous_Adaptation_via_Meta-Learning_in_Nonstationary_and_Competitive_Environments&diff=36458Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments2018-04-21T04:10:26Z<p>F7xia: /* Model Agnostic Meta-Learning */</p>
<hr />
<div>= Introduction =<br />
<br />
Typically, the basic goal of machine learning is to train a model to perform a task. In meta-learning, the goal is to train a model to perform the task of training a model to perform a task. Hence, in this case, the term "meta-Learning" has the exact meaning one would expect; the word "meta" has the precise function of introducing a layer of abstraction.<br />
<br />
The meta-learning task can be made more concrete by a simple example. Consider the CIFAR-100 classification task that we used for our data competition. We can alter this task from being a 100-class classification problem to a collection of 100 binary classification problems. The goal of meta-learning here is to design and train a single binary classifier for each class that will perform well on a randomly sampled task given a limited amount of training data for that specific task. In other words, we would like to train a model to perform the following procedure:<br />
<br />
# A task is sampled. The task is "Is X a dog?".<br />
# A small set of labeled training data is provided to the model. Each label is a boolean representing whether or not the corresponding image is a picture of a dog.<br />
# The model uses the training data to adjust itself to the specific task of checking whether or not an image is a picture of a dog.<br />
<br />
This example also highlights the intuition that the skill of sight is distinct and separable from the skill of knowing what a dog looks like.<br />
<br />
In this paper, a probabilistic framework for meta-learning is derived, then applied to tasks involving simulated robotic spiders. This framework generalizes the typical machine learning setup using Markov Decision Processes. This paper focuses on a multi-agent non-stationary environment which requires reinforcement learning (RL) agents to do a continuous adaptation in such an environment. Non-stationarity breaks the standard assumptions and requires agents to continuously adapt, both at training and execution time, in order to earn more rewards, hence the approach is to break this into a sequence of stationary tasks and present it as a multi-task learning problem.<br />
<br />
[[File:paper19_fig1.png|600px|frame|none|alt=Alt text| '''Figure 1'''. a) Illustrates a probabilistic model for Model Agnostic Meta-Learning (MAML) in a multi-task RL setting, where the tasks <math>T</math>, policies <math>\pi</math>, and trajectories <math>\tau</math> are all random variables with dependencies encoded in the edges of a given graph. b) The proposed extension to MAML suitable for continuous adaptation to a task changing dynamically due to non-stationarity of the environment. The distribution of tasks is represented by a Markov chain, whereby policies from a previous step are used to construct a new policy for the current step. c) The computation graph for the meta-update from <math>\phi_i</math> to <math>\phi_{i+1}</math>. Boxes represent replicas of the policy graphs with the specified parameters. The model is optimized using truncated backpropagation through time starting from <math>L_{T_{i+1}}</math>.]]<br />
<br />
= Background =<br />
== Markov Decision Process (MDP) ==<br />
A MDP is defined by the tuple <math>(S,A,P,r,\gamma)</math>, where S is a set of states, A is a set of actions, P is the transition probability distribution, r is the reward function, and <math>\gamma</math> is the discount factor. More information ([https://www.cs.cmu.edu/~katef/DeepRLControlCourse/lectures/lecture2_mdps.pdf here]).<br />
<br />
<br />
= Model Agnostic Meta-Learning =<br />
<br />
An initial framework for meta-learning is given in "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks" (Finn et al, 2017):<br />
<br />
"In our approach, the parameters of<br />
the model are explicitly trained such that a small<br />
number of gradient steps with a small amount<br />
of training data from a new task will produce<br />
good generalization performance on that task" (Finn et al, 2017).<br />
<br />
[[File:MAML.png | 500px]]<br />
<br />
Where <math> \theta </math> denotes the meta learning model parameter, and <math> \theta_i^' </math> denotes the model parameter after adapting to Task <math>\mathCal{T}_i</math><br />
<br />
In this training algorithm, the parameter vector <math>\theta</math> belonging to the model <math>f_{\theta}</math> is trained such that the meta-objective function <math>\mathcal{L} (\theta) = \sum_{\tau_i \sim P(\tau)} \mathcal{L}_{\tau_i} (f_{\theta_i' }) </math> is minimized. The sum in the objective function is over a sampled batch of training tasks. <math>\mathcal{L}_{\tau_i} (f_{\theta_i'})</math> is the training loss function corresponding to the <math>i^{th}</math> task in the batch evaluated at the model <math>f_{\theta_i'}</math>. The parameter vector <math>\theta_i'</math> is obtained by updating the general parameter <math>\theta</math> using the loss function <math>\mathcal{L}_{\tau_i}</math> and set of K training examples specific to the <math>i^{th}</math> task. Note that in alternate versions of this algorithm, additional testing sets are sampled from <math>\tau_i</math> and used to update <math>\theta</math> using testing loss functions instead of training loss functions.<br />
<br />
One of the important difference between this algorithm and more typical fine-tuning methods is that <math>\theta</math> is explicitly trained to be easily adjusted to perform well on different tasks rather than perform well on any specific tasks then fine tuned as the environment changes. (Sutton et al., 2007). In essence, the model is trained so that gradient steps are highly productive at adapting the model parameters to a new enviroment.<br />
<br />
= Probabilistic Framework for Meta-Learning =<br />
<br />
This paper puts the meta-learning problem into a Markov Decision Process (MDP) framework common to RL, see Figure 1a. Instead of training examples <math>\{(x, y)\}</math>, we have trajectories <math>\tau = (x_0, a_1, x_1, R_1, a_2,x_2,R_2, ... a_H, x_H, R_H)</math>. A trajectory is sequence of states/observations <math>x_t</math>, actions <math>a_t</math> and rewards <math>R_t</math> that is sampled from a task <math> T </math> according to a policy <math>\pi_{\theta}</math>. Included with said task is a method for assigning loss values to trajectories <math>L_T(\tau)</math> which is typically the negative cumulative reward. A policy is a deterministic function that takes in a state and returns an action. Our goal here is to train a policy <math>\pi_{\theta}</math> with parameter vector <math>\theta</math>. This is analougous to training a function <math>f_{\theta}</math> that assigns labels <math>y</math> to feature vectors <math>x</math>. More precisely we have the following definitions:<br />
<br />
* <math>\tau = (x_0, a_1, x_1, R_1, x_2, ... a_H, x_H, R_H)</math> trajectories.<br />
* <math>T :=(L_T, P_T(x), P_T(x_t | x_{t-1}, a_{t-1}), H )</math> (A Task)<br />
* <math>D(T)</math> : A distribution over tasks.<br />
* <math>L_T</math>: A loss function for the task T that assigns numeric loss values to trajectories.<br />
* <math>P_T(x), P_T(x_t | x_{t-1}, a_{t-1})</math>: Probability measures specifying the markovian dynamics of the observations <math>x_t</math><br />
* <math>H</math>: The horizon of the MDP. This is a fixed natural number specifying the lengths of the tasks trajectories.<br />
<br />
The paper goes further to define a Markov dynamic for sequences of tasks as shown in Figure 1b. Thus the policy that we would like to meta learn <math>\pi_{\theta}</math>, after being exposed to a sample of K trajectories <math>\tau_\theta^{1:K}</math> from the task <math>T_i</math>, should produce a new policy <math>\pi_{\phi}</math> that will perform well on the next task <math>T_{i+1}</math>. Thus we seek to minimize the following expectation:<br />
<br />
<math>\mathrm{E}_{P(T_0), P(T_{i+1} | T_i)}\bigg(\sum_{i=1}^{l} \mathcal{L}_{T_i, T_{i+1}}(\theta)\bigg)</math>, <br />
<br />
where <math>\mathcal{L}_{T_i, T_{i + 1}}(\theta) := \mathrm{E}_{\tau_{i, \theta}^{1:K} } \bigg( \mathrm{E}_{\tau_{i+1, \phi}}\Big( L_{T_{i+1}}(\tau_{i+1, \phi} | \tau_{i, \theta}^{1:K}, \theta) \Big) \bigg) </math> and <math>l</math> is the number of tasks.<br />
<br />
The meta-policy <math>\pi_{\theta}</math> is trained and then adapted at test time using the following procedures. The computational graph is given in Figure 1c.<br />
<br />
[[File:MAML2.png | 800px]]<br />
<br />
The mathematics of calculating loss gradients is omitted.<br />
<br />
= Training Spiders to Run with Dynamic Handicaps (Robotic Locomotion in Non-Stationary Environments) =<br />
<br />
The authors used the MuJoCo physics simulator to create a simulated environment where robotic spiders with 6 legs are faced with the task of running due east as quickly as possible. The robotic spider observes the location and velocity of its body, and the angles and velocities of its legs. It interacts with the environment by exerting torque on the joints of its legs. Each leg has two joints, the joint closer to the body rotates horizontally while the joint farther from the body rotates vertically. The environment is made non-stationary by gradually paralyzing two legs of the spider across training and testing episodes. This allows the agent to adapt to new environments in each episode.<br />
Putting this example into the above probabilistic framework yields:<br />
<br />
* <math>T_i</math>: The task of walking east with the torques of two legs scaled by <math> (i-1)/6 </math><br />
* <math>\{T_i\}_{i=1}^{7}</math>: A sequence of tasks with the same two legs handicapped in each task. Note there are 15 different ways to choose such legs resulting in 15 sequences of tasks. 12 are used for training and 3 for testing.<br />
* A Markov Descision process composed of<br />
** Observations <math> x_t </math> containing information about the state of the spider.<br />
** Actions <math> a_t </math> containing information about the torques to apply to the spiders legs.<br />
** Rewards <math> R_t </math> corresponding to the speed at which the spider is moving east.<br />
<br />
Three differently structured policy neural networks are trained in this set up using both meta-learning and three different previously developed adaption methods.<br />
<br />
At testing time, the spiders following meta learned policies initially perform worse than the spiders using non-adaptive policies. However, by the third episode (<math> i=3 </math>), the meta-learners perform on par. And by the sixth episode, when the selected legs are mostly immobile, the meta-learners significantly out perform. These results can be seen in the graphs below.<br />
<br />
[[File:locomotion_results.png | 800px]]<br />
<br />
= Training Spiders to Fight Each Other (Adversarial Meta-Learning) =<br />
<br />
The authors created an adversarial environment called RoboSumo where pairs of agents with 4 (named Ants), 6 (named Bugs),or 8 legs (named spiders) sumo wrestle. The agents observe the location and velocity of their bodies and the bodies of their opponent, the angles and velocities of their legs, and the forces being exerted on them by their opponent (equivalent of tactile sense). The game is organized into episodes and rounds. Episodes are single wrestling matches with 500 timesteps and win/lose/draw outcomes. Agents win by pushing their opponent out of the ring or making their opponent's body touch the ground. An episode results in a draw when neither of these things happen after 500 timesteps. Rounds are batches of episodes. Rounds have possible outcomes win, lose, and draw that are decided based on majority of episodes won. K rounds will be fought. Both agents may update their policies between rounds. The agent that wins the majority of rounds is deemed the winner of the game.<br />
<br />
== Setup ==<br />
Similar to the Robotic locomotion example, this game can be phrased in terms of the RL MDP framework.<br />
<br />
* <math>T_i</math>: The task of fighting a round.<br />
* <math>\{T_i\}_{i=1}^{K}</math>: A sequence of rounds against the same opponent. Note that the opponent may update their policy between rounds but the anatomy of both wrestlers will be constant across rounds.<br />
* A Markov Descision process composed of<br />
** A horizon <math>H = 500*n</math> where <math>n</math> is the number of episodes per round.<br />
** Observations <math> x_t </math> containing information about the state of the agent and its opponent.<br />
** Actions <math> a_t </math> containing information about the torques to apply to the agents legs.<br />
** Rewards <math> R_t </math> rewards given to the agent based on its wrestling performance. <math>R_{500*n} = </math> +2000 if win episode, -2000 if lose, and -1000 if draw. A discount factor of <math> \gamma = 0.995 </math> is applied to the rewards.<br />
<br />
Note that the above reward set up is quite sparse. Therefore in order to encourage fast training, rewards are introduced at every time step for the following.<br />
* For staying close to the center of the ring.<br />
* For exerting force on the opponents body.<br />
* For moving towards the opponent.<br />
* For the distance of the opponent to the center of the ring.<br />
<br />
In addition to the sparse win/lose rewards, the following dense rewards are also introduced in the early training stages to encourage faster learning:<br />
* Quickly push the opponent outside - penalty proportional to the distance of there opponent from the center of the ring.<br />
* Moving towards the opponent - reward proportional to the velocity component towards the opponent.<br />
* Hit the opponent - reward proportional to square root of the total forces exerted on the opponent.<br />
* Control penalty - penalty denoted by <math> l_2 </math> on actions which lead to jittery/unnatural movements.<br />
<br />
<br />
This makes sense intuitively as these are reasonable goals for agents to explore when they are learning to wrestle.<br />
<br />
== Training ==<br />
The same combinations of policy networks and adaptation methods that were used in the locomotion example are trained and tested here. A family of non-adaptive policies are first trained via self-play and saved at all stages. Self-play simply means the two agents in the training environment use the same policy. All policy versions are saved so that agents of various skill levels can be sampled when training meta-learners. The weights of the different insects were calibrated such that the test win rate between two insects of differing anatomy, who have been trained for the same number of epochs via self-play, is close to 50%.<br />
<br />
[[File:weight_cal.png | 800px]]<br />
<br />
We can see in the above figure that the weight of the spider had to be increased by almost four times in order for the agents to be evenly matched.<br />
<br />
[[File:robosumo_results.png | 800px]]<br />
<br />
The above figure shows testing results for various adaptation strategies. The agent and opponent both start with the self-trained policies. The opponent uses all of its testing experience to continue training. The agent uses only the last 75 episodes to adapt its policy network. This shows that metal learners need only a limited amount of experience in order to hold their own against a constantly improving opponent.<br />
<br />
== Evaluating Adaptation Strategies ==<br />
To compare the proposed strategy against existing strategies, the authors leveraged the Robosumo task to have the different adaptation strategies compete against each other. The scoring metric used is TrueSkill, a metric very similar to ELO. Each strategy starts out with a fixed amount of points. When two different strategies compete the winner gains points and the loser loses points. The number of points gained and lost depend on the different in TrueSkill. A lower ranked opponent defeating a high ranked opponent will gain much more points than a high ranked opponent defeating a lower ranked one. It is assumed that agents skill levels are initially normally distributed with mean 25, and variance 25/3. 1000 matches are generated with each match consisting of 100-round iterated adaptation games. TrueSkill scores are updated after every match. The distribution of TrueSkill after playing 1000 matches are shown below:<br />
<br />
[[File:trueskill_results.PNG | 800px]]<br />
<br />
In summary recurrent policies are dominant, and the proposed meta-update performs better than or at par with the other strategies.<br />
<br />
The authors also demonstrated the same result by starting with 1000 agents, uniformly distributed in terms of adaptation strategies, and randomly matching them against each other for several generations. The loser of each match is removed and replaced with a duplicate of the winner. After several generations, the total population consists mostly of the proposed strategy. This is shown in the figure below.<br />
<br />
[[File:natural_selection.png | 800px]]<br />
<br />
= Future Work =<br />
The authors noted that the meta-learning adaptation rule they proposed is similar to backpropagation through time with a unit time lag, so a potential area for future research would be to introduce fully-recurrent meta-updates based on the full interaction history with the environment. Secondly, the algorithm proposed involves computing second-order derivatives at training time (see Figure 1c), which resulted in much slower training processes compared to baseline models during experiments, so they suggested finding a method to utilize information from the loss function without explicit backpropagation to speed up computations. The authors also mention that their approach likely will not work well with sparse rewards. This is because the meta-updates, which use policy gradients, are very dependent on the reward signal. They mention that this is an issue they would like to address in the future. A potential solution they have outlined for this is to introduce auxiliary dense rewards which could enable meta-learning.<br />
<br />
= Sources =<br />
# Chelsea Finn, Pieter Abbeel, Sergey Levine. "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks." arXiv preprint arXiv:1703.03400v3 (2017).<br />
# Richard S Sutton, Anna Koop, and David Silver. On the role of tracking in stationary environments. In Proceedings of the 24th international conference on Machine learning, pp. 871–878. ACM, 2007.</div>F7xiahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Continuous_Adaptation_via_Meta-Learning_in_Nonstationary_and_Competitive_Environments&diff=36457Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments2018-04-21T04:10:08Z<p>F7xia: /* Model Agnostic Meta-Learning */</p>
<hr />
<div>= Introduction =<br />
<br />
Typically, the basic goal of machine learning is to train a model to perform a task. In meta-learning, the goal is to train a model to perform the task of training a model to perform a task. Hence, in this case, the term "meta-Learning" has the exact meaning one would expect; the word "meta" has the precise function of introducing a layer of abstraction.<br />
<br />
The meta-learning task can be made more concrete by a simple example. Consider the CIFAR-100 classification task that we used for our data competition. We can alter this task from being a 100-class classification problem to a collection of 100 binary classification problems. The goal of meta-learning here is to design and train a single binary classifier for each class that will perform well on a randomly sampled task given a limited amount of training data for that specific task. In other words, we would like to train a model to perform the following procedure:<br />
<br />
# A task is sampled. The task is "Is X a dog?".<br />
# A small set of labeled training data is provided to the model. Each label is a boolean representing whether or not the corresponding image is a picture of a dog.<br />
# The model uses the training data to adjust itself to the specific task of checking whether or not an image is a picture of a dog.<br />
<br />
This example also highlights the intuition that the skill of sight is distinct and separable from the skill of knowing what a dog looks like.<br />
<br />
In this paper, a probabilistic framework for meta-learning is derived, then applied to tasks involving simulated robotic spiders. This framework generalizes the typical machine learning setup using Markov Decision Processes. This paper focuses on a multi-agent non-stationary environment which requires reinforcement learning (RL) agents to do a continuous adaptation in such an environment. Non-stationarity breaks the standard assumptions and requires agents to continuously adapt, both at training and execution time, in order to earn more rewards, hence the approach is to break this into a sequence of stationary tasks and present it as a multi-task learning problem.<br />
<br />
[[File:paper19_fig1.png|600px|frame|none|alt=Alt text| '''Figure 1'''. a) Illustrates a probabilistic model for Model Agnostic Meta-Learning (MAML) in a multi-task RL setting, where the tasks <math>T</math>, policies <math>\pi</math>, and trajectories <math>\tau</math> are all random variables with dependencies encoded in the edges of a given graph. b) The proposed extension to MAML suitable for continuous adaptation to a task changing dynamically due to non-stationarity of the environment. The distribution of tasks is represented by a Markov chain, whereby policies from a previous step are used to construct a new policy for the current step. c) The computation graph for the meta-update from <math>\phi_i</math> to <math>\phi_{i+1}</math>. Boxes represent replicas of the policy graphs with the specified parameters. The model is optimized using truncated backpropagation through time starting from <math>L_{T_{i+1}}</math>.]]<br />
<br />
= Background =<br />
== Markov Decision Process (MDP) ==<br />
A MDP is defined by the tuple <math>(S,A,P,r,\gamma)</math>, where S is a set of states, A is a set of actions, P is the transition probability distribution, r is the reward function, and <math>\gamma</math> is the discount factor. More information ([https://www.cs.cmu.edu/~katef/DeepRLControlCourse/lectures/lecture2_mdps.pdf here]).<br />
<br />
<br />
= Model Agnostic Meta-Learning =<br />
<br />
An initial framework for meta-learning is given in "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks" (Finn et al, 2017):<br />
<br />
"In our approach, the parameters of<br />
the model are explicitly trained such that a small<br />
number of gradient steps with a small amount<br />
of training data from a new task will produce<br />
good generalization performance on that task" (Finn et al, 2017).<br />
<br />
[[File:MAML.png | 500px]]<br />
<br />
Where <math> \theta </math> denotes the meta learning model parameter, and <math> \theta^'_i </math> denotes the model parameter after adapting to Task <math>\mathCal{T}_i</math><br />
<br />
In this training algorithm, the parameter vector <math>\theta</math> belonging to the model <math>f_{\theta}</math> is trained such that the meta-objective function <math>\mathcal{L} (\theta) = \sum_{\tau_i \sim P(\tau)} \mathcal{L}_{\tau_i} (f_{\theta_i' }) </math> is minimized. The sum in the objective function is over a sampled batch of training tasks. <math>\mathcal{L}_{\tau_i} (f_{\theta_i'})</math> is the training loss function corresponding to the <math>i^{th}</math> task in the batch evaluated at the model <math>f_{\theta_i'}</math>. The parameter vector <math>\theta_i'</math> is obtained by updating the general parameter <math>\theta</math> using the loss function <math>\mathcal{L}_{\tau_i}</math> and set of K training examples specific to the <math>i^{th}</math> task. Note that in alternate versions of this algorithm, additional testing sets are sampled from <math>\tau_i</math> and used to update <math>\theta</math> using testing loss functions instead of training loss functions.<br />
<br />
One of the important difference between this algorithm and more typical fine-tuning methods is that <math>\theta</math> is explicitly trained to be easily adjusted to perform well on different tasks rather than perform well on any specific tasks then fine tuned as the environment changes. (Sutton et al., 2007). In essence, the model is trained so that gradient steps are highly productive at adapting the model parameters to a new enviroment.<br />
<br />
= Probabilistic Framework for Meta-Learning =<br />
<br />
This paper puts the meta-learning problem into a Markov Decision Process (MDP) framework common to RL, see Figure 1a. Instead of training examples <math>\{(x, y)\}</math>, we have trajectories <math>\tau = (x_0, a_1, x_1, R_1, a_2,x_2,R_2, ... a_H, x_H, R_H)</math>. A trajectory is sequence of states/observations <math>x_t</math>, actions <math>a_t</math> and rewards <math>R_t</math> that is sampled from a task <math> T </math> according to a policy <math>\pi_{\theta}</math>. Included with said task is a method for assigning loss values to trajectories <math>L_T(\tau)</math> which is typically the negative cumulative reward. A policy is a deterministic function that takes in a state and returns an action. Our goal here is to train a policy <math>\pi_{\theta}</math> with parameter vector <math>\theta</math>. This is analougous to training a function <math>f_{\theta}</math> that assigns labels <math>y</math> to feature vectors <math>x</math>. More precisely we have the following definitions:<br />
<br />
* <math>\tau = (x_0, a_1, x_1, R_1, x_2, ... a_H, x_H, R_H)</math> trajectories.<br />
* <math>T :=(L_T, P_T(x), P_T(x_t | x_{t-1}, a_{t-1}), H )</math> (A Task)<br />
* <math>D(T)</math> : A distribution over tasks.<br />
* <math>L_T</math>: A loss function for the task T that assigns numeric loss values to trajectories.<br />
* <math>P_T(x), P_T(x_t | x_{t-1}, a_{t-1})</math>: Probability measures specifying the markovian dynamics of the observations <math>x_t</math><br />
* <math>H</math>: The horizon of the MDP. This is a fixed natural number specifying the lengths of the tasks trajectories.<br />
<br />
The paper goes further to define a Markov dynamic for sequences of tasks as shown in Figure 1b. Thus the policy that we would like to meta learn <math>\pi_{\theta}</math>, after being exposed to a sample of K trajectories <math>\tau_\theta^{1:K}</math> from the task <math>T_i</math>, should produce a new policy <math>\pi_{\phi}</math> that will perform well on the next task <math>T_{i+1}</math>. Thus we seek to minimize the following expectation:<br />
<br />
<math>\mathrm{E}_{P(T_0), P(T_{i+1} | T_i)}\bigg(\sum_{i=1}^{l} \mathcal{L}_{T_i, T_{i+1}}(\theta)\bigg)</math>, <br />
<br />
where <math>\mathcal{L}_{T_i, T_{i + 1}}(\theta) := \mathrm{E}_{\tau_{i, \theta}^{1:K} } \bigg( \mathrm{E}_{\tau_{i+1, \phi}}\Big( L_{T_{i+1}}(\tau_{i+1, \phi} | \tau_{i, \theta}^{1:K}, \theta) \Big) \bigg) </math> and <math>l</math> is the number of tasks.<br />
<br />
The meta-policy <math>\pi_{\theta}</math> is trained and then adapted at test time using the following procedures. The computational graph is given in Figure 1c.<br />
<br />
[[File:MAML2.png | 800px]]<br />
<br />
The mathematics of calculating loss gradients is omitted.<br />
<br />
= Training Spiders to Run with Dynamic Handicaps (Robotic Locomotion in Non-Stationary Environments) =<br />
<br />
The authors used the MuJoCo physics simulator to create a simulated environment where robotic spiders with 6 legs are faced with the task of running due east as quickly as possible. The robotic spider observes the location and velocity of its body, and the angles and velocities of its legs. It interacts with the environment by exerting torque on the joints of its legs. Each leg has two joints, the joint closer to the body rotates horizontally while the joint farther from the body rotates vertically. The environment is made non-stationary by gradually paralyzing two legs of the spider across training and testing episodes. This allows the agent to adapt to new environments in each episode.<br />
Putting this example into the above probabilistic framework yields:<br />
<br />
* <math>T_i</math>: The task of walking east with the torques of two legs scaled by <math> (i-1)/6 </math><br />
* <math>\{T_i\}_{i=1}^{7}</math>: A sequence of tasks with the same two legs handicapped in each task. Note there are 15 different ways to choose such legs resulting in 15 sequences of tasks. 12 are used for training and 3 for testing.<br />
* A Markov Descision process composed of<br />
** Observations <math> x_t </math> containing information about the state of the spider.<br />
** Actions <math> a_t </math> containing information about the torques to apply to the spiders legs.<br />
** Rewards <math> R_t </math> corresponding to the speed at which the spider is moving east.<br />
<br />
Three differently structured policy neural networks are trained in this set up using both meta-learning and three different previously developed adaption methods.<br />
<br />
At testing time, the spiders following meta learned policies initially perform worse than the spiders using non-adaptive policies. However, by the third episode (<math> i=3 </math>), the meta-learners perform on par. And by the sixth episode, when the selected legs are mostly immobile, the meta-learners significantly out perform. These results can be seen in the graphs below.<br />
<br />
[[File:locomotion_results.png | 800px]]<br />
<br />
= Training Spiders to Fight Each Other (Adversarial Meta-Learning) =<br />
<br />
The authors created an adversarial environment called RoboSumo where pairs of agents with 4 (named Ants), 6 (named Bugs),or 8 legs (named spiders) sumo wrestle. The agents observe the location and velocity of their bodies and the bodies of their opponent, the angles and velocities of their legs, and the forces being exerted on them by their opponent (equivalent of tactile sense). The game is organized into episodes and rounds. Episodes are single wrestling matches with 500 timesteps and win/lose/draw outcomes. Agents win by pushing their opponent out of the ring or making their opponent's body touch the ground. An episode results in a draw when neither of these things happen after 500 timesteps. Rounds are batches of episodes. Rounds have possible outcomes win, lose, and draw that are decided based on majority of episodes won. K rounds will be fought. Both agents may update their policies between rounds. The agent that wins the majority of rounds is deemed the winner of the game.<br />
<br />
== Setup ==<br />
Similar to the Robotic locomotion example, this game can be phrased in terms of the RL MDP framework.<br />
<br />
* <math>T_i</math>: The task of fighting a round.<br />
* <math>\{T_i\}_{i=1}^{K}</math>: A sequence of rounds against the same opponent. Note that the opponent may update their policy between rounds but the anatomy of both wrestlers will be constant across rounds.<br />
* A Markov Descision process composed of<br />
** A horizon <math>H = 500*n</math> where <math>n</math> is the number of episodes per round.<br />
** Observations <math> x_t </math> containing information about the state of the agent and its opponent.<br />
** Actions <math> a_t </math> containing information about the torques to apply to the agents legs.<br />
** Rewards <math> R_t </math> rewards given to the agent based on its wrestling performance. <math>R_{500*n} = </math> +2000 if win episode, -2000 if lose, and -1000 if draw. A discount factor of <math> \gamma = 0.995 </math> is applied to the rewards.<br />
<br />
Note that the above reward set up is quite sparse. Therefore in order to encourage fast training, rewards are introduced at every time step for the following.<br />
* For staying close to the center of the ring.<br />
* For exerting force on the opponents body.<br />
* For moving towards the opponent.<br />
* For the distance of the opponent to the center of the ring.<br />
<br />
In addition to the sparse win/lose rewards, the following dense rewards are also introduced in the early training stages to encourage faster learning:<br />
* Quickly push the opponent outside - penalty proportional to the distance of there opponent from the center of the ring.<br />
* Moving towards the opponent - reward proportional to the velocity component towards the opponent.<br />
* Hit the opponent - reward proportional to square root of the total forces exerted on the opponent.<br />
* Control penalty - penalty denoted by <math> l_2 </math> on actions which lead to jittery/unnatural movements.<br />
<br />
<br />
This makes sense intuitively as these are reasonable goals for agents to explore when they are learning to wrestle.<br />
<br />
== Training ==<br />
The same combinations of policy networks and adaptation methods that were used in the locomotion example are trained and tested here. A family of non-adaptive policies are first trained via self-play and saved at all stages. Self-play simply means the two agents in the training environment use the same policy. All policy versions are saved so that agents of various skill levels can be sampled when training meta-learners. The weights of the different insects were calibrated such that the test win rate between two insects of differing anatomy, who have been trained for the same number of epochs via self-play, is close to 50%.<br />
<br />
[[File:weight_cal.png | 800px]]<br />
<br />
We can see in the above figure that the weight of the spider had to be increased by almost four times in order for the agents to be evenly matched.<br />
<br />
[[File:robosumo_results.png | 800px]]<br />
<br />
The above figure shows testing results for various adaptation strategies. The agent and opponent both start with the self-trained policies. The opponent uses all of its testing experience to continue training. The agent uses only the last 75 episodes to adapt its policy network. This shows that metal learners need only a limited amount of experience in order to hold their own against a constantly improving opponent.<br />
<br />
== Evaluating Adaptation Strategies ==<br />
To compare the proposed strategy against existing strategies, the authors leveraged the Robosumo task to have the different adaptation strategies compete against each other. The scoring metric used is TrueSkill, a metric very similar to ELO. Each strategy starts out with a fixed amount of points. When two different strategies compete the winner gains points and the loser loses points. The number of points gained and lost depend on the different in TrueSkill. A lower ranked opponent defeating a high ranked opponent will gain much more points than a high ranked opponent defeating a lower ranked one. It is assumed that agents skill levels are initially normally distributed with mean 25, and variance 25/3. 1000 matches are generated with each match consisting of 100-round iterated adaptation games. TrueSkill scores are updated after every match. The distribution of TrueSkill after playing 1000 matches are shown below:<br />
<br />
[[File:trueskill_results.PNG | 800px]]<br />
<br />
In summary recurrent policies are dominant, and the proposed meta-update performs better than or at par with the other strategies.<br />
<br />
The authors also demonstrated the same result by starting with 1000 agents, uniformly distributed in terms of adaptation strategies, and randomly matching them against each other for several generations. The loser of each match is removed and replaced with a duplicate of the winner. After several generations, the total population consists mostly of the proposed strategy. This is shown in the figure below.<br />
<br />
[[File:natural_selection.png | 800px]]<br />
<br />
= Future Work =<br />
The authors noted that the meta-learning adaptation rule they proposed is similar to backpropagation through time with a unit time lag, so a potential area for future research would be to introduce fully-recurrent meta-updates based on the full interaction history with the environment. Secondly, the algorithm proposed involves computing second-order derivatives at training time (see Figure 1c), which resulted in much slower training processes compared to baseline models during experiments, so they suggested finding a method to utilize information from the loss function without explicit backpropagation to speed up computations. The authors also mention that their approach likely will not work well with sparse rewards. This is because the meta-updates, which use policy gradients, are very dependent on the reward signal. They mention that this is an issue they would like to address in the future. A potential solution they have outlined for this is to introduce auxiliary dense rewards which could enable meta-learning.<br />
<br />
= Sources =<br />
# Chelsea Finn, Pieter Abbeel, Sergey Levine. "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks." arXiv preprint arXiv:1703.03400v3 (2017).<br />
# Richard S Sutton, Anna Koop, and David Silver. On the role of tracking in stationary environments. In Proceedings of the 24th international conference on Machine learning, pp. 871–878. ACM, 2007.</div>F7xiahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Continuous_Adaptation_via_Meta-Learning_in_Nonstationary_and_Competitive_Environments&diff=36456Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments2018-04-21T04:09:41Z<p>F7xia: /* Model Agnostic Meta-Learning */</p>
<hr />
<div>= Introduction =<br />
<br />
Typically, the basic goal of machine learning is to train a model to perform a task. In meta-learning, the goal is to train a model to perform the task of training a model to perform a task. Hence, in this case, the term "meta-Learning" has the exact meaning one would expect; the word "meta" has the precise function of introducing a layer of abstraction.<br />
<br />
The meta-learning task can be made more concrete by a simple example. Consider the CIFAR-100 classification task that we used for our data competition. We can alter this task from being a 100-class classification problem to a collection of 100 binary classification problems. The goal of meta-learning here is to design and train a single binary classifier for each class that will perform well on a randomly sampled task given a limited amount of training data for that specific task. In other words, we would like to train a model to perform the following procedure:<br />
<br />
# A task is sampled. The task is "Is X a dog?".<br />
# A small set of labeled training data is provided to the model. Each label is a boolean representing whether or not the corresponding image is a picture of a dog.<br />
# The model uses the training data to adjust itself to the specific task of checking whether or not an image is a picture of a dog.<br />
<br />
This example also highlights the intuition that the skill of sight is distinct and separable from the skill of knowing what a dog looks like.<br />
<br />
In this paper, a probabilistic framework for meta-learning is derived, then applied to tasks involving simulated robotic spiders. This framework generalizes the typical machine learning setup using Markov Decision Processes. This paper focuses on a multi-agent non-stationary environment which requires reinforcement learning (RL) agents to do a continuous adaptation in such an environment. Non-stationarity breaks the standard assumptions and requires agents to continuously adapt, both at training and execution time, in order to earn more rewards, hence the approach is to break this into a sequence of stationary tasks and present it as a multi-task learning problem.<br />
<br />
[[File:paper19_fig1.png|600px|frame|none|alt=Alt text| '''Figure 1'''. a) Illustrates a probabilistic model for Model Agnostic Meta-Learning (MAML) in a multi-task RL setting, where the tasks <math>T</math>, policies <math>\pi</math>, and trajectories <math>\tau</math> are all random variables with dependencies encoded in the edges of a given graph. b) The proposed extension to MAML suitable for continuous adaptation to a task changing dynamically due to non-stationarity of the environment. The distribution of tasks is represented by a Markov chain, whereby policies from a previous step are used to construct a new policy for the current step. c) The computation graph for the meta-update from <math>\phi_i</math> to <math>\phi_{i+1}</math>. Boxes represent replicas of the policy graphs with the specified parameters. The model is optimized using truncated backpropagation through time starting from <math>L_{T_{i+1}}</math>.]]<br />
<br />
= Background =<br />
== Markov Decision Process (MDP) ==<br />
A MDP is defined by the tuple <math>(S,A,P,r,\gamma)</math>, where S is a set of states, A is a set of actions, P is the transition probability distribution, r is the reward function, and <math>\gamma</math> is the discount factor. More information ([https://www.cs.cmu.edu/~katef/DeepRLControlCourse/lectures/lecture2_mdps.pdf here]).<br />
<br />
<br />
= Model Agnostic Meta-Learning =<br />
<br />
An initial framework for meta-learning is given in "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks" (Finn et al, 2017):<br />
<br />
"In our approach, the parameters of<br />
the model are explicitly trained such that a small<br />
number of gradient steps with a small amount<br />
of training data from a new task will produce<br />
good generalization performance on that task" (Finn et al, 2017).<br />
<br />
[[File:MAML.png | 500px]]<br />
<br />
Where <math> \theta </math> denotes the meta learning model parameter, and <math>\theta^'_i</math> denotes the model parameter after adapting to Task <math>\mathCal{T}_i</math><br />
<br />
In this training algorithm, the parameter vector <math>\theta</math> belonging to the model <math>f_{\theta}</math> is trained such that the meta-objective function <math>\mathcal{L} (\theta) = \sum_{\tau_i \sim P(\tau)} \mathcal{L}_{\tau_i} (f_{\theta_i' }) </math> is minimized. The sum in the objective function is over a sampled batch of training tasks. <math>\mathcal{L}_{\tau_i} (f_{\theta_i'})</math> is the training loss function corresponding to the <math>i^{th}</math> task in the batch evaluated at the model <math>f_{\theta_i'}</math>. The parameter vector <math>\theta_i'</math> is obtained by updating the general parameter <math>\theta</math> using the loss function <math>\mathcal{L}_{\tau_i}</math> and set of K training examples specific to the <math>i^{th}</math> task. Note that in alternate versions of this algorithm, additional testing sets are sampled from <math>\tau_i</math> and used to update <math>\theta</math> using testing loss functions instead of training loss functions.<br />
<br />
One of the important difference between this algorithm and more typical fine-tuning methods is that <math>\theta</math> is explicitly trained to be easily adjusted to perform well on different tasks rather than perform well on any specific tasks then fine tuned as the environment changes. (Sutton et al., 2007). In essence, the model is trained so that gradient steps are highly productive at adapting the model parameters to a new enviroment.<br />
<br />
= Probabilistic Framework for Meta-Learning =<br />
<br />
This paper puts the meta-learning problem into a Markov Decision Process (MDP) framework common to RL, see Figure 1a. Instead of training examples <math>\{(x, y)\}</math>, we have trajectories <math>\tau = (x_0, a_1, x_1, R_1, a_2,x_2,R_2, ... a_H, x_H, R_H)</math>. A trajectory is sequence of states/observations <math>x_t</math>, actions <math>a_t</math> and rewards <math>R_t</math> that is sampled from a task <math> T </math> according to a policy <math>\pi_{\theta}</math>. Included with said task is a method for assigning loss values to trajectories <math>L_T(\tau)</math> which is typically the negative cumulative reward. A policy is a deterministic function that takes in a state and returns an action. Our goal here is to train a policy <math>\pi_{\theta}</math> with parameter vector <math>\theta</math>. This is analougous to training a function <math>f_{\theta}</math> that assigns labels <math>y</math> to feature vectors <math>x</math>. More precisely we have the following definitions:<br />
<br />
* <math>\tau = (x_0, a_1, x_1, R_1, x_2, ... a_H, x_H, R_H)</math> trajectories.<br />
* <math>T :=(L_T, P_T(x), P_T(x_t | x_{t-1}, a_{t-1}), H )</math> (A Task)<br />
* <math>D(T)</math> : A distribution over tasks.<br />
* <math>L_T</math>: A loss function for the task T that assigns numeric loss values to trajectories.<br />
* <math>P_T(x), P_T(x_t | x_{t-1}, a_{t-1})</math>: Probability measures specifying the markovian dynamics of the observations <math>x_t</math><br />
* <math>H</math>: The horizon of the MDP. This is a fixed natural number specifying the lengths of the tasks trajectories.<br />
<br />
The paper goes further to define a Markov dynamic for sequences of tasks as shown in Figure 1b. Thus the policy that we would like to meta learn <math>\pi_{\theta}</math>, after being exposed to a sample of K trajectories <math>\tau_\theta^{1:K}</math> from the task <math>T_i</math>, should produce a new policy <math>\pi_{\phi}</math> that will perform well on the next task <math>T_{i+1}</math>. Thus we seek to minimize the following expectation:<br />
<br />
<math>\mathrm{E}_{P(T_0), P(T_{i+1} | T_i)}\bigg(\sum_{i=1}^{l} \mathcal{L}_{T_i, T_{i+1}}(\theta)\bigg)</math>, <br />
<br />
where <math>\mathcal{L}_{T_i, T_{i + 1}}(\theta) := \mathrm{E}_{\tau_{i, \theta}^{1:K} } \bigg( \mathrm{E}_{\tau_{i+1, \phi}}\Big( L_{T_{i+1}}(\tau_{i+1, \phi} | \tau_{i, \theta}^{1:K}, \theta) \Big) \bigg) </math> and <math>l</math> is the number of tasks.<br />
<br />
The meta-policy <math>\pi_{\theta}</math> is trained and then adapted at test time using the following procedures. The computational graph is given in Figure 1c.<br />
<br />
[[File:MAML2.png | 800px]]<br />
<br />
The mathematics of calculating loss gradients is omitted.<br />
<br />
= Training Spiders to Run with Dynamic Handicaps (Robotic Locomotion in Non-Stationary Environments) =<br />
<br />
The authors used the MuJoCo physics simulator to create a simulated environment where robotic spiders with 6 legs are faced with the task of running due east as quickly as possible. The robotic spider observes the location and velocity of its body, and the angles and velocities of its legs. It interacts with the environment by exerting torque on the joints of its legs. Each leg has two joints, the joint closer to the body rotates horizontally while the joint farther from the body rotates vertically. The environment is made non-stationary by gradually paralyzing two legs of the spider across training and testing episodes. This allows the agent to adapt to new environments in each episode.<br />
Putting this example into the above probabilistic framework yields:<br />
<br />
* <math>T_i</math>: The task of walking east with the torques of two legs scaled by <math> (i-1)/6 </math><br />
* <math>\{T_i\}_{i=1}^{7}</math>: A sequence of tasks with the same two legs handicapped in each task. Note there are 15 different ways to choose such legs resulting in 15 sequences of tasks. 12 are used for training and 3 for testing.<br />
* A Markov Descision process composed of<br />
** Observations <math> x_t </math> containing information about the state of the spider.<br />
** Actions <math> a_t </math> containing information about the torques to apply to the spiders legs.<br />
** Rewards <math> R_t </math> corresponding to the speed at which the spider is moving east.<br />
<br />
Three differently structured policy neural networks are trained in this set up using both meta-learning and three different previously developed adaption methods.<br />
<br />
At testing time, the spiders following meta learned policies initially perform worse than the spiders using non-adaptive policies. However, by the third episode (<math> i=3 </math>), the meta-learners perform on par. And by the sixth episode, when the selected legs are mostly immobile, the meta-learners significantly out perform. These results can be seen in the graphs below.<br />
<br />
[[File:locomotion_results.png | 800px]]<br />
<br />
= Training Spiders to Fight Each Other (Adversarial Meta-Learning) =<br />
<br />
The authors created an adversarial environment called RoboSumo where pairs of agents with 4 (named Ants), 6 (named Bugs),or 8 legs (named spiders) sumo wrestle. The agents observe the location and velocity of their bodies and the bodies of their opponent, the angles and velocities of their legs, and the forces being exerted on them by their opponent (equivalent of tactile sense). The game is organized into episodes and rounds. Episodes are single wrestling matches with 500 timesteps and win/lose/draw outcomes. Agents win by pushing their opponent out of the ring or making their opponent's body touch the ground. An episode results in a draw when neither of these things happen after 500 timesteps. Rounds are batches of episodes. Rounds have possible outcomes win, lose, and draw that are decided based on majority of episodes won. K rounds will be fought. Both agents may update their policies between rounds. The agent that wins the majority of rounds is deemed the winner of the game.<br />
<br />
== Setup ==<br />
Similar to the Robotic locomotion example, this game can be phrased in terms of the RL MDP framework.<br />
<br />
* <math>T_i</math>: The task of fighting a round.<br />
* <math>\{T_i\}_{i=1}^{K}</math>: A sequence of rounds against the same opponent. Note that the opponent may update their policy between rounds but the anatomy of both wrestlers will be constant across rounds.<br />
* A Markov Descision process composed of<br />
** A horizon <math>H = 500*n</math> where <math>n</math> is the number of episodes per round.<br />
** Observations <math> x_t </math> containing information about the state of the agent and its opponent.<br />
** Actions <math> a_t </math> containing information about the torques to apply to the agents legs.<br />
** Rewards <math> R_t </math> rewards given to the agent based on its wrestling performance. <math>R_{500*n} = </math> +2000 if win episode, -2000 if lose, and -1000 if draw. A discount factor of <math> \gamma = 0.995 </math> is applied to the rewards.<br />
<br />
Note that the above reward set up is quite sparse. Therefore in order to encourage fast training, rewards are introduced at every time step for the following.<br />
* For staying close to the center of the ring.<br />
* For exerting force on the opponents body.<br />
* For moving towards the opponent.<br />
* For the distance of the opponent to the center of the ring.<br />
<br />
In addition to the sparse win/lose rewards, the following dense rewards are also introduced in the early training stages to encourage faster learning:<br />
* Quickly push the opponent outside - penalty proportional to the distance of there opponent from the center of the ring.<br />
* Moving towards the opponent - reward proportional to the velocity component towards the opponent.<br />
* Hit the opponent - reward proportional to square root of the total forces exerted on the opponent.<br />
* Control penalty - penalty denoted by <math> l_2 </math> on actions which lead to jittery/unnatural movements.<br />
<br />
<br />
This makes sense intuitively as these are reasonable goals for agents to explore when they are learning to wrestle.<br />
<br />
== Training ==<br />
The same combinations of policy networks and adaptation methods that were used in the locomotion example are trained and tested here. A family of non-adaptive policies are first trained via self-play and saved at all stages. Self-play simply means the two agents in the training environment use the same policy. All policy versions are saved so that agents of various skill levels can be sampled when training meta-learners. The weights of the different insects were calibrated such that the test win rate between two insects of differing anatomy, who have been trained for the same number of epochs via self-play, is close to 50%.<br />
<br />
[[File:weight_cal.png | 800px]]<br />
<br />
We can see in the above figure that the weight of the spider had to be increased by almost four times in order for the agents to be evenly matched.<br />
<br />
[[File:robosumo_results.png | 800px]]<br />
<br />
The above figure shows testing results for various adaptation strategies. The agent and opponent both start with the self-trained policies. The opponent uses all of its testing experience to continue training. The agent uses only the last 75 episodes to adapt its policy network. This shows that metal learners need only a limited amount of experience in order to hold their own against a constantly improving opponent.<br />
<br />
== Evaluating Adaptation Strategies ==<br />
To compare the proposed strategy against existing strategies, the authors leveraged the Robosumo task to have the different adaptation strategies compete against each other. The scoring metric used is TrueSkill, a metric very similar to ELO. Each strategy starts out with a fixed amount of points. When two different strategies compete the winner gains points and the loser loses points. The number of points gained and lost depend on the different in TrueSkill. A lower ranked opponent defeating a high ranked opponent will gain much more points than a high ranked opponent defeating a lower ranked one. It is assumed that agents skill levels are initially normally distributed with mean 25, and variance 25/3. 1000 matches are generated with each match consisting of 100-round iterated adaptation games. TrueSkill scores are updated after every match. The distribution of TrueSkill after playing 1000 matches are shown below:<br />
<br />
[[File:trueskill_results.PNG | 800px]]<br />
<br />
In summary recurrent policies are dominant, and the proposed meta-update performs better than or at par with the other strategies.<br />
<br />
The authors also demonstrated the same result by starting with 1000 agents, uniformly distributed in terms of adaptation strategies, and randomly matching them against each other for several generations. The loser of each match is removed and replaced with a duplicate of the winner. After several generations, the total population consists mostly of the proposed strategy. This is shown in the figure below.<br />
<br />
[[File:natural_selection.png | 800px]]<br />
<br />
= Future Work =<br />
The authors noted that the meta-learning adaptation rule they proposed is similar to backpropagation through time with a unit time lag, so a potential area for future research would be to introduce fully-recurrent meta-updates based on the full interaction history with the environment. Secondly, the algorithm proposed involves computing second-order derivatives at training time (see Figure 1c), which resulted in much slower training processes compared to baseline models during experiments, so they suggested finding a method to utilize information from the loss function without explicit backpropagation to speed up computations. The authors also mention that their approach likely will not work well with sparse rewards. This is because the meta-updates, which use policy gradients, are very dependent on the reward signal. They mention that this is an issue they would like to address in the future. A potential solution they have outlined for this is to introduce auxiliary dense rewards which could enable meta-learning.<br />
<br />
= Sources =<br />
# Chelsea Finn, Pieter Abbeel, Sergey Levine. "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks." arXiv preprint arXiv:1703.03400v3 (2017).<br />
# Richard S Sutton, Anna Koop, and David Silver. On the role of tracking in stationary environments. In Proceedings of the 24th international conference on Machine learning, pp. 871–878. ACM, 2007.</div>F7xiahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Continuous_Adaptation_via_Meta-Learning_in_Nonstationary_and_Competitive_Environments&diff=36455Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments2018-04-21T04:07:24Z<p>F7xia: /* Introduction */</p>
<hr />
<div>= Introduction =<br />
<br />
Typically, the basic goal of machine learning is to train a model to perform a task. In meta-learning, the goal is to train a model to perform the task of training a model to perform a task. Hence, in this case, the term "meta-Learning" has the exact meaning one would expect; the word "meta" has the precise function of introducing a layer of abstraction.<br />
<br />
The meta-learning task can be made more concrete by a simple example. Consider the CIFAR-100 classification task that we used for our data competition. We can alter this task from being a 100-class classification problem to a collection of 100 binary classification problems. The goal of meta-learning here is to design and train a single binary classifier for each class that will perform well on a randomly sampled task given a limited amount of training data for that specific task. In other words, we would like to train a model to perform the following procedure:<br />
<br />
# A task is sampled. The task is "Is X a dog?".<br />
# A small set of labeled training data is provided to the model. Each label is a boolean representing whether or not the corresponding image is a picture of a dog.<br />
# The model uses the training data to adjust itself to the specific task of checking whether or not an image is a picture of a dog.<br />
<br />
This example also highlights the intuition that the skill of sight is distinct and separable from the skill of knowing what a dog looks like.<br />
<br />
In this paper, a probabilistic framework for meta-learning is derived, then applied to tasks involving simulated robotic spiders. This framework generalizes the typical machine learning setup using Markov Decision Processes. This paper focuses on a multi-agent non-stationary environment which requires reinforcement learning (RL) agents to do a continuous adaptation in such an environment. Non-stationarity breaks the standard assumptions and requires agents to continuously adapt, both at training and execution time, in order to earn more rewards, hence the approach is to break this into a sequence of stationary tasks and present it as a multi-task learning problem.<br />
<br />
[[File:paper19_fig1.png|600px|frame|none|alt=Alt text| '''Figure 1'''. a) Illustrates a probabilistic model for Model Agnostic Meta-Learning (MAML) in a multi-task RL setting, where the tasks <math>T</math>, policies <math>\pi</math>, and trajectories <math>\tau</math> are all random variables with dependencies encoded in the edges of a given graph. b) The proposed extension to MAML suitable for continuous adaptation to a task changing dynamically due to non-stationarity of the environment. The distribution of tasks is represented by a Markov chain, whereby policies from a previous step are used to construct a new policy for the current step. c) The computation graph for the meta-update from <math>\phi_i</math> to <math>\phi_{i+1}</math>. Boxes represent replicas of the policy graphs with the specified parameters. The model is optimized using truncated backpropagation through time starting from <math>L_{T_{i+1}}</math>.]]<br />
<br />
= Background =<br />
== Markov Decision Process (MDP) ==<br />
A MDP is defined by the tuple <math>(S,A,P,r,\gamma)</math>, where S is a set of states, A is a set of actions, P is the transition probability distribution, r is the reward function, and <math>\gamma</math> is the discount factor. More information ([https://www.cs.cmu.edu/~katef/DeepRLControlCourse/lectures/lecture2_mdps.pdf here]).<br />
<br />
<br />
= Model Agnostic Meta-Learning =<br />
<br />
An initial framework for meta-learning is given in "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks" (Finn et al, 2017):<br />
<br />
"In our approach, the parameters of<br />
the model are explicitly trained such that a small<br />
number of gradient steps with a small amount<br />
of training data from a new task will produce<br />
good generalization performance on that task" (Finn et al, 2017).<br />
<br />
[[File:MAML.png | 500px]]<br />
<br />
Where <math>\theta</math> denotes the meta learning model parameter, and <math>\theta^'_i</math> denotes the model parameter after adapting to Task <math>\mathCal{T}_i</math><br />
<br />
In this training algorithm, the parameter vector <math>\theta</math> belonging to the model <math>f_{\theta}</math> is trained such that the meta-objective function <math>\mathcal{L} (\theta) = \sum_{\tau_i \sim P(\tau)} \mathcal{L}_{\tau_i} (f_{\theta_i' }) </math> is minimized. The sum in the objective function is over a sampled batch of training tasks. <math>\mathcal{L}_{\tau_i} (f_{\theta_i'})</math> is the training loss function corresponding to the <math>i^{th}</math> task in the batch evaluated at the model <math>f_{\theta_i'}</math>. The parameter vector <math>\theta_i'</math> is obtained by updating the general parameter <math>\theta</math> using the loss function <math>\mathcal{L}_{\tau_i}</math> and set of K training examples specific to the <math>i^{th}</math> task. Note that in alternate versions of this algorithm, additional testing sets are sampled from <math>\tau_i</math> and used to update <math>\theta</math> using testing loss functions instead of training loss functions.<br />
<br />
One of the important difference between this algorithm and more typical fine-tuning methods is that <math>\theta</math> is explicitly trained to be easily adjusted to perform well on different tasks rather than perform well on any specific tasks then fine tuned as the environment changes. (Sutton et al., 2007). In essence, the model is trained so that gradient steps are highly productive at adapting the model parameters to a new enviroment.<br />
<br />
= Probabilistic Framework for Meta-Learning =<br />
<br />
This paper puts the meta-learning problem into a Markov Decision Process (MDP) framework common to RL, see Figure 1a. Instead of training examples <math>\{(x, y)\}</math>, we have trajectories <math>\tau = (x_0, a_1, x_1, R_1, a_2,x_2,R_2, ... a_H, x_H, R_H)</math>. A trajectory is sequence of states/observations <math>x_t</math>, actions <math>a_t</math> and rewards <math>R_t</math> that is sampled from a task <math> T </math> according to a policy <math>\pi_{\theta}</math>. Included with said task is a method for assigning loss values to trajectories <math>L_T(\tau)</math> which is typically the negative cumulative reward. A policy is a deterministic function that takes in a state and returns an action. Our goal here is to train a policy <math>\pi_{\theta}</math> with parameter vector <math>\theta</math>. This is analougous to training a function <math>f_{\theta}</math> that assigns labels <math>y</math> to feature vectors <math>x</math>. More precisely we have the following definitions:<br />
<br />
* <math>\tau = (x_0, a_1, x_1, R_1, x_2, ... a_H, x_H, R_H)</math> trajectories.<br />
* <math>T :=(L_T, P_T(x), P_T(x_t | x_{t-1}, a_{t-1}), H )</math> (A Task)<br />
* <math>D(T)</math> : A distribution over tasks.<br />
* <math>L_T</math>: A loss function for the task T that assigns numeric loss values to trajectories.<br />
* <math>P_T(x), P_T(x_t | x_{t-1}, a_{t-1})</math>: Probability measures specifying the markovian dynamics of the observations <math>x_t</math><br />
* <math>H</math>: The horizon of the MDP. This is a fixed natural number specifying the lengths of the tasks trajectories.<br />
<br />
The paper goes further to define a Markov dynamic for sequences of tasks as shown in Figure 1b. Thus the policy that we would like to meta learn <math>\pi_{\theta}</math>, after being exposed to a sample of K trajectories <math>\tau_\theta^{1:K}</math> from the task <math>T_i</math>, should produce a new policy <math>\pi_{\phi}</math> that will perform well on the next task <math>T_{i+1}</math>. Thus we seek to minimize the following expectation:<br />
<br />
<math>\mathrm{E}_{P(T_0), P(T_{i+1} | T_i)}\bigg(\sum_{i=1}^{l} \mathcal{L}_{T_i, T_{i+1}}(\theta)\bigg)</math>, <br />
<br />
where <math>\mathcal{L}_{T_i, T_{i + 1}}(\theta) := \mathrm{E}_{\tau_{i, \theta}^{1:K} } \bigg( \mathrm{E}_{\tau_{i+1, \phi}}\Big( L_{T_{i+1}}(\tau_{i+1, \phi} | \tau_{i, \theta}^{1:K}, \theta) \Big) \bigg) </math> and <math>l</math> is the number of tasks.<br />
<br />
The meta-policy <math>\pi_{\theta}</math> is trained and then adapted at test time using the following procedures. The computational graph is given in Figure 1c.<br />
<br />
[[File:MAML2.png | 800px]]<br />
<br />
The mathematics of calculating loss gradients is omitted.<br />
<br />
= Training Spiders to Run with Dynamic Handicaps (Robotic Locomotion in Non-Stationary Environments) =<br />
<br />
The authors used the MuJoCo physics simulator to create a simulated environment where robotic spiders with 6 legs are faced with the task of running due east as quickly as possible. The robotic spider observes the location and velocity of its body, and the angles and velocities of its legs. It interacts with the environment by exerting torque on the joints of its legs. Each leg has two joints, the joint closer to the body rotates horizontally while the joint farther from the body rotates vertically. The environment is made non-stationary by gradually paralyzing two legs of the spider across training and testing episodes. This allows the agent to adapt to new environments in each episode.<br />
Putting this example into the above probabilistic framework yields:<br />
<br />
* <math>T_i</math>: The task of walking east with the torques of two legs scaled by <math> (i-1)/6 </math><br />
* <math>\{T_i\}_{i=1}^{7}</math>: A sequence of tasks with the same two legs handicapped in each task. Note there are 15 different ways to choose such legs resulting in 15 sequences of tasks. 12 are used for training and 3 for testing.<br />
* A Markov Descision process composed of<br />
** Observations <math> x_t </math> containing information about the state of the spider.<br />
** Actions <math> a_t </math> containing information about the torques to apply to the spiders legs.<br />
** Rewards <math> R_t </math> corresponding to the speed at which the spider is moving east.<br />
<br />
Three differently structured policy neural networks are trained in this set up using both meta-learning and three different previously developed adaption methods.<br />
<br />
At testing time, the spiders following meta learned policies initially perform worse than the spiders using non-adaptive policies. However, by the third episode (<math> i=3 </math>), the meta-learners perform on par. And by the sixth episode, when the selected legs are mostly immobile, the meta-learners significantly out perform. These results can be seen in the graphs below.<br />
<br />
[[File:locomotion_results.png | 800px]]<br />
<br />
= Training Spiders to Fight Each Other (Adversarial Meta-Learning) =<br />
<br />
The authors created an adversarial environment called RoboSumo where pairs of agents with 4 (named Ants), 6 (named Bugs),or 8 legs (named spiders) sumo wrestle. The agents observe the location and velocity of their bodies and the bodies of their opponent, the angles and velocities of their legs, and the forces being exerted on them by their opponent (equivalent of tactile sense). The game is organized into episodes and rounds. Episodes are single wrestling matches with 500 timesteps and win/lose/draw outcomes. Agents win by pushing their opponent out of the ring or making their opponent's body touch the ground. An episode results in a draw when neither of these things happen after 500 timesteps. Rounds are batches of episodes. Rounds have possible outcomes win, lose, and draw that are decided based on majority of episodes won. K rounds will be fought. Both agents may update their policies between rounds. The agent that wins the majority of rounds is deemed the winner of the game.<br />
<br />
== Setup ==<br />
Similar to the Robotic locomotion example, this game can be phrased in terms of the RL MDP framework.<br />
<br />
* <math>T_i</math>: The task of fighting a round.<br />
* <math>\{T_i\}_{i=1}^{K}</math>: A sequence of rounds against the same opponent. Note that the opponent may update their policy between rounds but the anatomy of both wrestlers will be constant across rounds.<br />
* A Markov Descision process composed of<br />
** A horizon <math>H = 500*n</math> where <math>n</math> is the number of episodes per round.<br />
** Observations <math> x_t </math> containing information about the state of the agent and its opponent.<br />
** Actions <math> a_t </math> containing information about the torques to apply to the agents legs.<br />
** Rewards <math> R_t </math> rewards given to the agent based on its wrestling performance. <math>R_{500*n} = </math> +2000 if win episode, -2000 if lose, and -1000 if draw. A discount factor of <math> \gamma = 0.995 </math> is applied to the rewards.<br />
<br />
Note that the above reward set up is quite sparse. Therefore in order to encourage fast training, rewards are introduced at every time step for the following.<br />
* For staying close to the center of the ring.<br />
* For exerting force on the opponents body.<br />
* For moving towards the opponent.<br />
* For the distance of the opponent to the center of the ring.<br />
<br />
In addition to the sparse win/lose rewards, the following dense rewards are also introduced in the early training stages to encourage faster learning:<br />
* Quickly push the opponent outside - penalty proportional to the distance of there opponent from the center of the ring.<br />
* Moving towards the opponent - reward proportional to the velocity component towards the opponent.<br />
* Hit the opponent - reward proportional to square root of the total forces exerted on the opponent.<br />
* Control penalty - penalty denoted by <math> l_2 </math> on actions which lead to jittery/unnatural movements.<br />
<br />
<br />
This makes sense intuitively as these are reasonable goals for agents to explore when they are learning to wrestle.<br />
<br />
== Training ==<br />
The same combinations of policy networks and adaptation methods that were used in the locomotion example are trained and tested here. A family of non-adaptive policies are first trained via self-play and saved at all stages. Self-play simply means the two agents in the training environment use the same policy. All policy versions are saved so that agents of various skill levels can be sampled when training meta-learners. The weights of the different insects were calibrated such that the test win rate between two insects of differing anatomy, who have been trained for the same number of epochs via self-play, is close to 50%.<br />
<br />
[[File:weight_cal.png | 800px]]<br />
<br />
We can see in the above figure that the weight of the spider had to be increased by almost four times in order for the agents to be evenly matched.<br />
<br />
[[File:robosumo_results.png | 800px]]<br />
<br />
The above figure shows testing results for various adaptation strategies. The agent and opponent both start with the self-trained policies. The opponent uses all of its testing experience to continue training. The agent uses only the last 75 episodes to adapt its policy network. This shows that metal learners need only a limited amount of experience in order to hold their own against a constantly improving opponent.<br />
<br />
== Evaluating Adaptation Strategies ==<br />
To compare the proposed strategy against existing strategies, the authors leveraged the Robosumo task to have the different adaptation strategies compete against each other. The scoring metric used is TrueSkill, a metric very similar to ELO. Each strategy starts out with a fixed amount of points. When two different strategies compete the winner gains points and the loser loses points. The number of points gained and lost depend on the different in TrueSkill. A lower ranked opponent defeating a high ranked opponent will gain much more points than a high ranked opponent defeating a lower ranked one. It is assumed that agents skill levels are initially normally distributed with mean 25, and variance 25/3. 1000 matches are generated with each match consisting of 100-round iterated adaptation games. TrueSkill scores are updated after every match. The distribution of TrueSkill after playing 1000 matches are shown below:<br />
<br />
[[File:trueskill_results.PNG | 800px]]<br />
<br />
In summary recurrent policies are dominant, and the proposed meta-update performs better than or at par with the other strategies.<br />
<br />
The authors also demonstrated the same result by starting with 1000 agents, uniformly distributed in terms of adaptation strategies, and randomly matching them against each other for several generations. The loser of each match is removed and replaced with a duplicate of the winner. After several generations, the total population consists mostly of the proposed strategy. This is shown in the figure below.<br />
<br />
[[File:natural_selection.png | 800px]]<br />
<br />
= Future Work =<br />
The authors noted that the meta-learning adaptation rule they proposed is similar to backpropagation through time with a unit time lag, so a potential area for future research would be to introduce fully-recurrent meta-updates based on the full interaction history with the environment. Secondly, the algorithm proposed involves computing second-order derivatives at training time (see Figure 1c), which resulted in much slower training processes compared to baseline models during experiments, so they suggested finding a method to utilize information from the loss function without explicit backpropagation to speed up computations. The authors also mention that their approach likely will not work well with sparse rewards. This is because the meta-updates, which use policy gradients, are very dependent on the reward signal. They mention that this is an issue they would like to address in the future. A potential solution they have outlined for this is to introduce auxiliary dense rewards which could enable meta-learning.<br />
<br />
= Sources =<br />
# Chelsea Finn, Pieter Abbeel, Sergey Levine. "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks." arXiv preprint arXiv:1703.03400v3 (2017).<br />
# Richard S Sutton, Anna Koop, and David Silver. On the role of tracking in stationary environments. In Proceedings of the 24th international conference on Machine learning, pp. 871–878. ACM, 2007.</div>F7xiahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Label-Free_Supervision_of_Neural_Networks_with_Physics_and_Domain_Knowledge&diff=36454Label-Free Supervision of Neural Networks with Physics and Domain Knowledge2018-04-21T04:04:06Z<p>F7xia: </p>
<hr />
<div>== Introduction ==<br />
The requirement of large amounts of labeled training data limits the applications of machine learning. Neural networks, in particular, require large amounts of labeled data to work (LeCun, Bengio, and Hinton 2015[1]). Humans are often able to instead learn from high level instructions for how a task should be performed, or what the final result should look like. This work explores whether a similar principle can be applied to teaching machines: can we supervise networks without individual examples by instead describing only the structure of desired outputs?<br />
<br />
[[File:c433li-1.png|300px|center]]<br />
<br />
Unsupervised learning methods such as autoencoders, also aim to uncover hidden structure in the data without having access to any label. Such systems succeed in producing highly compressed, yet informative representations of the inputs (Kingma and Welling 2013; Le 2013). However, these representations differ from ours as they are not explicitly constrained to have a particular meaning or semantics. This paper attempts to explicitly provide the semantics of the hidden variables we hope to discover, but still train without labels by learning from constraints that are known to hold according to prior domain knowledge. By training without direct examples of the values our hidden (output) variables take, several advantages are gained over traditional supervised learning, including:<br />
* a reduction in the amount of work spent labeling, <br />
* an increase in generality, as a single set of constraints can be applied to multiple data sets without relabeling.<br />
<br />
The primary contribution in the paper is to demonstrate how constraint learning can be used to train neural networks, and to explore how to learn useful feature representations from raw data while avoiding trivial, low entropy solutions.<br />
<br />
== Problem Setup ==<br />
In a traditional supervised learning setting, we are given a training set <math>D=\{(x_1, y_1), \cdots, (x_n, y_n)\}</math> of <math>n</math> training examples. Each example is a pair <math>(x_i,y_i)</math> formed by an instance <math>x_i \in X</math> and the corresponding output (label) <math>y_i \in Y</math>. The goal is to learn a function <math>f: X \rightarrow Y</math> mapping inputs to outputs. To quantify performance, a loss function <math>\ell:Y \times Y \rightarrow \mathbb{R}</math> is provided, and a mapping is found via <br />
<br />
<center><math> f^* = \text{argmin}_{f \in \mathcal{F}} \sum_{i=1}^n \ell(f(x_i),y_i) </math></center><br />
<br />
where the optimization is over a pre-defined class of functions <math>\mathcal{F}</math> (hypothesis class). In our case, <math>\mathcal{F}</math> will be (convolutional) neural networks parameterized by their weights. The loss could be for example <math>\ell(f(x_i),y_i) = 1[f(x_i) \neq y_i]</math>. By restricting the space of possible functions specifying the hypothesis class <math>\mathcal{F}</math>, we are leveraging prior knowledge about the specific problem we are trying to solve. Informally, the so-called No Free Lunch Theorems state that every machine learning algorithm must make such assumptions in order to work. Another common way in which a modeler incorporates prior knowledge is by specifying an a-priori preference for certain functions in <math>\mathcal{F}</math>, incorporating a regularization term <math>R:\mathcal{F} \rightarrow \mathbb{R}</math>, and solving for <math> f^* = argmin_{f \in \mathcal{F}} \sum_{i=1}^n \ell(f(x_i),y_i) + R(f)</math>. Typically, the regularization term <math>R:\mathcal{F} \rightarrow \mathbb{R}</math> specifies a preference for "simpler" functions (Occam's razor) to prevent overfitting the model on the training data.<br />
<br />
The focus is on the set of problems/domains where the problem is a complex environment having a complex representation of the output space, for example mapping an input image to the height of an object(since this leads to a complex output space) rather than simple binary classification problem.<br />
<br />
In this paper, prior knowledge on the structure of the outputs is modeled by providing a weighted constraint function <math>g:X \times Y \rightarrow \mathbb{R}</math>, used to penalize “structures” that are not consistent with our prior knowledge. And whether this weak form of supervision is sufficient to learn interesting functions is explored. While one clearly needs labels <math>y</math> to evaluate <math>f^*</math>, labels may not be necessary to discover <math>f^*</math>. If prior knowledge informs us that outputs of <math>f^*</math> have other unique properties among functions in <math>\mathcal{F}</math>, we may use these properties for training rather than direct examples <math>y</math>. <br />
<br />
Specifically, an unsupervised approach where the labels <math>y_i</math> are not provided to us is considered, where a necessary property of the output <math>g</math> is optimized instead.<br />
<center><math>\hat{f}^* = \text{argmin}_{f \in \mathcal{F}} \sum_{i=1}^n g(x_i,f(x_i))+ R(f) </math></center><br />
<br />
If the optimizing the above equation is sufficient to find <math>\hat{f}^*</math>, we can use it in replace of labels. If it's not sufficient, additional regularization terms are added. The idea is illustrated with three examples, as described in the next section.<br />
<br />
== Experiments ==<br />
<br />
In this paper, the author introduced three contexts to map from inputs to outputs, without providing direct examples of those outputs. In the first two experiments (tracking an object in free fall and tracking the position of a walking man), the author constructed mappings from an image to the location of an object it contains. Learning is made possible by exploiting structure that holds in images over time. In third experiment (detecting objects with causal relationships), the author mapped an image to two boolean variables describing whether or not the image contains two special objects. While learning exploits the unique causal semantics existing between these objects. <br />
<br />
=== Tracking an object in free fall ===<br />
In the first experiment, they record videos of an object being thrown across the field of view, and aim to learn the object's height in each frame. The dataset used as released by the authors can be found at [3]. The goal is to obtain a regression network mapping from <math>{R^{\text{height} \times \text{width} \times 3}} \rightarrow \mathbb{R}</math>, where <math>\text{height}</math> and <math>\text{width}</math> are the number of vertical and horizontal pixels per frame, and each pixel has 3 color channels. This network is trained as a structured prediction problem operating on a sequence of <math>N</math> images to produce a sequence of <math>N</math> heights, <math>\left(R^{\text{height} \times \text{width} \times 3} \right)^N \rightarrow \mathbb{R}^N</math>, and each piece of data <math>x_i</math> will be a vector of images, <math>\mathbf{x}</math>.<br />
Rather than supervising the network with direct labels, <math>\mathbf{y} \in \mathbb{R}^N</math>, the network is instead supervised to find an object obeying the elementary physics of free falling objects. An object acting under gravity will have a fixed acceleration of <math>a = -9.8 m / s^2</math>, and the plot of the object's height over time will form a parabola:<br />
<center><math>\mathbf{y}_i = y_0 + v_0(i\Delta t) + \frac{1}{2} a(i\Delta t)^2</math></center><br />
<br />
The idea is, given any trajectory of <math>N</math> height predictions, <math>f(\mathbf{x})</math>, we fit a parabola with fixed curvature to those predictions, and minimize the resulting residual. Formally, if we specify <math>\mathbf{a} = [\frac{1}{2} a\Delta t^2, \frac{1}{2} a(2 \Delta t)^2, \ldots, \frac{1}{2} a(N \Delta t)^2]</math>, the prediction produced by the fitted parabola is:<br />
<center><math> \text{argmin}_{v_0, y_0}\sum_i(y_i-y_0-v_0(i\Delta_t)-\frac{1}{2}a(i\Delta_t)^2) </math></center><br />
By the solution of ordinary least square estimation: <br />
<center><math> \mathbf{\hat{y}} = \mathbf{a} + \mathbf{A} (\mathbf{A}^T\mathbf{A})^{-1} \mathbf{A}^T (f(\mathbf{x}) - \mathbf{a}) </math></center><br />
<br />
where<br />
<center><br />
<math><br />
\mathbf{A} = <br />
\left[ {\begin{array}{*{20}c}<br />
\Delta t & 1 \\<br />
2\Delta t & 1 \\<br />
3\Delta t & 1 \\<br />
\vdots & \vdots \\<br />
N\Delta t & 1 \\<br />
\end{array} } \right]<br />
</math><br />
</center><br />
<br />
The constraint loss is then defined as<br />
<center><math>g(\mathbf{x},f(\mathbf{x})) = g(f(\mathbf{x})) = \sum_{i=1}^{N} |\mathbf{\hat{y}}_i - f(\mathbf{x})_i|</math></center><br />
<br />
Note that <math>\hat{y}</math> is not the ground truth labels. Because <math>g</math> is differentiable almost everywhere, it can be optimized with SGD. They find that when combined with existing regularization methods for neural networks, this optimization is sufficient to recover <math>f^*</math> up to an additive constant <math>C</math> (specifying what object height corresponds to 0).<br />
<br />
[[File:c433li-2.png|650px|center]]<br />
<br />
The data set is collected on a laptop webcam running at 10 frames per second (<math>\Delta t = 0.1s</math>). The camera position is fixed and 65 diverse trajectories of the object in flight, totalling 602 images are recorded. For each trajectory, the network is trained on randomly selected intervals of <math>N=5</math> contiguous frames. Images are resized to <math>56 \times 56</math> pixels before going into a small, randomly initialized neural network with no pretraining. The network consists of 3 Conv/ReLU/MaxPool blocks followed by 2 Fully Connected/ReLU layers with probability 0.5 dropout and a single regression output.<br />
<br />
Since scaling the <math>y_0</math> and <math>v_0</math> results in the same constraint loss <math>g</math>, the authors evaluate the result by the correlation of predicted heights with ground truth pixel measurements. This method was used since the distance from the object to the camera could not be accurately recorded, and this distance is required to calculate the height in meters. This is not a bullet proof evaluation, and is discussed in further detail in the critique section. The results are compared to a supervised network trained with the labels to directly predict the height of the object in pixels. The supervised learning task is viewed as a substantially easier task. From this knowledge we can see from the table below that, under their evaluation criteria, the result performs well.<br />
<br />
==== Evaluation ====<br />
{| class="wikitable"<br />
|-<br />
! scope="col" | Method !! scope="col" | Random Uniform Output !! scope="col" | Supervised with Labels !! scope="col" | Approach in this Paper<br />
|-<br />
! scope="row" | Correlation <br />
| 12.1% || 94.5% || 90.1%<br />
|}<br />
<br />
=== Tracking the position of a walking man ===<br />
In the second experiment, they aim to detect the horizontal position of a person walking across a frame without providing direct labels <math>y \in \mathbb{R}</math> by exploiting the assumption that the person will be walking at a constant velocity over short periods of time. This is formulated as a structured prediction problem <math>f: \left(R^{\text{height} \times \text{width} \times 3} \right)^N \rightarrow \mathbb{R}^N</math>, and each training instances <math>x_i</math> are a vector of images, <math>\mathbf{x}</math>, being mapped to a sequence of predictions, <math>\mathbf{y}</math>. Given the similarities to the first experiment with free falling objects, we might hope to simply remove the gravity term from equation and retrain. However, in this case, that is not possible, as the constraint provides a necessary, but not sufficient, condition for convergence.<br />
<br />
Given any sequence of correct outputs, <math>(\mathbf{y}_1, \ldots, \mathbf{y}_N)</math>, the modified sequence, <math>(\lambda * \mathbf{y}_1 + C, \ldots, \lambda * \mathbf{y}_N + C)</math> (<math>\lambda, C \in \mathbb{R}</math>) will also satisfy the constant velocity constraint. In the worst case, when <math>\lambda = 0</math>, <math>f \equiv C</math>, and the network can satisfy the constraint while having no dependence on the image. The trivial output is avoided by adding two two additional loss terms.<br />
<br />
<center><math>h_1(\mathbf{x}) = -\text{std}(f(\mathbf{x}))</math></center><br />
which seeks to maximize the standard deviation of the output, and<br />
<br />
<center><br />
<math>\begin{split}<br />
h_2(\mathbf{x}) = \hphantom{'} & \text{max}(\text{ReLU}(f(\mathbf{x}) - 10)) \hphantom{\text{ }}+ \\<br />
& \text{max}(\text{ReLU}(0 - f(\mathbf{x})))<br />
\end{split}<br />
</math><br />
</center><br />
which limit the output to a fixed ranged <math>[0, 10]</math>, the final loss is thus:<br />
<br />
<center><br />
<math><br />
\begin{split}<br />
g(\mathbf{x}) = \hphantom{'} & ||(\mathbf{A} (\mathbf{A}^T\mathbf{A})^{-1} \mathbf{A}^T - \mathbf{I}) * f(\mathbf{x})||_1 \hphantom{\text{ }}+ \\<br />
& \gamma_1 * h_1(\mathbf{x}) <br />
\hphantom{\text{ }}+ \\<br />
& \gamma_2 * h_2(\mathbf{x})<br />
% h_2(y) & = \text{max}(\text{ReLU}(y - 10)) + \\<br />
% & \hphantom{=}\hphantom{a} \text{max}(\text{ReLU}(0 - y))<br />
\end{split}<br />
</math><br />
</center><br />
<br />
[[File:c433li-3.png|650px|center]]<br />
<br />
The data set contains 11 trajectories across 6 distinct scenes, totalling 507 images resized to <math>56 \times 56</math>. The network is trained to output linearly consistent positions on 5 strided frames from the first half of each trajectory, and is evaluated on the second half. The boundary violation penalty is set to <math>\gamma_2 = 0.8</math> and the standard deviation bonus is set to <math>\gamma_1 = 0.6</math>.<br />
<br />
As in the previous experiment, the result is evaluated by the correlation with the ground truth. The result is as follow:<br />
==== Evaluation ====<br />
{| class="wikitable"<br />
|-<br />
! scope="col" | Method !! scope="col" | Random Uniform Output !! scope="col" | Supervised with Labels !! scope="col" | Approach in this Paper<br />
|-<br />
! scope="row" | Correlation <br />
| 45.9% || 80.5% || 95.4%<br />
|}<br />
Surprisingly, the approach in this paper beats the same network trained with direct labeled supervision on the test set, which can be attributed to overfitting on the small amount of training data available (as correlation on training data reached 99.8%).<br />
<br />
=== Detecting objects with causal relationships ===<br />
In the previous experiments, the authors explored options for incorporating constraints pertaining to dynamics equations in real-world phenomena, i.e., prior knowledge derived from elementary physics. In this experiment, the authors explore the possibilities of learning from logical constraints imposed on single images. More specifically, they ask whether it is possible to learn from causal phenomena.<br />
<br />
[[File:paper18_Experiment_3.png|400px|center]]<br />
<br />
Here, the authors provide images containing a stochastic collection of up to four characters: Peach, Mario, Yoshi, and Bowser, with each character having small appearance changes across frames due to rotation and reflection. Example images can be seen in Fig. (4). While the existence of objects in each frame is non-deterministic, the generating distribution encodes the underlying phenomenon that Mario will always appear whenever Peach appears. The aim is to create a pair of neural networks <math>f_1, f_2</math> for identifying Peach and Mario, respectively. The networks, <math>f_k : R^{height×width×3} → \{0, 1\}</math>, map the image to the discrete boolean variables, <math>y_1</math> and <math>y_2</math>. Rather than supervising with direct labels, the authors train the networks by constraining their outputs to have the logical relationship <math>y_1 ⇒ y_2</math>. This problem is challenging because the networks must simultaneously learn to recognize the characters and select them according to logical relationships. To avoid the trivial solution <math>y_1 \equiv 1, y_2 \equiv 1</math> on every image, three additional loss terms need to be added:<br />
<br />
<center><math> h_1(\mathbf{x}, k) = \frac{1}{M}\sum_i^M |Pr[f_k(\mathbf{x}) = 1] - Pr[f_k(\rho(\mathbf{x})) = 1]|, </math></center><br />
<br />
which forces rotational independence of the outputs in order to encourage the network to learn the existence, rather than location of objects, <br />
<br />
<center><math> h_2(\mathbf{x}, k) = -\text{std}_{i \in [1 \dots M]}(Pr[f_k(\mathbf{x}_i) = 1]), </math></center><br />
<br />
which seeks high variance outputs, and<br />
<br />
<center><br />
<math> h_3(\mathbf{x}, v) = \frac{1}{M}\sum_i^{M} (Pr[f(\mathbf{x}_i) = v] - \frac{1}{3} + (\frac{1}{3} - \mu_v))^2 \\<br />
\mu_{v} = \frac{1}{M}\sum_i^{M} \mathbb{1}\{v = \text{argmax}_{v' \in \{0, 1\}^2} Pr[f(\mathbf{x}) = v']\}. </math><br />
</center><br />
<br />
which seeks high entropy outputs. The final loss function then becomes: <br />
<br />
<center><br />
<math> \begin{split}<br />
g(\mathbf{x}) & = \mathbb{1}\{f_1(\mathbf{x}) \nRightarrow f_2(\mathbf{x})\} \hphantom{\text{ }} + \\<br />
& \sum_{k \in \{1, 2\}} \gamma_1 h_1(\mathbf{x}, k) + \gamma_2 h_2(\mathbf{x}, k) + <br />
\hspace{-0.7em} \sum_{v \neq \{1,0\}} \hspace{-0.7em} \gamma_3 * h_3(\mathbf{x}, v)<br />
\end{split}<br />
</math><br />
</center><br />
<br />
====Evaluation====<br />
<br />
The input images, shown in Figure 4, are 56 × 56 pixels. The authors used <math>\gamma_1 = 0.65, \gamma_2 = 0.65, \gamma_3 = 0.95</math>, and trained for 4,000 iterations. This experiment demonstrates that networks can learn from constraints that operate over discrete sets with potentially complex logical rules. Removing constraints will cause learning to fail. Thus, the experiment also shows that sophisticated sufficiency conditions can be key to success when learning from constraints.<br />
<br />
== Conclusion and Critique ==<br />
This paper has introduced a method for using physics and other domain constraints to supervise neural networks. However, the approach described in this paper is not entirely new. Similar ideas are already widely used in Q learning, where the Q value are not available, and the network is supervised by the constraint, as in Deep Q learning (Mnih, Riedmiller et al. 2013[2]). In Deep Q-Learning (DQN) also uses a deep neural network which is trained with constraints just like this paper proposes.<br />
<center><math>Q(s,a) = R(r,s) + \gamma \sum_{s' ~ P_{sa}}{\text{max}_{a'}Q(s',a')}</math></center><br />
<br />
<br />
Also, the paper has a mistake where they quote the free fall equation as<br />
<center><math>\mathbf{y}_i = y_0 + v_0(i\Delta t) + a(i\Delta t)^2</math></center><br />
which should be<br />
<center><math>\mathbf{y}_i = y_0 + v_0(i\Delta t) + \frac{1}{2} a(i\Delta t)^2</math></center><br />
Although in this case it doesn't affect the result.<br />
<br />
<br />
For the evaluation of the experiments, correlation with ground truth was used as the metric to avoid the fact that the output can be scaled without affecting the constraint loss, which is fine if the network gives output of the same scale. However, it is possible that, and the network may give output of varying scale for different inputs, in which case, we have no confidence that the network has learnt correctly, although the learnt outcome may be correlated with ground truth strongly. In fact, to solve the scaling issue, an better approach is to combine the constraints introduced in this paper with some labeled training data. It's not clear why the author didn't experiment with a combination of these two losses.<br />
<br />
With regards to the free fall experiment in particular, the authors apply a fixed acceleration model to create the constraint loss, aiming to have the network predict height. However, since they did not measure the true height of the object to create test labels, they evaluate using height in pixel space. They do not mention the accuracy of their camera calibration, nor what camera model was used to remove lens distortion. Since lens distortion tends to be worse at the extreme edges of the image, and that they tossed the pillow throughout the entire frame, it is likely that the ground truth labels were corrupted by distortion. If that is the case, it is possible the supervised network is actually performing worse, because it learning how to predict distorted (beyond a constant scaling factor) heights instead of the true height.<br />
<br />
These methods essentially boil down to generating approximate labels for training data using some knowledge of the dynamic that the labels should follow.<br />
<br />
Finally, this paper only picks examples where the constraints are easy to design, while in some more common tasks such as image classification, what kind of constraints are needed is not straightforward at all.<br />
<br />
== Related Work ==<br />
Constraint learning is a generalization of supervised learning that allows for more creative methods of supervision. For example, multiple-instance learning as proposed by (Dietterich, Lathrop, and Lozano-Pe ́rez 1997; Zhou and Xu 2007) allows for more efficient labeling by providing annotations over groups of images and learning to predict properties that hold over at least one input in a group. In rank learning, labels may given as orderings between inputs with the objective being to find an embedding of inputs that respects the ordering relation (Joachims 2002). Inductive logic programming approaches rely on logical formalisms and constraints to represent background knowledge and learn hypotheses from data (Muggleton and De Raedt 1994; De Raedt 2008; De Raedt and Kersting 2003). Various types of constraints have also been used extensively to guide unsupervised learning algorithms, such as clustering and dimensionality reduction techniques (Lee and Seung 2001; Basu, Davidson, and Wagstaff 2008; Zhi et al. 2013; Ermon et al. 2015).<br />
<br />
== References ==<br />
[1] LeCun, Y.; Bengio, Y.; and Hinton, G. 2015. Deep learning. Nature 521(7553):436–444.<br />
<br />
[2] Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; and Riedmiller, M. 2013. Playing Atari with Deep Reinforcement Learning. arxiv 1312.5602.<br />
<br />
[3] “Russell91/Labelfree.” GitHub, github.com/russell91/labelfree.</div>F7xiahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Implicit_Causal_Models_for_Genome-wide_Association_Studies&diff=36451stat946w18/Implicit Causal Models for Genome-wide Association Studies2018-04-21T03:48:25Z<p>F7xia: /* Implicit causal model in Edward */</p>
<hr />
<div>==Introduction and Motivation==<br />
There is currently much progress in probabilistic models which could lead to the development of rich generative models. The models have been applied with neural networks, implicit densities, and with scalable algorithms to very large data for their Bayesian inference. However, most of the models are focused on capturing statistical relationships rather than causal relationships. Causal relationships are relationships where one event is a result of another event, i.e. a cause and effect. Causal models give us a sense of how manipulating the generative process could change the final results. <br />
<br />
Genome-wide association studies (GWAS) are examples of causal relationships. Genome is basically the sum of all DNAs in an organism and contain information about the organism's attributes. Specifically, GWAS is about figuring out how genetic factors cause disease among humans. Here the genetic factors we are referring to are single nucleotide polymorphisms (SNPs), and getting a particular disease is treated as a trait, i.e., the outcome. In order to know about the reason of developing a disease and to cure it, the causation between SNPs and diseases is investigated: first, predict which one or more SNPs cause the disease; second, target the selected SNPs to cure the disease. <br />
<br />
The figure below depicts an example Manhattan plot for a GWAS. Each dot represents an SNP. The x-axis is the chromosome location, and the y-axis is the negative log of the association p-value between the SNP and the disease, so points with the largest values represent strongly associated risk loci.<br />
<br />
[[File:gwas-example.jpg|500px|center]]<br />
<br />
This paper focuses on two challenges to combining modern probabilistic models and causality. The first one is how to build rich causal models with specific needs by GWAS. In general, probabilistic causal models involve a function <math>f</math> and a noise <math>n</math>. For working simplicity, we usually assume <math>f</math> as a linear model with Gaussian noise. However problems like GWAS require models with nonlinear, learnable interactions among the inputs and the noise.<br />
<br />
The second challenge is how to address latent population-based confounders. Latent confounders are issues when we apply the causal models since we cannot observe them nor know the underlying structure. For example, in GWAS, both latent population structure, i.e., subgroups in the population with ancestry differences, and relatedness among sample individuals produce spurious correlations among SNPs to the trait of interest. The existing methods cannot easily accommodate the complex latent structure.<br />
<br />
For the first challenge, the authors develop implicit causal models, a class of causal models that leverages neural architectures with an implicit density. With GWAS, implicit causal models generalize previous methods to capture important nonlinearities, such as gene-gene and gene-population interaction. Building on this, for the second challenge, they describe an implicit causal model that adjusts for population-confounders by sharing strength across examples (genes).<br />
<br />
There has been an increasing number of works on causal models which focus on causal discovery and typically have strong assumptions such as Gaussian processes on noise variable or nonlinearities for the main function.<br />
<br />
==Implicit Causal Models==<br />
Implicit causal models are an extension of probabilistic causal models. Probabilistic causal models will be introduced first.<br />
<br />
=== Probabilistic Causal Models ===<br />
Probabilistic causal models have two parts: deterministic functions of noise and other variables. Consider background noise <math>\epsilon</math>, representing unknown background quantities which are jointly independent and global variable <math>\beta</math>, some function of this noise, where<br />
<br />
[[File: eq1.1.png|800px|center]]<br />
<br />
Each <math>\beta</math> and <math>x</math> is a function of noise; <math>y</math> is a function of noise and <math>x</math>，<br />
<br />
[[File: eqt1.png|800px|center]]<br />
<br />
The target is the causal mechanism <math>f_y</math> so that the causal effect <math>p(y|do(X=x),\beta)</math> can be calculated. <math>do(X=x)</math> means that we specify a value of <math>X</math> under the fixed structure <math>\beta</math>. By other paper’s work, it is assumed that <math>p(y|do(x),\beta) = p(y|x, \beta)</math>.<br />
<br />
[[File: f_1.png|650px|center|]]<br />
<br />
<br />
An example of probabilistic causal models is additive noise model. <br />
<br />
[[File: eq2.1.png|800px|center]]<br />
<br />
<math>f(.)</math> is usually a linear function or spline functions for nonlinearities. <math>\epsilon</math> is assumed to be standard normal, as well as <math>y</math>. Thus the posterior <math>p(\theta | x, y, \beta)</math> can be represented as <br />
<br />
[[File: eqt2.png|800px|center]]<br />
<br />
where <math>p(\theta)</math> is the prior which is known. Then, variational inference or MCMC can be applied to calculate the posterior distribution.<br />
<br />
===Implicit Causal Models===<br />
The difference between implicit causal models and probabilistic causal models is the noise variable. Instead of using an additive noise term, implicit causal models directly take noise <math>\epsilon</math> as input and outputs <math>x</math> given parameter <math>\theta</math>.<br />
<br />
<math><br />
x=g(\epsilon | \theta), \epsilon \tilde s(\cdot)<br />
</math><br />
<br />
The causal diagram has changed to:<br />
<br />
[[File: f_2.png|650px|center|]]<br />
<br />
<br />
They used fully connected neural network with a fair amount of hidden units to approximate each causal mechanism. Below is the formal description: <br />
<br />
[[File: theorem.png|650px|center|]]<br />
<br />
==Implicit Causal Models with Latent Confounders==<br />
Previously, they assumed the global structure is observed. Next, the unobserved scenario is being considered.<br />
<br />
===Causal Inference with a Latent Confounder===<br />
Similar to before, the interest is the causal effect <math>p(y|do(x_m), x_{-m})</math>. Here, the SNPs other than <math>x_m</math> is also under consideration. However, it is confounded by the unobserved confounder <math>z_n</math>. As a result, the standard inference method cannot be used in this case.<br />
<br />
The paper proposed a new method which include the latent confounders. For each subject <math>n=1,…,N</math> and each SNP <math>m=1,…,M</math>,<br />
<br />
[[File: eqt4.png|800px|center]]<br />
<br />
<br />
The mechanism for latent confounder <math>z_n</math> is assumed to be known. SNPs depend on the confounders and the trait depends on all the SNPs and the confounders as well. <br />
<br />
The posterior of <math>\theta</math> is needed to be calculate in order to estimate the mechanism <math>g_y</math> as well as the causal effect <math>p(y|do(x_m), x_{-m})</math>, so that it can be explained how changes to each SNP <math>X_m</math> cause changes to the trait <math>Y</math>.<br />
<br />
[[File: eqt5.png|800px|center]]<br />
<br />
Note that the latent structure <math>p(z|x, y)</math> is assumed known.<br />
<br />
In general, causal inference with latent confounders can be dangerous: it uses the data twice, and thus it may bias the estimates of each arrow <math>X_m → Y</math>. Why is this justified? This is answered below:<br />
<br />
'''Proposition 1'''. Assume the causal graph of Figure 2 (left) is correct and that the true distribution resides in some configuration of the parameters of the causal model (Figure 2 (right)). Then the posterior <math>p(θ | x, y)<br />
</math> provides a consistent estimator of the causal mechanism <math>f_y</math>.<br />
<br />
Proposition 1 rigorizes previous methods in the framework of probabilistic causal models. The intuition is that as more SNPs arrive (“M → ∞, N fixed”), the posterior concentrates at the true confounders <math>z_n</math>, and thus we can estimate the causal mechanism given each data point’s confounder <math>z_n</math>. As more data points arrive (“N → ∞, M fixed”), we can estimate the causal mechanism given any confounder <math>z_n</math> as there is an infinity of them.<br />
<br />
===Implicit Causal Model with a Latent Confounder===<br />
This section is the algorithm and functions to implementing an implicit causal model for GWAS.<br />
<br />
====Generative Process of Confounders <math>z_n</math>.====<br />
The distribution of confounders is set as standard normal. <math>z_n \in R^K</math> , where <math>K</math> is the dimension of <math>z_n</math> and <math>K</math> should make the latent space as close as possible to the true population structural. <br />
<br />
====Generative Process of SNPs <math>x_{nm}</math>.====<br />
Given SNP is coded for,<br />
<br />
[[File: SNP.png|300px|center]]<br />
<br />
The authors defined a <math>Binomial(2,\pi_{nm})</math> distribution on <math>x_{nm}</math>. And used logistic factor analysis to design the SNP matrix.<br />
<br />
[[File: gpx.png|800px|center]]<br />
<br />
A SNP matrix looks like this:<br />
[[File: SNP_matrix.png|200px|center]]<br />
<br />
<br />
Since logistic factor analysis makes strong assumptions, this paper suggests using a neural network to relax these assumptions,<br />
<br />
[[File: gpxnn.png|800px|center]]<br />
<br />
This renders the outputs to be a full <math>N*M</math> matrix due the the variables <math>w_m</math>, which act as principal component in PCA. Here, <math>\phi</math> has a standard normal prior distribution. The weights <math>w</math> and biases <math>\phi</math> are shared over the <math>m</math> SNPs and <math>n</math> individuals, which makes it possible to learn nonlinear interactions between <math>z_n</math> and <math>w_m</math>.<br />
<br />
====Generative Process of Traits <math>y_n</math>.====<br />
Previously, each trait is modeled by a linear regression,<br />
<br />
[[File: gpy.png|800px|center]]<br />
<br />
This also has very strong assumptions on SNPs, interactions, and additive noise. It can also be replaced by a neural network which only outputs a scalar,<br />
<br />
[[File: gpynn.png|800px|center]]<br />
<br />
<br />
==Likelihood-free Variational Inference==<br />
Calculating the posterior of <math>\theta</math> is the key of applying the implicit causal model with latent confounders.<br />
<br />
[[File: eqt5.png|800px|center]]<br />
<br />
could be reduces to <br />
<br />
[[File: lfvi1.png|800px|center]]<br />
<br />
However, with implicit models, integrating over a nonlinear function could be suffered. The authors applied likelihood-free variational inference (LFVI). LFVI proposes a family of distribution over the latent variables. Here the variables <math>w_m</math> and <math>z_n</math> are all assumed to be Normal,<br />
<br />
[[File: lfvi2.png|700px|center]]<br />
<br />
For LFVI applied to GWAS, the algorithm which similar to the EM algorithm has been used:<br />
[[File: em.png|800px|center]]<br />
<br />
==Empirical Study==<br />
The authors performed simulation on 100,000 SNPs, 940 to 5,000 individuals, and across 100 replications of 11 settings. <br />
Four methods were compared: <br />
<br />
* implicit causal model (ICM);<br />
* PCA with linear regression (PCA); <br />
* a linear mixed model (LMM); <br />
* logistic factor analysis with inverse regression (GCAT).<br />
<br />
The feedforward neural networks for traits and SNPs are fully connected with two hidden layers using ReLU activation function, and batch normalization. <br />
<br />
===Simulation Study===<br />
Based on real genomic data, a true model is applied to generate the SNPs and traits for each configuration. <br />
There are four datasets used in this simulation study: <br />
<br />
# HapMap [Balding-Nichols model]<br />
# 1000 Genomes Project (TGP) [PCA]<br />
#* Human Genome Diversity project (HGDP) [PCA]<br />
#* HGDP [Pritchard-Stephens-Donelly model] <br />
# A latent spatial position of individuals for population structure [spatial]<br />
<br />
<br />
The table shows the prediction accuracy. The accuracy is calculated by the rate of the number of true positives divide the number of true positives plus false positives. True positives measure the proportion of positives that are correctly identified as such (e.g. the percentage of SNPs which are correctly identified as having the causal relation with the trait). In contrast, false positives state the SNPs has the causal relation with the trait when they don’t. The closer the rate to 1, the better the model is since false positives are considered as the wrong prediction.<br />
<br />
[[File: table_1.png|650px|center|]]<br />
<br />
The result represented above shows that the implicit causal model has the best performance among these four models in every situation. Especially, other models tend to do poorly on PSD and Spatial when <math>a</math> is small, but the ICM achieved a significantly high rate. The only comparable method to ICM is GCAT, when applying to simpler configurations.<br />
<br />
<br />
===Real-data Analysis===<br />
They also applied ICM to GWAS of Northern Finland Birth Cohorts, which measure 10 metabolic traits and also contain 324,160 SNPs and 5,027 individuals. The data came from the database of Genotypes and Phenotypes (dbGaP) and used the same preprocessing as Song et al. Ten implicit causal models were fitted, one for each trait to be modeled. For each of the 10 implicit causal models the dimension of the counfounders was set to be six, same as what was used in the paper by Song. The SNP network used 512 hidden units in both layers and the trait network used 32 and 256. et al. for comparable models in Table 2.<br />
<br />
[[File: table_2.png|650px|center|]]<br />
<br />
The numbers in the above table are the number of significant loci for each of the 10 traits. The number for other methods, such as GCAT, LMM, PCA, and "uncorrected" (association tests without accounting for hidden relatedness of study samples) are obtained from other papers. By comparison, the ICM reached the level of the best previous model for each trait.<br />
<br />
==Conclusion==<br />
This paper introduced implicit causal models in order to account for nonlinear complex causal relationships, and applied the method to GWAS. It can not only capture important interactions between genes within an individual and among population level, but also can adjust for latent confounders by taking account of the latent variables into the model.<br />
<br />
By the simulation study, the authors proved that the implicit causal model could beat other methods by 15-45.3% on a variety of datasets with variations on parameters.<br />
<br />
The authors also believed this GWAS application is only the start of the usage of implicit causal models. The authors suggest that it might also be successfully used in the design of dynamic theories in high-energy physics or for modeling discrete choices in economics.<br />
<br />
==Critique==<br />
This paper is an interesting and novel work. The main contribution of this paper is to connect the statistical genetics and the machine learning methodology. The method is technically sound and does indeed generalize techniques currently used in statistical genetics.<br />
<br />
The neural network used in this paper is a very simple feed-forward 2 hidden-layer neural network, but the idea of where to use the neural network is crucial and might be significant in GWAS. <br />
<br />
It has limitations as well. The empirical example in this paper is too easy, and far away from the realistic situation. Despite the simulation study showing some competing results, the Northern Finland Birth Cohort Data application did not demonstrate the advantage of using implicit causal model over the previous methods, such as GCAT or LMM.<br />
<br />
Another limitation is about linkage disequilibrium as the authors stated as well. SNPs are not completely independent of each other; usually, they have correlations when the alleles at close locus. They did not consider this complex case, rather they only considered the simplest case where they assumed all the SNPs are independent.<br />
<br />
Furthermore, one SNP maybe does not have enough power to explain the causal relationship. Recent papers indicate that causation to a trait may involve multiple SNPs.<br />
This could be a future work as well.<br />
<br />
==References==<br />
Tran D, Blei D M. Implicit Causal Models for Genome-wide Association Studies[J]. arXiv preprint arXiv:1710.10742, 2017.<br />
<br />
Patrik O Hoyer, Dominik Janzing, Joris M Mooij, Jonas Peters, and Prof Bernhard Schölkopf. Non- linear causal discovery with additive noise models. In Neural Information Processing Systems, 2009.<br />
<br />
Alkes L Price, Nick J Patterson, Robert M Plenge, Michael E Weinblatt, Nancy A Shadick, and David Reich. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics, 38(8):904–909, 2006.<br />
<br />
Minsun Song, Wei Hao, and John D Storey. Testing for genetic associations in arbitrarily structured populations. Nature, 47(5):550–554, 2015.<br />
<br />
Dustin Tran, Rajesh Ranganath, and David M Blei. Hierarchical implicit models and likelihood-free variational inference. In Neural Information Processing Systems, 2017.<br />
<br />
== Implicit causal model in Edward ==<br />
The author provides an example of an implicit causal model written in the Edward language.<br />
[[File: coddde.png|600px]]</div>F7xiahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Implicit_Causal_Models_for_Genome-wide_Association_Studies&diff=36450stat946w18/Implicit Causal Models for Genome-wide Association Studies2018-04-21T03:47:34Z<p>F7xia: /* Implicit causal model in Edward */</p>
<hr />
<div>==Introduction and Motivation==<br />
There is currently much progress in probabilistic models which could lead to the development of rich generative models. The models have been applied with neural networks, implicit densities, and with scalable algorithms to very large data for their Bayesian inference. However, most of the models are focused on capturing statistical relationships rather than causal relationships. Causal relationships are relationships where one event is a result of another event, i.e. a cause and effect. Causal models give us a sense of how manipulating the generative process could change the final results. <br />
<br />
Genome-wide association studies (GWAS) are examples of causal relationships. Genome is basically the sum of all DNAs in an organism and contain information about the organism's attributes. Specifically, GWAS is about figuring out how genetic factors cause disease among humans. Here the genetic factors we are referring to are single nucleotide polymorphisms (SNPs), and getting a particular disease is treated as a trait, i.e., the outcome. In order to know about the reason of developing a disease and to cure it, the causation between SNPs and diseases is investigated: first, predict which one or more SNPs cause the disease; second, target the selected SNPs to cure the disease. <br />
<br />
The figure below depicts an example Manhattan plot for a GWAS. Each dot represents an SNP. The x-axis is the chromosome location, and the y-axis is the negative log of the association p-value between the SNP and the disease, so points with the largest values represent strongly associated risk loci.<br />
<br />
[[File:gwas-example.jpg|500px|center]]<br />
<br />
This paper focuses on two challenges to combining modern probabilistic models and causality. The first one is how to build rich causal models with specific needs by GWAS. In general, probabilistic causal models involve a function <math>f</math> and a noise <math>n</math>. For working simplicity, we usually assume <math>f</math> as a linear model with Gaussian noise. However problems like GWAS require models with nonlinear, learnable interactions among the inputs and the noise.<br />
<br />
The second challenge is how to address latent population-based confounders. Latent confounders are issues when we apply the causal models since we cannot observe them nor know the underlying structure. For example, in GWAS, both latent population structure, i.e., subgroups in the population with ancestry differences, and relatedness among sample individuals produce spurious correlations among SNPs to the trait of interest. The existing methods cannot easily accommodate the complex latent structure.<br />
<br />
For the first challenge, the authors develop implicit causal models, a class of causal models that leverages neural architectures with an implicit density. With GWAS, implicit causal models generalize previous methods to capture important nonlinearities, such as gene-gene and gene-population interaction. Building on this, for the second challenge, they describe an implicit causal model that adjusts for population-confounders by sharing strength across examples (genes).<br />
<br />
There has been an increasing number of works on causal models which focus on causal discovery and typically have strong assumptions such as Gaussian processes on noise variable or nonlinearities for the main function.<br />
<br />
==Implicit Causal Models==<br />
Implicit causal models are an extension of probabilistic causal models. Probabilistic causal models will be introduced first.<br />
<br />
=== Probabilistic Causal Models ===<br />
Probabilistic causal models have two parts: deterministic functions of noise and other variables. Consider background noise <math>\epsilon</math>, representing unknown background quantities which are jointly independent and global variable <math>\beta</math>, some function of this noise, where<br />
<br />
[[File: eq1.1.png|800px|center]]<br />
<br />
Each <math>\beta</math> and <math>x</math> is a function of noise; <math>y</math> is a function of noise and <math>x</math>，<br />
<br />
[[File: eqt1.png|800px|center]]<br />
<br />
The target is the causal mechanism <math>f_y</math> so that the causal effect <math>p(y|do(X=x),\beta)</math> can be calculated. <math>do(X=x)</math> means that we specify a value of <math>X</math> under the fixed structure <math>\beta</math>. By other paper’s work, it is assumed that <math>p(y|do(x),\beta) = p(y|x, \beta)</math>.<br />
<br />
[[File: f_1.png|650px|center|]]<br />
<br />
<br />
An example of probabilistic causal models is additive noise model. <br />
<br />
[[File: eq2.1.png|800px|center]]<br />
<br />
<math>f(.)</math> is usually a linear function or spline functions for nonlinearities. <math>\epsilon</math> is assumed to be standard normal, as well as <math>y</math>. Thus the posterior <math>p(\theta | x, y, \beta)</math> can be represented as <br />
<br />
[[File: eqt2.png|800px|center]]<br />
<br />
where <math>p(\theta)</math> is the prior which is known. Then, variational inference or MCMC can be applied to calculate the posterior distribution.<br />
<br />
===Implicit Causal Models===<br />
The difference between implicit causal models and probabilistic causal models is the noise variable. Instead of using an additive noise term, implicit causal models directly take noise <math>\epsilon</math> as input and outputs <math>x</math> given parameter <math>\theta</math>.<br />
<br />
<math><br />
x=g(\epsilon | \theta), \epsilon \tilde s(\cdot)<br />
</math><br />
<br />
The causal diagram has changed to:<br />
<br />
[[File: f_2.png|650px|center|]]<br />
<br />
<br />
They used fully connected neural network with a fair amount of hidden units to approximate each causal mechanism. Below is the formal description: <br />
<br />
[[File: theorem.png|650px|center|]]<br />
<br />
==Implicit Causal Models with Latent Confounders==<br />
Previously, they assumed the global structure is observed. Next, the unobserved scenario is being considered.<br />
<br />
===Causal Inference with a Latent Confounder===<br />
Similar to before, the interest is the causal effect <math>p(y|do(x_m), x_{-m})</math>. Here, the SNPs other than <math>x_m</math> is also under consideration. However, it is confounded by the unobserved confounder <math>z_n</math>. As a result, the standard inference method cannot be used in this case.<br />
<br />
The paper proposed a new method which include the latent confounders. For each subject <math>n=1,…,N</math> and each SNP <math>m=1,…,M</math>,<br />
<br />
[[File: eqt4.png|800px|center]]<br />
<br />
<br />
The mechanism for latent confounder <math>z_n</math> is assumed to be known. SNPs depend on the confounders and the trait depends on all the SNPs and the confounders as well. <br />
<br />
The posterior of <math>\theta</math> is needed to be calculate in order to estimate the mechanism <math>g_y</math> as well as the causal effect <math>p(y|do(x_m), x_{-m})</math>, so that it can be explained how changes to each SNP <math>X_m</math> cause changes to the trait <math>Y</math>.<br />
<br />
[[File: eqt5.png|800px|center]]<br />
<br />
Note that the latent structure <math>p(z|x, y)</math> is assumed known.<br />
<br />
In general, causal inference with latent confounders can be dangerous: it uses the data twice, and thus it may bias the estimates of each arrow <math>X_m → Y</math>. Why is this justified? This is answered below:<br />
<br />
'''Proposition 1'''. Assume the causal graph of Figure 2 (left) is correct and that the true distribution resides in some configuration of the parameters of the causal model (Figure 2 (right)). Then the posterior <math>p(θ | x, y)<br />
</math> provides a consistent estimator of the causal mechanism <math>f_y</math>.<br />
<br />
Proposition 1 rigorizes previous methods in the framework of probabilistic causal models. The intuition is that as more SNPs arrive (“M → ∞, N fixed”), the posterior concentrates at the true confounders <math>z_n</math>, and thus we can estimate the causal mechanism given each data point’s confounder <math>z_n</math>. As more data points arrive (“N → ∞, M fixed”), we can estimate the causal mechanism given any confounder <math>z_n</math> as there is an infinity of them.<br />
<br />
===Implicit Causal Model with a Latent Confounder===<br />
This section is the algorithm and functions to implementing an implicit causal model for GWAS.<br />
<br />
====Generative Process of Confounders <math>z_n</math>.====<br />
The distribution of confounders is set as standard normal. <math>z_n \in R^K</math> , where <math>K</math> is the dimension of <math>z_n</math> and <math>K</math> should make the latent space as close as possible to the true population structural. <br />
<br />
====Generative Process of SNPs <math>x_{nm}</math>.====<br />
Given SNP is coded for,<br />
<br />
[[File: SNP.png|300px|center]]<br />
<br />
The authors defined a <math>Binomial(2,\pi_{nm})</math> distribution on <math>x_{nm}</math>. And used logistic factor analysis to design the SNP matrix.<br />
<br />
[[File: gpx.png|800px|center]]<br />
<br />
A SNP matrix looks like this:<br />
[[File: SNP_matrix.png|200px|center]]<br />
<br />
<br />
Since logistic factor analysis makes strong assumptions, this paper suggests using a neural network to relax these assumptions,<br />
<br />
[[File: gpxnn.png|800px|center]]<br />
<br />
This renders the outputs to be a full <math>N*M</math> matrix due the the variables <math>w_m</math>, which act as principal component in PCA. Here, <math>\phi</math> has a standard normal prior distribution. The weights <math>w</math> and biases <math>\phi</math> are shared over the <math>m</math> SNPs and <math>n</math> individuals, which makes it possible to learn nonlinear interactions between <math>z_n</math> and <math>w_m</math>.<br />
<br />
====Generative Process of Traits <math>y_n</math>.====<br />
Previously, each trait is modeled by a linear regression,<br />
<br />
[[File: gpy.png|800px|center]]<br />
<br />
This also has very strong assumptions on SNPs, interactions, and additive noise. It can also be replaced by a neural network which only outputs a scalar,<br />
<br />
[[File: gpynn.png|800px|center]]<br />
<br />
<br />
==Likelihood-free Variational Inference==<br />
Calculating the posterior of <math>\theta</math> is the key of applying the implicit causal model with latent confounders.<br />
<br />
[[File: eqt5.png|800px|center]]<br />
<br />
could be reduces to <br />
<br />
[[File: lfvi1.png|800px|center]]<br />
<br />
However, with implicit models, integrating over a nonlinear function could be suffered. The authors applied likelihood-free variational inference (LFVI). LFVI proposes a family of distribution over the latent variables. Here the variables <math>w_m</math> and <math>z_n</math> are all assumed to be Normal,<br />
<br />
[[File: lfvi2.png|700px|center]]<br />
<br />
For LFVI applied to GWAS, the algorithm which similar to the EM algorithm has been used:<br />
[[File: em.png|800px|center]]<br />
<br />
==Empirical Study==<br />
The authors performed simulation on 100,000 SNPs, 940 to 5,000 individuals, and across 100 replications of 11 settings. <br />
Four methods were compared: <br />
<br />
* implicit causal model (ICM);<br />
* PCA with linear regression (PCA); <br />
* a linear mixed model (LMM); <br />
* logistic factor analysis with inverse regression (GCAT).<br />
<br />
The feedforward neural networks for traits and SNPs are fully connected with two hidden layers using ReLU activation function, and batch normalization. <br />
<br />
===Simulation Study===<br />
Based on real genomic data, a true model is applied to generate the SNPs and traits for each configuration. <br />
There are four datasets used in this simulation study: <br />
<br />
# HapMap [Balding-Nichols model]<br />
# 1000 Genomes Project (TGP) [PCA]<br />
#* Human Genome Diversity project (HGDP) [PCA]<br />
#* HGDP [Pritchard-Stephens-Donelly model] <br />
# A latent spatial position of individuals for population structure [spatial]<br />
<br />
<br />
The table shows the prediction accuracy. The accuracy is calculated by the rate of the number of true positives divide the number of true positives plus false positives. True positives measure the proportion of positives that are correctly identified as such (e.g. the percentage of SNPs which are correctly identified as having the causal relation with the trait). In contrast, false positives state the SNPs has the causal relation with the trait when they don’t. The closer the rate to 1, the better the model is since false positives are considered as the wrong prediction.<br />
<br />
[[File: table_1.png|650px|center|]]<br />
<br />
The result represented above shows that the implicit causal model has the best performance among these four models in every situation. Especially, other models tend to do poorly on PSD and Spatial when <math>a</math> is small, but the ICM achieved a significantly high rate. The only comparable method to ICM is GCAT, when applying to simpler configurations.<br />
<br />
<br />
===Real-data Analysis===<br />
They also applied ICM to GWAS of Northern Finland Birth Cohorts, which measure 10 metabolic traits and also contain 324,160 SNPs and 5,027 individuals. The data came from the database of Genotypes and Phenotypes (dbGaP) and used the same preprocessing as Song et al. Ten implicit causal models were fitted, one for each trait to be modeled. For each of the 10 implicit causal models the dimension of the counfounders was set to be six, same as what was used in the paper by Song. The SNP network used 512 hidden units in both layers and the trait network used 32 and 256. et al. for comparable models in Table 2.<br />
<br />
[[File: table_2.png|650px|center|]]<br />
<br />
The numbers in the above table are the number of significant loci for each of the 10 traits. The number for other methods, such as GCAT, LMM, PCA, and "uncorrected" (association tests without accounting for hidden relatedness of study samples) are obtained from other papers. By comparison, the ICM reached the level of the best previous model for each trait.<br />
<br />
==Conclusion==<br />
This paper introduced implicit causal models in order to account for nonlinear complex causal relationships, and applied the method to GWAS. It can not only capture important interactions between genes within an individual and among population level, but also can adjust for latent confounders by taking account of the latent variables into the model.<br />
<br />
By the simulation study, the authors proved that the implicit causal model could beat other methods by 15-45.3% on a variety of datasets with variations on parameters.<br />
<br />
The authors also believed this GWAS application is only the start of the usage of implicit causal models. The authors suggest that it might also be successfully used in the design of dynamic theories in high-energy physics or for modeling discrete choices in economics.<br />
<br />
==Critique==<br />
This paper is an interesting and novel work. The main contribution of this paper is to connect the statistical genetics and the machine learning methodology. The method is technically sound and does indeed generalize techniques currently used in statistical genetics.<br />
<br />
The neural network used in this paper is a very simple feed-forward 2 hidden-layer neural network, but the idea of where to use the neural network is crucial and might be significant in GWAS. <br />
<br />
It has limitations as well. The empirical example in this paper is too easy, and far away from the realistic situation. Despite the simulation study showing some competing results, the Northern Finland Birth Cohort Data application did not demonstrate the advantage of using implicit causal model over the previous methods, such as GCAT or LMM.<br />
<br />
Another limitation is about linkage disequilibrium as the authors stated as well. SNPs are not completely independent of each other; usually, they have correlations when the alleles at close locus. They did not consider this complex case, rather they only considered the simplest case where they assumed all the SNPs are independent.<br />
<br />
Furthermore, one SNP maybe does not have enough power to explain the causal relationship. Recent papers indicate that causation to a trait may involve multiple SNPs.<br />
This could be a future work as well.<br />
<br />
==References==<br />
Tran D, Blei D M. Implicit Causal Models for Genome-wide Association Studies[J]. arXiv preprint arXiv:1710.10742, 2017.<br />
<br />
Patrik O Hoyer, Dominik Janzing, Joris M Mooij, Jonas Peters, and Prof Bernhard Schölkopf. Non- linear causal discovery with additive noise models. In Neural Information Processing Systems, 2009.<br />
<br />
Alkes L Price, Nick J Patterson, Robert M Plenge, Michael E Weinblatt, Nancy A Shadick, and David Reich. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics, 38(8):904–909, 2006.<br />
<br />
Minsun Song, Wei Hao, and John D Storey. Testing for genetic associations in arbitrarily structured populations. Nature, 47(5):550–554, 2015.<br />
<br />
Dustin Tran, Rajesh Ranganath, and David M Blei. Hierarchical implicit models and likelihood-free variational inference. In Neural Information Processing Systems, 2017.<br />
<br />
== Implicit causal model in Edward ==<br />
[[File: coddde.png|600px]]</div>F7xiahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Implicit_Causal_Models_for_Genome-wide_Association_Studies&diff=36449stat946w18/Implicit Causal Models for Genome-wide Association Studies2018-04-21T03:47:22Z<p>F7xia: /* Implicit causal model in Edward */</p>
<hr />
<div>==Introduction and Motivation==<br />
There is currently much progress in probabilistic models which could lead to the development of rich generative models. The models have been applied with neural networks, implicit densities, and with scalable algorithms to very large data for their Bayesian inference. However, most of the models are focused on capturing statistical relationships rather than causal relationships. Causal relationships are relationships where one event is a result of another event, i.e. a cause and effect. Causal models give us a sense of how manipulating the generative process could change the final results. <br />
<br />
Genome-wide association studies (GWAS) are examples of causal relationships. Genome is basically the sum of all DNAs in an organism and contain information about the organism's attributes. Specifically, GWAS is about figuring out how genetic factors cause disease among humans. Here the genetic factors we are referring to are single nucleotide polymorphisms (SNPs), and getting a particular disease is treated as a trait, i.e., the outcome. In order to know about the reason of developing a disease and to cure it, the causation between SNPs and diseases is investigated: first, predict which one or more SNPs cause the disease; second, target the selected SNPs to cure the disease. <br />
<br />
The figure below depicts an example Manhattan plot for a GWAS. Each dot represents an SNP. The x-axis is the chromosome location, and the y-axis is the negative log of the association p-value between the SNP and the disease, so points with the largest values represent strongly associated risk loci.<br />
<br />
[[File:gwas-example.jpg|500px|center]]<br />
<br />
This paper focuses on two challenges to combining modern probabilistic models and causality. The first one is how to build rich causal models with specific needs by GWAS. In general, probabilistic causal models involve a function <math>f</math> and a noise <math>n</math>. For working simplicity, we usually assume <math>f</math> as a linear model with Gaussian noise. However problems like GWAS require models with nonlinear, learnable interactions among the inputs and the noise.<br />
<br />
The second challenge is how to address latent population-based confounders. Latent confounders are issues when we apply the causal models since we cannot observe them nor know the underlying structure. For example, in GWAS, both latent population structure, i.e., subgroups in the population with ancestry differences, and relatedness among sample individuals produce spurious correlations among SNPs to the trait of interest. The existing methods cannot easily accommodate the complex latent structure.<br />
<br />
For the first challenge, the authors develop implicit causal models, a class of causal models that leverages neural architectures with an implicit density. With GWAS, implicit causal models generalize previous methods to capture important nonlinearities, such as gene-gene and gene-population interaction. Building on this, for the second challenge, they describe an implicit causal model that adjusts for population-confounders by sharing strength across examples (genes).<br />
<br />
There has been an increasing number of works on causal models which focus on causal discovery and typically have strong assumptions such as Gaussian processes on noise variable or nonlinearities for the main function.<br />
<br />
==Implicit Causal Models==<br />
Implicit causal models are an extension of probabilistic causal models. Probabilistic causal models will be introduced first.<br />
<br />
=== Probabilistic Causal Models ===<br />
Probabilistic causal models have two parts: deterministic functions of noise and other variables. Consider background noise <math>\epsilon</math>, representing unknown background quantities which are jointly independent and global variable <math>\beta</math>, some function of this noise, where<br />
<br />
[[File: eq1.1.png|800px|center]]<br />
<br />
Each <math>\beta</math> and <math>x</math> is a function of noise; <math>y</math> is a function of noise and <math>x</math>，<br />
<br />
[[File: eqt1.png|800px|center]]<br />
<br />
The target is the causal mechanism <math>f_y</math> so that the causal effect <math>p(y|do(X=x),\beta)</math> can be calculated. <math>do(X=x)</math> means that we specify a value of <math>X</math> under the fixed structure <math>\beta</math>. By other paper’s work, it is assumed that <math>p(y|do(x),\beta) = p(y|x, \beta)</math>.<br />
<br />
[[File: f_1.png|650px|center|]]<br />
<br />
<br />
An example of probabilistic causal models is additive noise model. <br />
<br />
[[File: eq2.1.png|800px|center]]<br />
<br />
<math>f(.)</math> is usually a linear function or spline functions for nonlinearities. <math>\epsilon</math> is assumed to be standard normal, as well as <math>y</math>. Thus the posterior <math>p(\theta | x, y, \beta)</math> can be represented as <br />
<br />
[[File: eqt2.png|800px|center]]<br />
<br />
where <math>p(\theta)</math> is the prior which is known. Then, variational inference or MCMC can be applied to calculate the posterior distribution.<br />
<br />
===Implicit Causal Models===<br />
The difference between implicit causal models and probabilistic causal models is the noise variable. Instead of using an additive noise term, implicit causal models directly take noise <math>\epsilon</math> as input and outputs <math>x</math> given parameter <math>\theta</math>.<br />
<br />
<math><br />
x=g(\epsilon | \theta), \epsilon \tilde s(\cdot)<br />
</math><br />
<br />
The causal diagram has changed to:<br />
<br />
[[File: f_2.png|650px|center|]]<br />
<br />
<br />
They used fully connected neural network with a fair amount of hidden units to approximate each causal mechanism. Below is the formal description: <br />
<br />
[[File: theorem.png|650px|center|]]<br />
<br />
==Implicit Causal Models with Latent Confounders==<br />
Previously, they assumed the global structure is observed. Next, the unobserved scenario is being considered.<br />
<br />
===Causal Inference with a Latent Confounder===<br />
Similar to before, the interest is the causal effect <math>p(y|do(x_m), x_{-m})</math>. Here, the SNPs other than <math>x_m</math> is also under consideration. However, it is confounded by the unobserved confounder <math>z_n</math>. As a result, the standard inference method cannot be used in this case.<br />
<br />
The paper proposed a new method which include the latent confounders. For each subject <math>n=1,…,N</math> and each SNP <math>m=1,…,M</math>,<br />
<br />
[[File: eqt4.png|800px|center]]<br />
<br />
<br />
The mechanism for latent confounder <math>z_n</math> is assumed to be known. SNPs depend on the confounders and the trait depends on all the SNPs and the confounders as well. <br />
<br />
The posterior of <math>\theta</math> is needed to be calculate in order to estimate the mechanism <math>g_y</math> as well as the causal effect <math>p(y|do(x_m), x_{-m})</math>, so that it can be explained how changes to each SNP <math>X_m</math> cause changes to the trait <math>Y</math>.<br />
<br />
[[File: eqt5.png|800px|center]]<br />
<br />
Note that the latent structure <math>p(z|x, y)</math> is assumed known.<br />
<br />
In general, causal inference with latent confounders can be dangerous: it uses the data twice, and thus it may bias the estimates of each arrow <math>X_m → Y</math>. Why is this justified? This is answered below:<br />
<br />
'''Proposition 1'''. Assume the causal graph of Figure 2 (left) is correct and that the true distribution resides in some configuration of the parameters of the causal model (Figure 2 (right)). Then the posterior <math>p(θ | x, y)<br />
</math> provides a consistent estimator of the causal mechanism <math>f_y</math>.<br />
<br />
Proposition 1 rigorizes previous methods in the framework of probabilistic causal models. The intuition is that as more SNPs arrive (“M → ∞, N fixed”), the posterior concentrates at the true confounders <math>z_n</math>, and thus we can estimate the causal mechanism given each data point’s confounder <math>z_n</math>. As more data points arrive (“N → ∞, M fixed”), we can estimate the causal mechanism given any confounder <math>z_n</math> as there is an infinity of them.<br />
<br />
===Implicit Causal Model with a Latent Confounder===<br />
This section is the algorithm and functions to implementing an implicit causal model for GWAS.<br />
<br />
====Generative Process of Confounders <math>z_n</math>.====<br />
The distribution of confounders is set as standard normal. <math>z_n \in R^K</math> , where <math>K</math> is the dimension of <math>z_n</math> and <math>K</math> should make the latent space as close as possible to the true population structural. <br />
<br />
====Generative Process of SNPs <math>x_{nm}</math>.====<br />
Given SNP is coded for,<br />
<br />
[[File: SNP.png|300px|center]]<br />
<br />
The authors defined a <math>Binomial(2,\pi_{nm})</math> distribution on <math>x_{nm}</math>. And used logistic factor analysis to design the SNP matrix.<br />
<br />
[[File: gpx.png|800px|center]]<br />
<br />
A SNP matrix looks like this:<br />
[[File: SNP_matrix.png|200px|center]]<br />
<br />
<br />
Since logistic factor analysis makes strong assumptions, this paper suggests using a neural network to relax these assumptions,<br />
<br />
[[File: gpxnn.png|800px|center]]<br />
<br />
This renders the outputs to be a full <math>N*M</math> matrix due the the variables <math>w_m</math>, which act as principal component in PCA. Here, <math>\phi</math> has a standard normal prior distribution. The weights <math>w</math> and biases <math>\phi</math> are shared over the <math>m</math> SNPs and <math>n</math> individuals, which makes it possible to learn nonlinear interactions between <math>z_n</math> and <math>w_m</math>.<br />
<br />
====Generative Process of Traits <math>y_n</math>.====<br />
Previously, each trait is modeled by a linear regression,<br />
<br />
[[File: gpy.png|800px|center]]<br />
<br />
This also has very strong assumptions on SNPs, interactions, and additive noise. It can also be replaced by a neural network which only outputs a scalar,<br />
<br />
[[File: gpynn.png|800px|center]]<br />
<br />
<br />
==Likelihood-free Variational Inference==<br />
Calculating the posterior of <math>\theta</math> is the key of applying the implicit causal model with latent confounders.<br />
<br />
[[File: eqt5.png|800px|center]]<br />
<br />
could be reduces to <br />
<br />
[[File: lfvi1.png|800px|center]]<br />
<br />
However, with implicit models, integrating over a nonlinear function could be suffered. The authors applied likelihood-free variational inference (LFVI). LFVI proposes a family of distribution over the latent variables. Here the variables <math>w_m</math> and <math>z_n</math> are all assumed to be Normal,<br />
<br />
[[File: lfvi2.png|700px|center]]<br />
<br />
For LFVI applied to GWAS, the algorithm which similar to the EM algorithm has been used:<br />
[[File: em.png|800px|center]]<br />
<br />
==Empirical Study==<br />
The authors performed simulation on 100,000 SNPs, 940 to 5,000 individuals, and across 100 replications of 11 settings. <br />
Four methods were compared: <br />
<br />
* implicit causal model (ICM);<br />
* PCA with linear regression (PCA); <br />
* a linear mixed model (LMM); <br />
* logistic factor analysis with inverse regression (GCAT).<br />
<br />
The feedforward neural networks for traits and SNPs are fully connected with two hidden layers using ReLU activation function, and batch normalization. <br />
<br />
===Simulation Study===<br />
Based on real genomic data, a true model is applied to generate the SNPs and traits for each configuration. <br />
There are four datasets used in this simulation study: <br />
<br />
# HapMap [Balding-Nichols model]<br />
# 1000 Genomes Project (TGP) [PCA]<br />
#* Human Genome Diversity project (HGDP) [PCA]<br />
#* HGDP [Pritchard-Stephens-Donelly model] <br />
# A latent spatial position of individuals for population structure [spatial]<br />
<br />
<br />
The table shows the prediction accuracy. The accuracy is calculated by the rate of the number of true positives divide the number of true positives plus false positives. True positives measure the proportion of positives that are correctly identified as such (e.g. the percentage of SNPs which are correctly identified as having the causal relation with the trait). In contrast, false positives state the SNPs has the causal relation with the trait when they don’t. The closer the rate to 1, the better the model is since false positives are considered as the wrong prediction.<br />
<br />
[[File: table_1.png|650px|center|]]<br />
<br />
The result represented above shows that the implicit causal model has the best performance among these four models in every situation. Especially, other models tend to do poorly on PSD and Spatial when <math>a</math> is small, but the ICM achieved a significantly high rate. The only comparable method to ICM is GCAT, when applying to simpler configurations.<br />
<br />
<br />
===Real-data Analysis===<br />
They also applied ICM to GWAS of Northern Finland Birth Cohorts, which measure 10 metabolic traits and also contain 324,160 SNPs and 5,027 individuals. The data came from the database of Genotypes and Phenotypes (dbGaP) and used the same preprocessing as Song et al. Ten implicit causal models were fitted, one for each trait to be modeled. For each of the 10 implicit causal models the dimension of the counfounders was set to be six, same as what was used in the paper by Song. The SNP network used 512 hidden units in both layers and the trait network used 32 and 256. et al. for comparable models in Table 2.<br />
<br />
[[File: table_2.png|650px|center|]]<br />
<br />
The numbers in the above table are the number of significant loci for each of the 10 traits. The number for other methods, such as GCAT, LMM, PCA, and "uncorrected" (association tests without accounting for hidden relatedness of study samples) are obtained from other papers. By comparison, the ICM reached the level of the best previous model for each trait.<br />
<br />
==Conclusion==<br />
This paper introduced implicit causal models in order to account for nonlinear complex causal relationships, and applied the method to GWAS. It can not only capture important interactions between genes within an individual and among population level, but also can adjust for latent confounders by taking account of the latent variables into the model.<br />
<br />
By the simulation study, the authors proved that the implicit causal model could beat other methods by 15-45.3% on a variety of datasets with variations on parameters.<br />
<br />
The authors also believed this GWAS application is only the start of the usage of implicit causal models. The authors suggest that it might also be successfully used in the design of dynamic theories in high-energy physics or for modeling discrete choices in economics.<br />
<br />
==Critique==<br />
This paper is an interesting and novel work. The main contribution of this paper is to connect the statistical genetics and the machine learning methodology. The method is technically sound and does indeed generalize techniques currently used in statistical genetics.<br />
<br />
The neural network used in this paper is a very simple feed-forward 2 hidden-layer neural network, but the idea of where to use the neural network is crucial and might be significant in GWAS. <br />
<br />
It has limitations as well. The empirical example in this paper is too easy, and far away from the realistic situation. Despite the simulation study showing some competing results, the Northern Finland Birth Cohort Data application did not demonstrate the advantage of using implicit causal model over the previous methods, such as GCAT or LMM.<br />
<br />
Another limitation is about linkage disequilibrium as the authors stated as well. SNPs are not completely independent of each other; usually, they have correlations when the alleles at close locus. They did not consider this complex case, rather they only considered the simplest case where they assumed all the SNPs are independent.<br />
<br />
Furthermore, one SNP maybe does not have enough power to explain the causal relationship. Recent papers indicate that causation to a trait may involve multiple SNPs.<br />
This could be a future work as well.<br />
<br />
==References==<br />
Tran D, Blei D M. Implicit Causal Models for Genome-wide Association Studies[J]. arXiv preprint arXiv:1710.10742, 2017.<br />
<br />
Patrik O Hoyer, Dominik Janzing, Joris M Mooij, Jonas Peters, and Prof Bernhard Schölkopf. Non- linear causal discovery with additive noise models. In Neural Information Processing Systems, 2009.<br />
<br />
Alkes L Price, Nick J Patterson, Robert M Plenge, Michael E Weinblatt, Nancy A Shadick, and David Reich. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics, 38(8):904–909, 2006.<br />
<br />
Minsun Song, Wei Hao, and John D Storey. Testing for genetic associations in arbitrarily structured populations. Nature, 47(5):550–554, 2015.<br />
<br />
Dustin Tran, Rajesh Ranganath, and David M Blei. Hierarchical implicit models and likelihood-free variational inference. In Neural Information Processing Systems, 2017.<br />
<br />
== Implicit causal model in Edward ==<br />
[[File: coddde.png|600px|center]]</div>F7xiahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:coddde.png&diff=36448File:coddde.png2018-04-21T03:46:36Z<p>F7xia: </p>
<hr />
<div></div>F7xiahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Implicit_Causal_Models_for_Genome-wide_Association_Studies&diff=36447stat946w18/Implicit Causal Models for Genome-wide Association Studies2018-04-21T03:46:20Z<p>F7xia: </p>
<hr />
<div>==Introduction and Motivation==<br />
There is currently much progress in probabilistic models which could lead to the development of rich generative models. The models have been applied with neural networks, implicit densities, and with scalable algorithms to very large data for their Bayesian inference. However, most of the models are focused on capturing statistical relationships rather than causal relationships. Causal relationships are relationships where one event is a result of another event, i.e. a cause and effect. Causal models give us a sense of how manipulating the generative process could change the final results. <br />
<br />
Genome-wide association studies (GWAS) are examples of causal relationships. Genome is basically the sum of all DNAs in an organism and contain information about the organism's attributes. Specifically, GWAS is about figuring out how genetic factors cause disease among humans. Here the genetic factors we are referring to are single nucleotide polymorphisms (SNPs), and getting a particular disease is treated as a trait, i.e., the outcome. In order to know about the reason of developing a disease and to cure it, the causation between SNPs and diseases is investigated: first, predict which one or more SNPs cause the disease; second, target the selected SNPs to cure the disease. <br />
<br />
The figure below depicts an example Manhattan plot for a GWAS. Each dot represents an SNP. The x-axis is the chromosome location, and the y-axis is the negative log of the association p-value between the SNP and the disease, so points with the largest values represent strongly associated risk loci.<br />
<br />
[[File:gwas-example.jpg|500px|center]]<br />
<br />
This paper focuses on two challenges to combining modern probabilistic models and causality. The first one is how to build rich causal models with specific needs by GWAS. In general, probabilistic causal models involve a function <math>f</math> and a noise <math>n</math>. For working simplicity, we usually assume <math>f</math> as a linear model with Gaussian noise. However problems like GWAS require models with nonlinear, learnable interactions among the inputs and the noise.<br />
<br />
The second challenge is how to address latent population-based confounders. Latent confounders are issues when we apply the causal models since we cannot observe them nor know the underlying structure. For example, in GWAS, both latent population structure, i.e., subgroups in the population with ancestry differences, and relatedness among sample individuals produce spurious correlations among SNPs to the trait of interest. The existing methods cannot easily accommodate the complex latent structure.<br />
<br />
For the first challenge, the authors develop implicit causal models, a class of causal models that leverages neural architectures with an implicit density. With GWAS, implicit causal models generalize previous methods to capture important nonlinearities, such as gene-gene and gene-population interaction. Building on this, for the second challenge, they describe an implicit causal model that adjusts for population-confounders by sharing strength across examples (genes).<br />
<br />
There has been an increasing number of works on causal models which focus on causal discovery and typically have strong assumptions such as Gaussian processes on noise variable or nonlinearities for the main function.<br />
<br />
==Implicit Causal Models==<br />
Implicit causal models are an extension of probabilistic causal models. Probabilistic causal models will be introduced first.<br />
<br />
=== Probabilistic Causal Models ===<br />
Probabilistic causal models have two parts: deterministic functions of noise and other variables. Consider background noise <math>\epsilon</math>, representing unknown background quantities which are jointly independent and global variable <math>\beta</math>, some function of this noise, where<br />
<br />
[[File: eq1.1.png|800px|center]]<br />
<br />
Each <math>\beta</math> and <math>x</math> is a function of noise; <math>y</math> is a function of noise and <math>x</math>，<br />
<br />
[[File: eqt1.png|800px|center]]<br />
<br />
The target is the causal mechanism <math>f_y</math> so that the causal effect <math>p(y|do(X=x),\beta)</math> can be calculated. <math>do(X=x)</math> means that we specify a value of <math>X</math> under the fixed structure <math>\beta</math>. By other paper’s work, it is assumed that <math>p(y|do(x),\beta) = p(y|x, \beta)</math>.<br />
<br />
[[File: f_1.png|650px|center|]]<br />
<br />
<br />
An example of probabilistic causal models is additive noise model. <br />
<br />
[[File: eq2.1.png|800px|center]]<br />
<br />
<math>f(.)</math> is usually a linear function or spline functions for nonlinearities. <math>\epsilon</math> is assumed to be standard normal, as well as <math>y</math>. Thus the posterior <math>p(\theta | x, y, \beta)</math> can be represented as <br />
<br />
[[File: eqt2.png|800px|center]]<br />
<br />
where <math>p(\theta)</math> is the prior which is known. Then, variational inference or MCMC can be applied to calculate the posterior distribution.<br />
<br />
===Implicit Causal Models===<br />
The difference between implicit causal models and probabilistic causal models is the noise variable. Instead of using an additive noise term, implicit causal models directly take noise <math>\epsilon</math> as input and outputs <math>x</math> given parameter <math>\theta</math>.<br />
<br />
<math><br />
x=g(\epsilon | \theta), \epsilon \tilde s(\cdot)<br />
</math><br />
<br />
The causal diagram has changed to:<br />
<br />
[[File: f_2.png|650px|center|]]<br />
<br />
<br />
They used fully connected neural network with a fair amount of hidden units to approximate each causal mechanism. Below is the formal description: <br />
<br />
[[File: theorem.png|650px|center|]]<br />
<br />
==Implicit Causal Models with Latent Confounders==<br />
Previously, they assumed the global structure is observed. Next, the unobserved scenario is being considered.<br />
<br />
===Causal Inference with a Latent Confounder===<br />
Similar to before, the interest is the causal effect <math>p(y|do(x_m), x_{-m})</math>. Here, the SNPs other than <math>x_m</math> is also under consideration. However, it is confounded by the unobserved confounder <math>z_n</math>. As a result, the standard inference method cannot be used in this case.<br />
<br />
The paper proposed a new method which include the latent confounders. For each subject <math>n=1,…,N</math> and each SNP <math>m=1,…,M</math>,<br />
<br />
[[File: eqt4.png|800px|center]]<br />
<br />
<br />
The mechanism for latent confounder <math>z_n</math> is assumed to be known. SNPs depend on the confounders and the trait depends on all the SNPs and the confounders as well. <br />
<br />
The posterior of <math>\theta</math> is needed to be calculate in order to estimate the mechanism <math>g_y</math> as well as the causal effect <math>p(y|do(x_m), x_{-m})</math>, so that it can be explained how changes to each SNP <math>X_m</math> cause changes to the trait <math>Y</math>.<br />
<br />
[[File: eqt5.png|800px|center]]<br />
<br />
Note that the latent structure <math>p(z|x, y)</math> is assumed known.<br />
<br />
In general, causal inference with latent confounders can be dangerous: it uses the data twice, and thus it may bias the estimates of each arrow <math>X_m → Y</math>. Why is this justified? This is answered below:<br />
<br />
'''Proposition 1'''. Assume the causal graph of Figure 2 (left) is correct and that the true distribution resides in some configuration of the parameters of the causal model (Figure 2 (right)). Then the posterior <math>p(θ | x, y)<br />
</math> provides a consistent estimator of the causal mechanism <math>f_y</math>.<br />
<br />
Proposition 1 rigorizes previous methods in the framework of probabilistic causal models. The intuition is that as more SNPs arrive (“M → ∞, N fixed”), the posterior concentrates at the true confounders <math>z_n</math>, and thus we can estimate the causal mechanism given each data point’s confounder <math>z_n</math>. As more data points arrive (“N → ∞, M fixed”), we can estimate the causal mechanism given any confounder <math>z_n</math> as there is an infinity of them.<br />
<br />
===Implicit Causal Model with a Latent Confounder===<br />
This section is the algorithm and functions to implementing an implicit causal model for GWAS.<br />
<br />
====Generative Process of Confounders <math>z_n</math>.====<br />
The distribution of confounders is set as standard normal. <math>z_n \in R^K</math> , where <math>K</math> is the dimension of <math>z_n</math> and <math>K</math> should make the latent space as close as possible to the true population structural. <br />
<br />
====Generative Process of SNPs <math>x_{nm}</math>.====<br />
Given SNP is coded for,<br />
<br />
[[File: SNP.png|300px|center]]<br />
<br />
The authors defined a <math>Binomial(2,\pi_{nm})</math> distribution on <math>x_{nm}</math>. And used logistic factor analysis to design the SNP matrix.<br />
<br />
[[File: gpx.png|800px|center]]<br />
<br />
A SNP matrix looks like this:<br />
[[File: SNP_matrix.png|200px|center]]<br />
<br />
<br />
Since logistic factor analysis makes strong assumptions, this paper suggests using a neural network to relax these assumptions,<br />
<br />
[[File: gpxnn.png|800px|center]]<br />
<br />
This renders the outputs to be a full <math>N*M</math> matrix due the the variables <math>w_m</math>, which act as principal component in PCA. Here, <math>\phi</math> has a standard normal prior distribution. The weights <math>w</math> and biases <math>\phi</math> are shared over the <math>m</math> SNPs and <math>n</math> individuals, which makes it possible to learn nonlinear interactions between <math>z_n</math> and <math>w_m</math>.<br />
<br />
====Generative Process of Traits <math>y_n</math>.====<br />
Previously, each trait is modeled by a linear regression,<br />
<br />
[[File: gpy.png|800px|center]]<br />
<br />
This also has very strong assumptions on SNPs, interactions, and additive noise. It can also be replaced by a neural network which only outputs a scalar,<br />
<br />
[[File: gpynn.png|800px|center]]<br />
<br />
<br />
==Likelihood-free Variational Inference==<br />
Calculating the posterior of <math>\theta</math> is the key of applying the implicit causal model with latent confounders.<br />
<br />
[[File: eqt5.png|800px|center]]<br />
<br />
could be reduces to <br />
<br />
[[File: lfvi1.png|800px|center]]<br />
<br />
However, with implicit models, integrating over a nonlinear function could be suffered. The authors applied likelihood-free variational inference (LFVI). LFVI proposes a family of distribution over the latent variables. Here the variables <math>w_m</math> and <math>z_n</math> are all assumed to be Normal,<br />
<br />
[[File: lfvi2.png|700px|center]]<br />
<br />
For LFVI applied to GWAS, the algorithm which similar to the EM algorithm has been used:<br />
[[File: em.png|800px|center]]<br />
<br />
==Empirical Study==<br />
The authors performed simulation on 100,000 SNPs, 940 to 5,000 individuals, and across 100 replications of 11 settings. <br />
Four methods were compared: <br />
<br />
* implicit causal model (ICM);<br />
* PCA with linear regression (PCA); <br />
* a linear mixed model (LMM); <br />
* logistic factor analysis with inverse regression (GCAT).<br />
<br />
The feedforward neural networks for traits and SNPs are fully connected with two hidden layers using ReLU activation function, and batch normalization. <br />
<br />
===Simulation Study===<br />
Based on real genomic data, a true model is applied to generate the SNPs and traits for each configuration. <br />
There are four datasets used in this simulation study: <br />
<br />
# HapMap [Balding-Nichols model]<br />
# 1000 Genomes Project (TGP) [PCA]<br />
#* Human Genome Diversity project (HGDP) [PCA]<br />
#* HGDP [Pritchard-Stephens-Donelly model] <br />
# A latent spatial position of individuals for population structure [spatial]<br />
<br />
<br />
The table shows the prediction accuracy. The accuracy is calculated by the rate of the number of true positives divide the number of true positives plus false positives. True positives measure the proportion of positives that are correctly identified as such (e.g. the percentage of SNPs which are correctly identified as having the causal relation with the trait). In contrast, false positives state the SNPs has the causal relation with the trait when they don’t. The closer the rate to 1, the better the model is since false positives are considered as the wrong prediction.<br />
<br />
[[File: table_1.png|650px|center|]]<br />
<br />
The result represented above shows that the implicit causal model has the best performance among these four models in every situation. Especially, other models tend to do poorly on PSD and Spatial when <math>a</math> is small, but the ICM achieved a significantly high rate. The only comparable method to ICM is GCAT, when applying to simpler configurations.<br />
<br />
<br />
===Real-data Analysis===<br />
They also applied ICM to GWAS of Northern Finland Birth Cohorts, which measure 10 metabolic traits and also contain 324,160 SNPs and 5,027 individuals. The data came from the database of Genotypes and Phenotypes (dbGaP) and used the same preprocessing as Song et al. Ten implicit causal models were fitted, one for each trait to be modeled. For each of the 10 implicit causal models the dimension of the counfounders was set to be six, same as what was used in the paper by Song. The SNP network used 512 hidden units in both layers and the trait network used 32 and 256. et al. for comparable models in Table 2.<br />
<br />
[[File: table_2.png|650px|center|]]<br />
<br />
The numbers in the above table are the number of significant loci for each of the 10 traits. The number for other methods, such as GCAT, LMM, PCA, and "uncorrected" (association tests without accounting for hidden relatedness of study samples) are obtained from other papers. By comparison, the ICM reached the level of the best previous model for each trait.<br />
<br />
==Conclusion==<br />
This paper introduced implicit causal models in order to account for nonlinear complex causal relationships, and applied the method to GWAS. It can not only capture important interactions between genes within an individual and among population level, but also can adjust for latent confounders by taking account of the latent variables into the model.<br />
<br />
By the simulation study, the authors proved that the implicit causal model could beat other methods by 15-45.3% on a variety of datasets with variations on parameters.<br />
<br />
The authors also believed this GWAS application is only the start of the usage of implicit causal models. The authors suggest that it might also be successfully used in the design of dynamic theories in high-energy physics or for modeling discrete choices in economics.<br />
<br />
==Critique==<br />
This paper is an interesting and novel work. The main contribution of this paper is to connect the statistical genetics and the machine learning methodology. The method is technically sound and does indeed generalize techniques currently used in statistical genetics.<br />
<br />
The neural network used in this paper is a very simple feed-forward 2 hidden-layer neural network, but the idea of where to use the neural network is crucial and might be significant in GWAS. <br />
<br />
It has limitations as well. The empirical example in this paper is too easy, and far away from the realistic situation. Despite the simulation study showing some competing results, the Northern Finland Birth Cohort Data application did not demonstrate the advantage of using implicit causal model over the previous methods, such as GCAT or LMM.<br />
<br />
Another limitation is about linkage disequilibrium as the authors stated as well. SNPs are not completely independent of each other; usually, they have correlations when the alleles at close locus. They did not consider this complex case, rather they only considered the simplest case where they assumed all the SNPs are independent.<br />
<br />
Furthermore, one SNP maybe does not have enough power to explain the causal relationship. Recent papers indicate that causation to a trait may involve multiple SNPs.<br />
This could be a future work as well.<br />
<br />
==References==<br />
Tran D, Blei D M. Implicit Causal Models for Genome-wide Association Studies[J]. arXiv preprint arXiv:1710.10742, 2017.<br />
<br />
Patrik O Hoyer, Dominik Janzing, Joris M Mooij, Jonas Peters, and Prof Bernhard Schölkopf. Non- linear causal discovery with additive noise models. In Neural Information Processing Systems, 2009.<br />
<br />
Alkes L Price, Nick J Patterson, Robert M Plenge, Michael E Weinblatt, Nancy A Shadick, and David Reich. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics, 38(8):904–909, 2006.<br />
<br />
Minsun Song, Wei Hao, and John D Storey. Testing for genetic associations in arbitrarily structured populations. Nature, 47(5):550–554, 2015.<br />
<br />
Dustin Tran, Rajesh Ranganath, and David M Blei. Hierarchical implicit models and likelihood-free variational inference. In Neural Information Processing Systems, 2017.<br />
<br />
== Implicit causal model in Edward ==<br />
[[File: coddde.png|800px|center]]</div>F7xiahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Self_Normalizing_Neural_Networks&diff=36445stat946w18/Self Normalizing Neural Networks2018-04-21T03:26:04Z<p>F7xia: /* Initialization */</p>
<hr />
<div>==Introduction and Motivation==<br />
<br />
While neural networks have been making a lot of headway in improving benchmark results and narrowing the gap with human-level performance, success has been fairly limited to visual and sequential processing tasks through advancements in convolutional network and recurrent network structures. Most data science competitions outside of these domains are still outperformed by algorithms such as gradient boosting and random forests. The traditional (densely connected) feed-forward neural networks (FNNs) are rarely used competitively, and when they do win on rare occasions, they have very shallow network architectures with just up to four layers [10].<br />
<br />
The authors, Klambauer et al., believe that what prevents FNNs from becoming more competitively useful is the inability to train a deeper FNN structure, which would allow the network to learn more levels of abstract representations. Primarily, the difficulty arises due to the instability of gradients in very deep FNNs leading to problems like gradient vanishing/explosion. To have a deeper network, oscillations in the distribution of activations need to be kept under control so that stable gradients can be obtained during training. Several techniques are available to normalize activations, including batch normalization [6], layer normalization [1] and weight normalization [8]. These methods work well with CNNs and RNNs, but not so much with FNNs because backpropagating through normalization parameters introduces additional variance to the gradients, and regularization techniques like dropout further perturb the normalization effect. CNNs and RNNs are less sensitive to such perturbations, presumably due to their weight sharing architecture, but FNNs do not have such a property and thus suffer from high variance in training errors, which hinders learning. Furthermore, the aforementioned normalization techniques involve adding external layers to the model and can slow down computation, which may already be slow when working with very deep FNNs. <br />
<br />
Therefore, the authors were motivated to develop a new FNN implementation that can achieve the intended effect of normalization techniques that works well with stochastic gradient descent and dropout. Self-normalizing neural networks (SNNs) are based on the idea of scaled exponential linear units (SELU), a new activation function introduced in this paper, whose output distribution is proved to converge to a fixed point, thus making it possible to train deeper networks.<br />
<br />
==Notations==<br />
<br />
As the paper (primarily in the supplementary materials) comes with lengthy proofs, important notations are listed first.<br />
<br />
Consider two fully-connected layers, let <math display="inline">x</math> denote the inputs to the second layer, then <math display="inline">z = Wx</math> represents the network inputs of the second layer, and <math display="inline">y = f(z)</math> represents the activations in the second layer.<br />
<br />
Assume that all activations from a lower layer <math display="inline">x_i</math>'s, <math display="inline">1 \leqslant i \leqslant n</math>, have the same mean <math display="inline">\mu := \mathrm{E}(x_i)</math> and the same variance <math display="inline">\nu := \mathrm{Var}(x_i)</math> and that each <math display="inline">y</math> has mean <math display="inline">\widetilde{\mu} := \mathrm{E}(y)</math> and variance <math display="inline">\widetilde{\nu} := \mathrm{Var}(y)</math>, then let <math display="inline">g</math> be the set of functions that maps <math display="inline">(\mu, \nu)</math> to <math display="inline">(\widetilde{\mu}, \widetilde{\nu})</math>. <br />
<br />
For the weight vector <math display="inline">w</math>, <math display="inline">n</math> times the mean of the weight vector is <math display="inline">\omega := \sum_{i = 1}^n \omega_i</math> and <math display="inline">n</math> times the second moment is <math display="inline">\tau := \sum_{i = 1}^{n} w_i^2</math>.<br />
<br />
==Key Concepts==<br />
<br />
===Self-Normalizing Neural-Net (SNN)===<br />
<br />
''A neural network is self-normalizing if it possesses a mapping <math display="inline">g: \Omega \rightarrow \Omega</math> for each activation <math display="inline">y</math> that maps mean and variance from one layer to the next and has a stable and attracting fixed point depending on <math display="inline">(\omega, \tau)</math> in <math display="inline">\Omega</math>. Furthermore, the mean and variance remain in the domain <math display="inline">\Omega</math>, that is <math display="inline">g(\Omega) \subseteq \Omega</math>, where <math display="inline">\Omega = \{ (\mu, \nu) | \mu \in [\mu_{min}, \mu_{max}], \nu \in [\nu_{min}, \nu_{max}] \}</math>. When iteratively applying the mapping <math display="inline">g</math>, each point within <math display="inline">\Omega</math> converges to this fixed point.''<br />
<br />
In other words, in SNNs, if the inputs from an earlier layer (<math display="inline">x</math>) already have their mean and variance within a predefined interval <math display="inline">\Omega</math>, then the activations to the next layer (<math display="inline">y = f(z = Wx)</math>) should remain within those intervals. This is true across all pairs of connecting layers as the normalizing effect gets propagated through the network, hence why the term self-normalizing. When the mapping is applied iteratively, it should draw the mean and variance values closer to a fixed point within <math display="inline">\Omega</math>, the value of which depends on <math display="inline">\omega</math> and <math display="inline">\tau</math> (recall that they are from the weight vector).<br />
<br />
We will design a FNN then construct a g that takes the mean and variance of each layer to those of the next and is a contraction mapping i.e. <math>g(\mu_i, \nu_i) = (\mu_{i+1}, \nu_{i+1}) \forall i </math>. It should be noted that although the g required in the SNN definition depends on <math display="inline">(\omega, \tau)</math> of an individual layer, the FNN that we construct will have the same values of <math display="inline">(\omega, \tau)</math> for each layer. Intuitively this definition can be interpreted as saying that the mean and variance of the final layer of a sufficiently deep SNN will not change when the mean and variance of the input data change. This is because the mean and variance are passing through a contraction mapping at each layer, converging to the mapping's fixed point.<br />
<br />
The activation function that makes an SNN possible should meet the following four conditions:<br />
<br />
# It can take on both negative and positive values, so it can normalize the mean;<br />
# It has a saturation region, so it can dampen variances that are too large;<br />
# It has a slope larger than one, so it can increase variances that are too small; and<br />
# It is a continuous curve, which is necessary for the fixed point to exist (see the definition of Banach fixed point theorem to follow).<br />
<br />
Commonly used activation functions such as rectified linear units (ReLU), sigmoid, tanh, leaky ReLUs and exponential linear units (ELUs) do not meet all four criteria, therefore, a new activation function is needed.<br />
<br />
===Scaled Exponential Linear Units (SELUs)===<br />
<br />
One of the main ideas introduced in this paper is the SELU function. As the name suggests, it is closely related to ELU [3],<br />
<br />
\[ \mathrm{elu}(x) = \begin{cases} x & x > 0 \\<br />
\alpha e^x - \alpha & x \leqslant 0<br />
\end{cases} \]<br />
<br />
but further builds upon it by introducing a new scale parameter $\lambda$ and proving the exact values that $\alpha$ and $\lambda$ should take on to achieve self-normalization. SELU is defined as:<br />
<br />
\[ \mathrm{selu}(x) = \lambda \begin{cases} x & x > 0 \\<br />
\alpha e^x - \alpha & x \leqslant 0<br />
\end{cases} \]<br />
<br />
SELUs meet all four criteria listed above - it takes on positive values when <math display="inline">x > 0</math> and negative values when <math display="inline">x < 0</math>, it has a saturation region when <math display="inline">x</math> is a larger negative value, the value of <math display="inline">\lambda</math> can be set to greater than one to ensure a slope greater than one, and it is continuous at <math display="inline">x = 0</math>. <br />
<br />
Figure 1 below gives an intuition for how SELUs normalize activations across layers. As shown, a variance dampening effect occurs when inputs are negative and far away from zero, and a variance increasing effect occurs when inputs are close to zero.<br />
<br />
[[File:snnf1.png|500px]]<br />
<br />
Figure 2 below plots the progression of training error on the MNIST and CIFAR10 datasets when training with SNNs versus FNNs with batch normalization at varying model depths. As shown, FNNs that adopted the SELU activation function exhibited lower and less variable training loss compared to using batch normalization, even as the depth increased to 16 and 32 layers.<br />
<br />
[[File:snnf2.png|600px]]<br />
<br />
=== Banach Fixed Point Theorem and Contraction Mappings ===<br />
<br />
The underlying theory behind SNNs is the Banach fixed point theorem, which states the following: ''Let <math display="inline">(X, d)</math> be a non-empty complete metric space with a contraction mapping <math display="inline">f: X \rightarrow X</math>. Then <math display="inline">f</math> has a unique fixed point <math display="inline">x_f \subseteq X</math> with <math display="inline">f(x_f) = x_f</math>. Every sequence <math display="inline">x_n = f(x_{n-1})</math> with starting element <math display="inline">x_0 \subseteq X</math> converges to the fixed point: <math display="inline">x_n \underset{n \rightarrow \infty}\rightarrow x_f</math>.''<br />
<br />
A contraction mapping is a function <math display="inline">f: X \rightarrow X</math> on a metric space <math display="inline">X</math> with distance <math display="inline">d</math>, such that for all points <math display="inline">\mathbf{u}</math> and <math display="inline">\mathbf{v}</math> in <math display="inline">X</math>: <math display="inline">d(f(\mathbf{u}), f(\mathbf{v})) \leqslant \delta d(\mathbf{u}, \mathbf{v})</math>, for a <math display="inline">0 \leqslant \delta \leqslant 1</math>.<br />
<br />
The easiest way to prove a contraction mapping is usually to show that the spectral norm [12] of its Jacobian is less than 1 [13], as was done for this paper.<br />
<br />
==Proving the Self-Normalizing Property==<br />
<br />
===Mean and Variance Mapping Function===<br />
<br />
<math display="inline">g</math> is derived under the assumption that <math display="inline">x_i</math>'s are independent but not necessarily having the same mean and variance [[#Footnotes |(2)]]. Under this assumption (and recalling earlier notation of <math display="inline">\omega</math> and <math display="inline">\tau</math>),<br />
<br />
\begin{align}<br />
\mathrm{E}(z = \mathbf{w}^T \mathbf{x}) = \sum_{i = 1}^n w_i \mathrm{E}(x_i) = \mu \omega<br />
\end{align}<br />
<br />
\begin{align}<br />
\mathrm{Var}(z) = \mathrm{Var}(\sum_{i = 1}^n w_i x_i) = \sum_{i = 1}^n w_i^2 \mathrm{Var}(x_i) = \nu \sum_{i = 1}^n w_i^2 = \nu\tau \textrm{ .}<br />
\end{align}<br />
<br />
When the weight terms are normalized, <math display="inline">z</math> can be viewed as a weighted sum of <math display="inline">x_i</math>'s. Wide neural net layers with a large number of nodes is common, so <math display="inline">n</math> is usually large, and by the Central Limit Theorem, <math display="inline">z</math> approaches a normal distribution <math display="inline">\mathcal{N}(\mu\omega, \sqrt{\nu\tau})</math>. <br />
<br />
Using the above property, the exact form for <math display="inline">g</math> can be obtained using the definitions for mean and variance of continuous random variables: <br />
<br />
[[File:gmapping.png|600px|center]]<br />
<br />
Analytical solutions for the integrals can be obtained as follows: <br />
<br />
[[File:gintegral.png|600px|center]]<br />
<br />
The authors are interested in the fixed point <math display="inline">(\mu, \nu) = (0, 1)</math> as these are the parameters associated with the common standard normal distribution. The authors also proposed using normalized weights such that <math display="inline">\omega = \sum_{i = 1}^n = 0</math> and <math display="inline">\tau = \sum_{i = 1}^n w_i^2= 1</math> as it gives a simpler, cleaner expression for <math display="inline">\widetilde{\mu}</math> and <math display="inline">\widetilde{\nu}</math> in the calculations in the next steps. This weight scheme can be achieved in several ways, for example, by drawing from a normal distribution <math display="inline">\mathcal{N}(0, \frac{1}{n})</math> or from a uniform distribution <math display="inline">U(-\sqrt{3}, \sqrt{3})</math>.<br />
<br />
At <math display="inline">\widetilde{\mu} = \mu = 0</math>, <math display="inline">\widetilde{\nu} = \nu = 1</math>, <math display="inline">\omega = 0</math> and <math display="inline">\tau = 1</math>, the constants <math display="inline">\lambda</math> and <math display="inline">\alpha</math> from the SELU function can be solved for - <math display="inline">\lambda_{01} \approx 1.0507</math> and <math display="inline">\alpha_{01} \approx 1.6733</math>. These values are used throughout the rest of the paper whenever an expression calls for <math display="inline">\lambda</math> and <math display="inline">\alpha</math>.<br />
<br />
===Details of Moment-Mapping Integrals ===<br />
Consider the moment-mapping integrals:<br />
\begin{align}<br />
\widetilde{\mu} & = \int_{-\infty}^\infty \mathrm{selu} (z) p_N(z; \mu \omega, \sqrt{\nu \tau})dz\\<br />
\widetilde{\nu} & = \int_{-\infty}^\infty \mathrm{selu} (z)^2 p_N(z; \mu \omega, \sqrt{\nu \tau}) dz-\widetilde{\mu}^2.<br />
\end{align}<br />
<br />
The equation for <math display="inline">\widetilde{\mu}</math> can be expanded as <br />
\begin{align}<br />
\widetilde{\mu} & = \frac{\lambda}{2}\left( 2\alpha\int_{-\infty}^0 (\exp(z)-1) p_N(z; \mu \omega, \sqrt{\nu \tau})dz +2\int_{0}^\infty z p_N(z; \mu \omega, \sqrt{\nu \tau})dz \right)\\<br />
&= \frac{\lambda}{2}\left( 2 \alpha \frac{1}{\sqrt{2\pi\tau\nu}} \int_{-\infty}^0 (\exp(z)-1) \exp(\frac{-1}{2\tau \nu} (z-\mu \omega)^2 ) dz +2\frac{1}{\sqrt{2\pi\tau\nu}}\int_{0}^\infty z \exp(\frac{-1}{2\tau \nu} (z-\mu \omega)^2 dz \right)\\<br />
&= \frac{\lambda}{2}\left( 2 \alpha\frac{1}{\sqrt{2\pi\tau\nu}}\int_{-\infty}^0 \exp(z) \exp(\frac{-1}{2\tau \nu} (z-\mu \omega)^2 ) dz - 2 \alpha\frac{1}{\sqrt{2\pi\tau\nu}}\int_{-\infty}^0 \exp(\frac{-1}{2\tau \nu} (z-\mu \omega)^2 ) dz +2\frac{1}{\sqrt{2\pi\tau\nu}}\int_{0}^\infty z \exp(\frac{-1}{2\tau \nu} (z-\mu \omega)^2 dz \right)\\<br />
\end{align}<br />
<br />
The first integral can be simplified via the substituiton<br />
\begin{align}<br />
q:= \frac{1}{\sqrt{2\tau \nu}}(z-\mu \omega -\tau \nu).<br />
\end{align}<br />
While the second and third can be simplified via the substitution<br />
\begin{align}<br />
q:= \frac{1}{\sqrt{2\tau \nu}}(z-\mu \omega ).<br />
\end{align}<br />
Using the definitions of <math display="inline">\mathrm{erf}</math> and <math display="inline">\mathrm{erfc}</math> then yields the result of the previous section.<br />
<br />
===Self-Normalizing Property Under Normalized Weights===<br />
<br />
Assuming the the weights normalized with <math display="inline">\omega=0</math> and <math display="inline">\tau=1</math>, it is possible to calculate the exact value for the spectral norm [12] of <math display="inline">g</math>'s Jacobian around the fixed point <math display="inline">(\mu, \nu) = (0, 1)</math>, which turns out to be <math display="inline">0.7877</math>. Thus, at initialization, SNNs have a stable and attracting fixed point at <math display="inline">(0, 1)</math>, which means that when <math display="inline">g</math> is applied iteratively to a pair <math display="inline">(\mu_{new}, \nu_{new})</math>, it should draw the points closer to <math display="inline">(0, 1)</math>. The rate of convergence is determined by the spectral norm [12], whose value depends on <math display="inline">\mu</math>, <math display="inline">\nu</math>, <math display="inline">\omega</math> and <math display="inline">\tau</math>.<br />
<br />
[[File:paper10_fig2.png|600px|frame|none|alt=Alt text|The figure illustrates, in the scenario described above, the mapping <math display="inline">g</math> of mean and variance <math display="inline">(\mu, \nu)</math> to <math display="inline">(\mu_{new}, \nu_{new})</math>. The arrows show the direction <math display="inline">(\mu, \nu)</math> is mapped by <math display="inline">g: (\mu, \nu)\mapsto(\mu_{new}, \nu_{new})</math>. One can clearly see the fixed point mapping <math display="inline">g</math> is at <math display="inline">(0, 1)</math>.]]<br />
<br />
===Self-Normalizing Property Under Unnormalized Weights===<br />
<br />
As weights are updated during training, there is no guarantee that they would remain normalized. The authors addressed this issue through the first key theorem presented in the paper, which states that a fixed point close to (0, 1) can still be obtained if <math display="inline">\mu</math>, <math display="inline">\nu</math>, <math display="inline">\omega</math> and <math display="inline">\tau</math> are restricted to a specified range. <br />
<br />
Additionally, there is no guarantee that the mean and variance of the inputs would stay within the range given by the first theorem, which led to the development of theorems #2 and #3. These two theorems established an upper and lower bound on the variance of inputs if the variance of activations from the previous layer are above or below the range specified, respectively. This ensures that the variance would not explode or vanish after being propagated through the network.<br />
<br />
The theorems come with lengthy proofs in the supplementary materials for the paper. High-level proof sketches are presented here.<br />
<br />
====Theorem 1: Stable and Attracting Fixed Points Close to (0, 1)====<br />
<br />
'''Definition:''' We assume <math display="inline">\alpha = \alpha_{01}</math> and <math display="inline">\lambda = \lambda_{01}</math>. We restrict the range of the variables to the domain <math display="inline">\mu \in [-0.1, 0.1]</math>, <math display="inline">\omega \in [-0.1, 0.1]</math>, <math display="inline">\nu \in [0.8, 1.5]</math>, and <math display="inline">\tau \in [0.9, 1.1]</math>. For <math display="inline">\omega = 0</math> and <math display="inline">\tau = 1</math>, the mapping has the stable fixed point <math display="inline">(\mu, \nu) = (0, 1</math>. For other <math display="inline">\omega</math> and <math display="inline">\tau</math>, g has a stable and attracting fixed point depending on <math display="inline">(\omega, \tau)</math> in the <math display="inline">(\mu, \nu)</math>-domain: <math display="inline">\mu \in [-0.03106, 0.06773]</math> and <math display="inline">\nu \in [0.80009, 1.48617]</math>. All points within the <math display="inline">(\mu, \nu)</math>-domain converge when iteratively applying the mapping to this fixed point.<br />
<br />
'''Proof:''' In order to show the the mapping <math display="inline">g</math> has a stable and attracting fixed point close to <math display="inline">(0, 1)</math>, the authors again applied Banach's fixed point theorem, which states that a contraction mapping on a nonempty complete metric space that does not map outside its domain has a unique fixed point, and that all points in the <math display="inline">(\mu, \nu)</math>-domain converge to the fixed point when <math display="inline">g</math> is iteratively applied. <br />
<br />
The two requirements are proven as follows:<br />
<br />
'''1. g is a contraction mapping.'''<br />
<br />
For <math display="inline">g</math> to be a contraction mapping in <math display="inline">\Omega</math> with distance <math display="inline">||\cdot||_2</math>, there must exist a Lipschitz constant <math display="inline">M < 1</math> such that: <br />
<br />
\begin{align} <br />
\forall \mu, \nu \in \Omega: ||g(\mu) - g(\nu)||_2 \leqslant M||\mu - \nu||_2 <br />
\end{align}<br />
<br />
As stated earlier, <math display="inline">g</math> is a contraction mapping if the spectral norm [12] of the Jacobian <math display="inline">\mathcal{H}</math> [[#Footnotes | (3)]] is below one, or equivalently, if the the largest singular value of <math display="inline">\mathcal{H}</math> is less than 1.<br />
<br />
To find the singular values of <math display="inline">\mathcal{H}</math>, the authors used an explicit formula derived by Blinn [2] for <math display="inline">2\times2</math> matrices, which states that the largest singular value of the matrix is <math display="inline">\frac{1}{2}(\sqrt{(a_{11} + a_{22}) ^ 2 + (a_{21} - a{12})^2} + \sqrt{(a_{11} - a_{22}) ^ 2 + (a_{21} + a{12})^2})</math>.<br />
<br />
For <math display="inline">\mathcal{H}</math>, an expression for the largest singular value of <math display="inline">\mathcal{H}</math>, made up of the first-order partial derivatives of the mapping <math display="inline">g</math> with respect to <math display="inline">\mu</math> and <math display="inline">\nu</math>, can be derived given the analytical solutions for <math display="inline">\widetilde{\mu}</math> and <math display="inline">\widetilde{\nu}</math> (and denoted <math display="inline">S(\mu, \omega, \nu, \tau, \lambda, \alpha)</math>).<br />
<br />
From the mean value theorem, we know that for a <math display="inline">t \in [0, 1]</math>, <br />
<br />
[[File:seq.png|600px|center]]<br />
<br />
Therefore, the distance of the singular value at <math display="inline">S(\mu, \omega, \nu, \tau, \lambda_{\mathrm{01}}, \alpha_{\mathrm{01}})</math> and at <math display="inline">S(\mu + \Delta\mu, \omega + \Delta\omega, \nu + \Delta\nu, \tau \Delta\tau, \lambda_{\mathrm{01}}, \alpha_{\mathrm{01}})</math> can be bounded above by <br />
<br />
[[File:seq2.png|600px|center]]<br />
<br />
An upper bound was obtained for each partial derivative term above, mainly through algebraic reformulations and by making use of the fact that many of the functions are monotonically increasing or decreasing on the variables they depend on in <math display="inline">\Omega</math> (see pages 17 - 25 in the supplementary materials).<br />
<br />
The <math display="inline">\Delta</math> terms were then set (rather arbitrarily) to be: <math display="inline">\Delta \mu=0.0068097371</math>,<br />
<math display="inline">\Delta \omega=0.0008292885</math>, <math display="inline">\Delta \nu=0.0009580840</math>, and <math display="inline">\Delta \tau=0.0007323095</math>. Plugging in the upper bounds on the absolute values of the derivative terms for <math display="inline">S</math> and the <math display="inline">\Delta</math> terms yields<br />
<br />
\[ S(\mu + \Delta \mu,\omega + \Delta \omega,\nu + \Delta \nu,\tau + \Delta \tau,\lambda_{\rm 01},\alpha_{\rm 01}) - S(\mu,\omega,\nu,\tau,\lambda_{\rm 01},\alpha_{\rm 01}) < 0.008747 \]<br />
<br />
Next, the largest singular value is found from a computer-assisted fine grid-search [[#Footnotes | (1)]] over the domain <math display="inline">\Omega</math>, with grid lengths <math display="inline">\Delta \mu=0.0068097371</math>, <math display="inline">\Delta \omega=0.0008292885</math>, <math display="inline">\Delta \nu=0.0009580840</math>, and <math display="inline">\Delta \tau=0.0007323095</math>, which turned out to be <math display="inline">0.9912524171058772</math>. Therefore, <br />
<br />
\[ S(\mu + \Delta \mu,\omega + \Delta \omega,\nu + \Delta \nu,\tau + \Delta \tau,\lambda_{\rm 01},\alpha_{\rm 01}) \leq 0.9912524171058772 + 0.008747 < 1 \]<br />
<br />
Since the largest singular value is smaller than 1, <math display="inline>g</math> is a contraction mapping.<br />
<br />
'''2. g does not map outside its domain.'''<br />
<br />
To prove that <math display="inline">g</math> does not map outside of the domain <math display="inline">\mu \in [-0.1, 0.1]</math> and <math display="inline">\nu \in [0.8, 1.5]</math>, lower and upper bounds on <math display="inline">\widetilde{\mu}</math> and <math display="inline">\widetilde{\nu}</math> were obtained to show that they stay within <math display="inline">\Omega</math>. <br />
<br />
First, it was shown that the derivatives of <math display="inline">\widetilde{\mu}</math> and <math display="inline">\widetilde{\xi}</math> with respect to <math display="inline">\mu</math> and <math display="inline">\nu</math> are either positive or have the sign of <math display="inline">\omega</math> in <math display="inline">\Omega</math>, so the minimum and maximum points are found at the borders. In <math display="inline">\Omega</math>, it then follows that<br />
<br />
\begin{align}<br />
-0.03106 <\widetilde{\mu}(-0.1,0.1, 0.8, 0.95, \lambda_{\rm 01}, \alpha_{\rm 01}) \leq & \widetilde{\mu} \leq \widetilde{\mu}(0.1,0.1,1.5, 1.1, \lambda_{\rm 01}, \alpha_{\rm 01}) < 0.06773<br />
\end{align}<br />
<br />
and <br />
<br />
\begin{align}<br />
0.80467 <\widetilde{\xi}(-0.1,0.1, 0.8, 0.95, \lambda_{\rm 01}, \alpha_{\rm 01}) \leq & \widetilde{\xi} \leq \widetilde{\xi}(0.1,0.1,1.5, 1.1, \lambda_{\rm 01}, \alpha_{\rm 01}) < 1.48617.<br />
\end{align}<br />
<br />
Since <math display="inline">\widetilde{\nu} = \widetilde{\xi} - \widetilde{\mu}^2</math>, <br />
<br />
\begin{align}<br />
0.80009 & \leqslant \widetilde{\nu} \leqslant 1.48617<br />
\end{align}<br />
<br />
The bounds on <math display="inline">\widetilde{\mu}</math> and <math display="inline">\widetilde{\nu}</math> are narrower than those for <math display="inline">\mu</math> and <math display="inline">\nu</math> set out in <math display="inline">\Omega</math>, therefore <math display="inline">g(\Omega) \subseteq \Omega</math>.<br />
<br />
==== Theorem 2: Decreasing Variance from Above ====<br />
<br />
'''Definition:''' For <math display="inline">\lambda = \lambda_{01}</math>, <math display="inline">\alpha = \alpha_{01}</math>, and the domain <math display="inline">\Omega^+: -1 \leqslant \mu \leqslant 1, -0.1 \leqslant \omega \leqslant 0.1, 3 \leqslant \nu \leqslant 16</math>, and <math display="inline">0.8 \leqslant \tau \leqslant 1.25</math>, we have for the mapping of the variance <math display="inline">\widetilde{\nu}(\mu, \omega, \nu, \tau, \lambda, \alpha)</math> under <math display="inline">g</math>: <math display="inline">\widetilde{\nu}(\mu, \omega, \nu, \tau, \lambda, \alpha) < \nu</math>.<br />
<br />
Theorem 2 states that when <math display="inline">\nu \in [3, 16]</math>, the mapping <math display="inline">g</math> draws it to below 3 when applied across layers, thereby establishing an upper bound of <math display="inline">\nu < 3</math> on variance.<br />
<br />
'''Proof:''' The authors proved the inequality by showing that <math display="inline">g(\mu, \omega, \xi, \tau, \lambda_{01}, \alpha_{01}) = \widetilde{\xi}(\mu, \omega, \xi, \tau, \lambda_{01}, \alpha_{01}) - \nu < 0</math>, since the second moment should be greater than or equal to variance <math display="inline">\widetilde{\nu}</math>. The behavior of <math display="inline">\frac{\partial }{\partial \mu } \widetilde{\xi}(\mu, \omega, \nu, \tau, \lambda, \alpha)</math>, <math display="inline">\frac{\partial }{\partial \omega } \widetilde{\xi}(\mu, \omega, \nu, \tau, \lambda, \alpha)</math>, <math display="inline">\frac{\partial }{\partial \nu } \widetilde{\xi}(\mu, \omega, \nu, \tau, \lambda, \alpha)</math>, and <math display="inline">\frac{\partial }{\partial \tau } \widetilde{\xi}(\mu, \omega, \nu, \tau, \lambda, \alpha)</math> are used to find the bounds on <math display="inline">g(\mu, \omega, \xi, \tau, \lambda_{01}, \alpha_{01})</math> (see pages 9 - 13 in the supplementary materials). Again, the partial derivative terms were monotonic, which made it possible to find the upper bound at the board values. It was shown that the maximum value of <math display="inline">g</math> does not exceed <math display="inline">-0.0180173</math>.<br />
<br />
==== Theorem 3: Increasing Variance from Below ====<br />
<br />
'''Definition''': We consider <math display="inline">\lambda = \lambda_{01}</math>, <math display="inline">\alpha = \alpha_{01}</math>, and the domain <math display="inline">\Omega^-: -0.1 \leqslant \mu \leqslant 0.1</math> and <math display="inline">-0.1 \leqslant \omega \leqslant 0.1</math>. For the domain <math display="inline">0.02 \leqslant \nu \leqslant 0.16</math> and <math display="inline">0.8 \leqslant \tau \leqslant 1.25</math> as well as for the domain <math display="inline">0.02 \leqslant \nu \leqslant 0.24</math> and <math display="inline">0.9 \leqslant \tau \leqslant 1.25</math>, the mapping of the variance <math display="inline">\widetilde{\nu}(\mu, \omega, \nu, \tau, \lambda, \alpha)</math> increases: <math display="inline">\widetilde{\nu}(\mu, \omega, \nu, \tau, \lambda, \alpha) > \nu</math>.<br />
<br />
Theorem 3 states that the variance <math display="inline">\widetilde{\nu}</math> increases when variance is smaller than in <math display="inline">\Omega</math>. The lower bound on variance is <math display="inline">\widetilde{\nu} > 0.16</math> when <math display="inline">0.8 \leqslant \tau</math> and <math display="inline">\widetilde{\nu} > 0.24</math> when <math display="inline">0.9 \leqslant \tau</math> under the proposed mapping.<br />
<br />
'''Proof:''' According to the mean value theorem, for a <math display="inline">t \in [0, 1]</math>,<br />
<br />
[[File:th3.png|700px|center]]<br />
<br />
Similar to the proof for Theorem 2 (except we are interested in the smallest <math display="inline">\widetilde{\nu}</math> instead of the biggest), the lower bound for <math display="inline">\frac{\partial }{\partial \nu} \widetilde{\xi}(\mu,\omega,\nu+t(\nu_{\mathrm{min}}-\nu),\tau,\lambda_{\rm 01},\alpha_{\rm 01})</math> can be derived, and substituted into the relationship <math display="inline">\widetilde{\nu} = \widetilde{\xi}(\mu,\omega,\nu,\tau,\lambda_{\rm 01},\alpha_{\rm 01}) - (\widetilde{\mu}(\mu,\omega,\nu,\tau,\lambda_{\rm 01},\alpha_{\rm 01}))^2</math>. The lower bound depends on <math display="inline">\tau</math> and <math display="inline">\nu</math>, and in the <math display="inline">\Omega^{-1}</math> listed, it is slightly above <math display="inline">\nu</math>.<br />
<br />
== Implementation Details ==<br />
<br />
=== Initialization ===<br />
<br />
Since SNNs have a fixed point at zero mean and unit variance for normalized weights <math>ω = \sum_{i = 1}^n w_i = 0 </math> and <math>ω = \sum_{i = 1}^n w_i^2 = 1 </math> , the authors initialize SNNs such that these constraints are fulfilled in expectation. They draw the weights from a Gaussian distribution with <math> E(wi) = 0 </math> and variance <math> Var(w_i) = 1/n </math>.<br />
<br />
=== Dropout Technique ===<br />
<br />
The authors reason that regular dropout, randomly setting activations to 0 with probability <math display="inline">1 - q</math>, is not compatible with SELUs. This is because the low variance region in SELUs is at <math display="inline">\lim_{x \rightarrow -\infty} = -\lambda \alpha</math>, not 0. Contrast this with ReLUs, which work well with dropout since they have <math display="inline">\lim_{x \rightarrow -\infty} = 0</math> as the saturation region. Therefore, a new dropout technique for SELUs was needed, termed ''alpha dropout''.<br />
<br />
With alpha dropout, activations are randomly set to <math display="inline">-\lambda\alpha = \alpha'</math>, which for this paper corresponds to the constant <math display="inline">1.7581</math>, with probability <math display="inline">1 - q</math>.<br />
<br />
The updated mean and variance of the activations are now:<br />
\[ \mathrm{E}(xd + \alpha'(1 - d)) = \mu q + \alpha'(1 - q) \] <br />
<br />
and<br />
<br />
\[ \mathrm{Var}(xd + \alpha'(1 - d)) = q((1-q)(\alpha' - \mu)^2 + \nu) \]<br />
<br />
Activations need to be transformed (e.g. scaled) after dropout to maintain the same mean and variance. In regular dropout, conserving the mean and variance correlates to scaling activations by a factor of 1/q while training. To ensure that mean and variance are unchanged after alpha dropout, the authors used an affine transformation <math display="inline">a(xd + \alpha'(1 - d)) + b</math>, and solved for the values of <math display="inline">a</math> and <math display="inline">b</math> to give <math display="inline">a = (\frac{\nu}{q((1-q)(\alpha' - \mu)^2 + \nu)})^{\frac{1}{2}}</math> and <math display="inline">b = \mu - a(q\mu + (1-q)\alpha'))</math>. As the values for <math display="inline">\mu</math> and <math display="inline">\nu</math> are set to <math display="inline">0</math> and <math display="inline">1</math> throughout the paper, these expressions can be simplified into <math display="inline">a = (q + \alpha'^2 q(1-q))^{-\frac{1}{2}}</math> and <math display="inline">b = -(q + \alpha^2 q (1-q))^{-\frac{1}{2}}((1 - q)\alpha')</math>, where <math display="inline">\alpha' \approx 1.7581</math>.<br />
<br />
Empirically, the authors found that dropout rates (1-q) of <math display="inline">0.05</math> or <math display="inline">0.10</math> worked well with SNNs.<br />
<br />
=== Optimizers ===<br />
<br />
Through experiments, the authors found that stochastic gradient descent, momentum, Adadelta and Adamax work well on SNNs. For Adam, configuration parameters <math display="inline">\beta_2 = 0.99</math> and <math display="inline">\epsilon = 0.01</math> were found to be more effective.<br />
<br />
==Experimental Results==<br />
<br />
Three sets of experiments were conducted to compare the performance of SNNs to six other FNN structures and to other machine learning algorithms, such as support vector machines and random forests. The experiments were carried out on (1) 121 UCI Machine Learning Repository datasets, (2) the Tox21 chemical compounds toxicity effects dataset (with 12,000 compounds and 270,000 features), and (3) the HTRU2 dataset of statistics on radio wave signals from pulsar candidates (with 18,000 observations and eight features). In each set of experiment, hyperparameter search was conducted on a validation set to select parameters such as the number of hidden units, number of hidden layers, learning rate, regularization parameter, and dropout rate (see pages 95 - 107 of the supplementary material for exact hyperparameters considered). Whenever models of different setups gave identical results on the validation data, preference was given to the structure with more layers, lower learning rate and higher dropout rate.<br />
<br />
The six FNN structures considered were: (1) FNNs with ReLU activations, no normalization and “Microsoft weight initialization” (MSRA) [5] to control the variance of input signals [5]; (2) FNNs with batch normalization [6], in which normalization is applied to activations of the same mini-batch; (3) FNNs with layer normalization [1], in which normalization is applied on a per layer basis for each training example; (4) FNNs with weight normalization [8], whereby each layer’s weights are normalized by learning the weight’s magnitude and direction instead of the weight vector itself; (5) highway networks, in which layers are not restricted to being sequentially connected [9]; and (6) an FNN-version of residual networks [4], with residual blocks made up of two or three densely connected layers.<br />
<br />
On the Tox21 dataset, the authors demonstrated the self-normalizing effect by comparing the distribution of neural inputs <math display="inline">z</math> at initialization and after 40 epochs of training to that of the standard normal. As Figure 3 show, the distribution of <math display="inline">z</math> remained similar to a normal distribution.<br />
<br />
[[File:snnf3.png|600px]]<br />
<br />
On all three sets of classification tasks, the authors demonstrated that SNN outperformed the other FNN counterparts on accuracy and AUC measures. <br />
<br />
SNN achieves close to the state-of-the-art results on the Tox21 dataset with an 8-layer network. The challenge requires prediction of toxic effects of 12000 chemicals based on their chemical structures. SNN with 8 layers had the best performance.<br />
<br />
[[File:tox21.png|600px]]<br />
<br />
<br />
On UCI datasets with fewer than 1,000 observations, SNNs did not outperform SVMs or random forests in terms of average rank in accuracy, but on datasets with at least 1,000 observations, SNNs showed the best overall performance (average rank of 5.8, compared to 6.1 for support vector machines and 6.6 for random forests). Through hyperparameter tuning, it was also discovered that the average depth of FNNs is 10.8 layers, more than the other FNN architectures tried.<br />
<br />
<br />
<br />
In pulsar predictions with the HTRU dataset SNNs produced a new state-of-the-art AUC by a small margin (achieving an AUC 0.98, averaged over 10 cross-validation folds, versus the previous record of 0.976).<br />
[[File:htru2.png|600px]]<br />
<br />
==Future Work==<br />
<br />
Although not the focus of this paper, the authors also briefly noted that their initial experiments with applying SELUs on relatively simple CNN structures showed promising results, which is not surprising given that ELUs, which do not have the self-normalizing property, has already been shown to work well with CNNs, demonstrating faster convergence than ReLU networks and even pushing the state-of-the-art error rates on CIFAR-100 at the time of publishing in 2015 [3].<br />
<br />
Since the paper was published, SELUs have been adopted by several researchers, not just with FNNs [https://github.com/bioinf-jku/SNNs see link], but also with CNNs, GANs, autoencoders, reinforcement learning and RNNs. In a few cases, researchers for those papers concluded that networks trained with SELUs converged faster than those trained with ReLUs, and that SELUs have the same convergence quality as batch normalization. There is potential for SELUs to be incorporated into more architectures in the future.<br />
<br />
==Critique==<br />
<br />
Overall, the authors presented a convincing case for using SELUs (along with proper initialization and alpha dropout) on FNNs. FNNs trained with SELU have more layers than those with other normalization techniques, so the work here provides a promising direction for making traditional FNNs more powerful. There are not as many well-established benchmark datasets to evaluate FNNs, but the experiments carried out, particularly on the larger Tox21 dataset, showed that SNNs can be very effective at classification tasks.<br />
<br />
The proofs provide a satisfactory explanation for why SELUs have a self-normalising property within the specified domain, but during their introduction the authors give 4 criteria that an activation function must satisfy in order to be self-normalising. Those criteria make intuitive sense, but there is a lack of firm justification which creates some confusion. For example, they state that SNNs cannot be derived from tanh units, even though <math> tanh(\lambda x) </math> satisfies all 4 criteria if <math> \lambda </math> is larger than 1. Assuming the authors did not overlook such a simple modification, there must be some additional criteria for an activation function to have a self-normalising property.<br />
<br />
The only question I have with the proofs is the lack of explanation for how the domains, <math display="inline">\Omega</math>, <math display="inline">\Omega^-</math> and <math display="inline">\Omega^+</math> are determined, which is an important consideration because they are used for deriving the upper and lower bounds on expressions needed for proving the three theorems. The ranges appear somewhat set through trial-and-error and heuristics to ensure the numbers work out (e.g. make the spectral norm [12] of <math display="inline">\mathcal{J}</math> as large as can be below 1 so as to ensure <math display="inline">g</math> is a contraction mapping), so it is not clear whether they are unique conditions, or that the parameters will remain within those prespecified ranges throughout training; and if the parameters can stray away from the ranges provided, then the issue of what will happen to the self-normalizing property was not addressed. Perhaps that is why the authors gave preference to models with a deeper structure and smaller learning rate during experiments to help the parameters stay within their domains. Further, in addition to the hyperparameters considered, it would be helpful to know the final values that went into the best-performing models, for a better understanding of what range of values work better for SNNs empirically.<br />
<br />
==Conclusion==<br />
<br />
The SNN structure proposed in this paper is built on the traditional FNN structure with a few modifications, including the use of SELUs as the activation function (with <math display="inline">\lambda \approx 1.0507</math> and <math display="inline">\alpha \approx 1.6733</math>), alpha dropout, network weights initialized with mean of zero and variance <math display="inline">\frac{1}{n}</math>, and inputs normalized to mean of zero and variance of one. It is simple to implement while being backed up by detailed theory. <br />
<br />
When properly initialized, SELUs will draw neural inputs towards a fixed point of zero mean and unit variance as the activations are propagated through the layers. The self-normalizing property is maintained even when weights deviate from their initial values during training (under mild conditions). When the variance of inputs goes beyond the prespecified range imposed, they are still bounded above and below so SNNs do not suffer from exploding and vanishing gradients. This self-normalizing property allows SNNs to be more robust to perturbations in stochastic gradient descent, so deeper structures with better prediction performance can be built. <br />
<br />
In the experiments conducted, the authors demonstrated that SNNs outperformed FNNs trained with other normalization techniques, such as batch, layer and weight normalization, and specialized architectures, such as highway or residual networks, on several classification tasks, including on the UCI Machine Learning Repository datasets. SELUs help in reducing the computation time for normalizing the network relative to RELU+BN and hence is promising. The adoption of SELUs by other researchers also lends credence to the potential for SELUs to be implemented in more neural network architectures.<br />
<br />
==References==<br />
<br />
# Ba, Kiros and Hinton. "Layer Normalization". arXiv:1607.06450. (2016).<br />
# Blinn. "Consider the Lowly 2X2 Matrix." IEEE Computer Graphics and Applications. (1996).<br />
# Clevert, Unterthiner, Hochreiter. "Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)." arXiv: 1511.07289. (2015).<br />
# He, Zhang, Ren and Sun. "Deep Residual Learning for Image Recognition." arXiv:1512.03385. (2015).<br />
# He, Zhang, Ren and Sun. "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification." arXiv:1502.01852. (2015). <br />
# Ioffe and Szegedy. "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariance Shift." arXiv:1502.03167. (2015).<br />
# Klambauer, Unterthiner, Mayr and Hochreiter. "Self-Normalizing Neural Networks." arXiv: 1706.02515. (2017).<br />
# Salimans and Kingma. "Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks." arXiv:1602.07868. (2016).<br />
# Srivastava, Greff and Schmidhuber. "Highway Networks." arXiv:1505.00387 (2015).<br />
# Unterthiner, Mayr, Klambauer and Hochreiter. "Toxicity Prediction Using Deep Learning." arXiv:1503.01445. (2015). <br />
# https://en.wikipedia.org/wiki/Central_limit_theorem <br />
# http://mathworld.wolfram.com/SpectralNorm.html <br />
# https://www.math.umd.edu/~petersd/466/fixedpoint.pdf<br />
<br />
==Online Resources==<br />
https://github.com/bioinf-jku/SNNs (GitHub repository maintained by some of the paper's authors)<br />
<br />
==Footnotes==<br />
<br />
1. Error propagation analysis: The authors performed an error analysis to quantify the potential numerical imprecisions propagated through the numerous operations performed. The potential imprecision <math display="inline">\epsilon</math> was quantified by applying the mean value theorem<br />
<br />
\[ |f(x + \Delta x - f(x)| \leqslant ||\triangledown f(x + t\Delta x|| ||\Delta x|| \textrm{ for } t \in [0, 1]\textrm{.} \] <br />
<br />
The error propagation rules, or <math display="inline">|f(x + \Delta x - f(x)|</math>, was first obtained for simple operations such as addition, subtraction, multiplication, division, square root, exponential function, error function and complementary error function. Them, the error bounds on the compound terms making up <math display="inline">\Delta (S(\mu, \omega, \nu, \tau, \lambda, \alpha)</math> were found by decomposing them into the simpler expressions. If each of the variables have a precision of <math display="inline">\epsilon</math>, then it turns out <math display="inline">S</math> has a precision better than <math display="inline">292\epsilon</math>. For a machine with a precision of <math display="inline">2^{-56}</math>, the rounding error is <math display="inline">\epsilon \approx 10^{-16}</math>, and <math display="inline">292\epsilon < 10^{-13}</math>. In addition, all computations are correct up to 3 ulps (“unit in last place”) for the hardware architectures and GNU C library used, with 1 ulp being the highest precision that can be achieved.<br />
<br />
2. Independence Assumption: The classic definition of central limit theorem requires <math display="inline">x_i</math>’s to be independent and identically distributed, which is not guaranteed to hold true in a neural network layer. However, according to the Lyapunov CLT, the <math display="inline">x_i</math>’s do not need to be identically distributed as long as the <math display="inline">(2 + \delta)</math>th moment exists for the variables and meet the Lyapunov condition for the rate of growth of the sum of the moments [11]. In addition, CLT has also shown to be valid under weak dependence under mixing conditions [11]. Therefore, the authors argue that the central limit theorem can be applied with network inputs.<br />
<br />
3. <math display="inline">\mathcal{H}</math> versus <math display="inline">\mathcal{J}</math> Jacobians: In solving for the largest singular value of the Jacobian <math display="inline">\mathcal{H}</math> for the mapping <math display="inline">g: (\mu, \nu)</math>, the authors first worked with the terms in the Jacobian <math display="inline">\mathcal{J}</math> for the mapping <math display="inline">h: (\mu, \nu) \rightarrow (\widetilde{\mu}, \widetilde{\xi})</math> instead, because the influence of <math display="inline">\widetilde{\mu}</math> on <math display="inline">\widetilde{\nu}</math> is small when <math display="inline">\widetilde{\mu}</math> is small in <math display="inline">\Omega</math> and <math display="inline">\mathcal{H}</math> can be easily expressed as terms in <math display="inline">\mathcal{J}</math>. <math display="inline">\mathcal{J}</math> was referenced in the paper, but I used <math display="inline">\mathcal{H}</math> in the summary here to avoid confusion.</div>F7xiahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Self_Normalizing_Neural_Networks&diff=36444stat946w18/Self Normalizing Neural Networks2018-04-21T03:25:46Z<p>F7xia: /* Initialization */</p>
<hr />
<div>==Introduction and Motivation==<br />
<br />
While neural networks have been making a lot of headway in improving benchmark results and narrowing the gap with human-level performance, success has been fairly limited to visual and sequential processing tasks through advancements in convolutional network and recurrent network structures. Most data science competitions outside of these domains are still outperformed by algorithms such as gradient boosting and random forests. The traditional (densely connected) feed-forward neural networks (FNNs) are rarely used competitively, and when they do win on rare occasions, they have very shallow network architectures with just up to four layers [10].<br />
<br />
The authors, Klambauer et al., believe that what prevents FNNs from becoming more competitively useful is the inability to train a deeper FNN structure, which would allow the network to learn more levels of abstract representations. Primarily, the difficulty arises due to the instability of gradients in very deep FNNs leading to problems like gradient vanishing/explosion. To have a deeper network, oscillations in the distribution of activations need to be kept under control so that stable gradients can be obtained during training. Several techniques are available to normalize activations, including batch normalization [6], layer normalization [1] and weight normalization [8]. These methods work well with CNNs and RNNs, but not so much with FNNs because backpropagating through normalization parameters introduces additional variance to the gradients, and regularization techniques like dropout further perturb the normalization effect. CNNs and RNNs are less sensitive to such perturbations, presumably due to their weight sharing architecture, but FNNs do not have such a property and thus suffer from high variance in training errors, which hinders learning. Furthermore, the aforementioned normalization techniques involve adding external layers to the model and can slow down computation, which may already be slow when working with very deep FNNs. <br />
<br />
Therefore, the authors were motivated to develop a new FNN implementation that can achieve the intended effect of normalization techniques that works well with stochastic gradient descent and dropout. Self-normalizing neural networks (SNNs) are based on the idea of scaled exponential linear units (SELU), a new activation function introduced in this paper, whose output distribution is proved to converge to a fixed point, thus making it possible to train deeper networks.<br />
<br />
==Notations==<br />
<br />
As the paper (primarily in the supplementary materials) comes with lengthy proofs, important notations are listed first.<br />
<br />
Consider two fully-connected layers, let <math display="inline">x</math> denote the inputs to the second layer, then <math display="inline">z = Wx</math> represents the network inputs of the second layer, and <math display="inline">y = f(z)</math> represents the activations in the second layer.<br />
<br />
Assume that all activations from a lower layer <math display="inline">x_i</math>'s, <math display="inline">1 \leqslant i \leqslant n</math>, have the same mean <math display="inline">\mu := \mathrm{E}(x_i)</math> and the same variance <math display="inline">\nu := \mathrm{Var}(x_i)</math> and that each <math display="inline">y</math> has mean <math display="inline">\widetilde{\mu} := \mathrm{E}(y)</math> and variance <math display="inline">\widetilde{\nu} := \mathrm{Var}(y)</math>, then let <math display="inline">g</math> be the set of functions that maps <math display="inline">(\mu, \nu)</math> to <math display="inline">(\widetilde{\mu}, \widetilde{\nu})</math>. <br />
<br />
For the weight vector <math display="inline">w</math>, <math display="inline">n</math> times the mean of the weight vector is <math display="inline">\omega := \sum_{i = 1}^n \omega_i</math> and <math display="inline">n</math> times the second moment is <math display="inline">\tau := \sum_{i = 1}^{n} w_i^2</math>.<br />
<br />
==Key Concepts==<br />
<br />
===Self-Normalizing Neural-Net (SNN)===<br />
<br />
''A neural network is self-normalizing if it possesses a mapping <math display="inline">g: \Omega \rightarrow \Omega</math> for each activation <math display="inline">y</math> that maps mean and variance from one layer to the next and has a stable and attracting fixed point depending on <math display="inline">(\omega, \tau)</math> in <math display="inline">\Omega</math>. Furthermore, the mean and variance remain in the domain <math display="inline">\Omega</math>, that is <math display="inline">g(\Omega) \subseteq \Omega</math>, where <math display="inline">\Omega = \{ (\mu, \nu) | \mu \in [\mu_{min}, \mu_{max}], \nu \in [\nu_{min}, \nu_{max}] \}</math>. When iteratively applying the mapping <math display="inline">g</math>, each point within <math display="inline">\Omega</math> converges to this fixed point.''<br />
<br />
In other words, in SNNs, if the inputs from an earlier layer (<math display="inline">x</math>) already have their mean and variance within a predefined interval <math display="inline">\Omega</math>, then the activations to the next layer (<math display="inline">y = f(z = Wx)</math>) should remain within those intervals. This is true across all pairs of connecting layers as the normalizing effect gets propagated through the network, hence why the term self-normalizing. When the mapping is applied iteratively, it should draw the mean and variance values closer to a fixed point within <math display="inline">\Omega</math>, the value of which depends on <math display="inline">\omega</math> and <math display="inline">\tau</math> (recall that they are from the weight vector).<br />
<br />
We will design a FNN then construct a g that takes the mean and variance of each layer to those of the next and is a contraction mapping i.e. <math>g(\mu_i, \nu_i) = (\mu_{i+1}, \nu_{i+1}) \forall i </math>. It should be noted that although the g required in the SNN definition depends on <math display="inline">(\omega, \tau)</math> of an individual layer, the FNN that we construct will have the same values of <math display="inline">(\omega, \tau)</math> for each layer. Intuitively this definition can be interpreted as saying that the mean and variance of the final layer of a sufficiently deep SNN will not change when the mean and variance of the input data change. This is because the mean and variance are passing through a contraction mapping at each layer, converging to the mapping's fixed point.<br />
<br />
The activation function that makes an SNN possible should meet the following four conditions:<br />
<br />
# It can take on both negative and positive values, so it can normalize the mean;<br />
# It has a saturation region, so it can dampen variances that are too large;<br />
# It has a slope larger than one, so it can increase variances that are too small; and<br />
# It is a continuous curve, which is necessary for the fixed point to exist (see the definition of Banach fixed point theorem to follow).<br />
<br />
Commonly used activation functions such as rectified linear units (ReLU), sigmoid, tanh, leaky ReLUs and exponential linear units (ELUs) do not meet all four criteria, therefore, a new activation function is needed.<br />
<br />
===Scaled Exponential Linear Units (SELUs)===<br />
<br />
One of the main ideas introduced in this paper is the SELU function. As the name suggests, it is closely related to ELU [3],<br />
<br />
\[ \mathrm{elu}(x) = \begin{cases} x & x > 0 \\<br />
\alpha e^x - \alpha & x \leqslant 0<br />
\end{cases} \]<br />
<br />
but further builds upon it by introducing a new scale parameter $\lambda$ and proving the exact values that $\alpha$ and $\lambda$ should take on to achieve self-normalization. SELU is defined as:<br />
<br />
\[ \mathrm{selu}(x) = \lambda \begin{cases} x & x > 0 \\<br />
\alpha e^x - \alpha & x \leqslant 0<br />
\end{cases} \]<br />
<br />
SELUs meet all four criteria listed above - it takes on positive values when <math display="inline">x > 0</math> and negative values when <math display="inline">x < 0</math>, it has a saturation region when <math display="inline">x</math> is a larger negative value, the value of <math display="inline">\lambda</math> can be set to greater than one to ensure a slope greater than one, and it is continuous at <math display="inline">x = 0</math>. <br />
<br />
Figure 1 below gives an intuition for how SELUs normalize activations across layers. As shown, a variance dampening effect occurs when inputs are negative and far away from zero, and a variance increasing effect occurs when inputs are close to zero.<br />
<br />
[[File:snnf1.png|500px]]<br />
<br />
Figure 2 below plots the progression of training error on the MNIST and CIFAR10 datasets when training with SNNs versus FNNs with batch normalization at varying model depths. As shown, FNNs that adopted the SELU activation function exhibited lower and less variable training loss compared to using batch normalization, even as the depth increased to 16 and 32 layers.<br />
<br />
[[File:snnf2.png|600px]]<br />
<br />
=== Banach Fixed Point Theorem and Contraction Mappings ===<br />
<br />
The underlying theory behind SNNs is the Banach fixed point theorem, which states the following: ''Let <math display="inline">(X, d)</math> be a non-empty complete metric space with a contraction mapping <math display="inline">f: X \rightarrow X</math>. Then <math display="inline">f</math> has a unique fixed point <math display="inline">x_f \subseteq X</math> with <math display="inline">f(x_f) = x_f</math>. Every sequence <math display="inline">x_n = f(x_{n-1})</math> with starting element <math display="inline">x_0 \subseteq X</math> converges to the fixed point: <math display="inline">x_n \underset{n \rightarrow \infty}\rightarrow x_f</math>.''<br />
<br />
A contraction mapping is a function <math display="inline">f: X \rightarrow X</math> on a metric space <math display="inline">X</math> with distance <math display="inline">d</math>, such that for all points <math display="inline">\mathbf{u}</math> and <math display="inline">\mathbf{v}</math> in <math display="inline">X</math>: <math display="inline">d(f(\mathbf{u}), f(\mathbf{v})) \leqslant \delta d(\mathbf{u}, \mathbf{v})</math>, for a <math display="inline">0 \leqslant \delta \leqslant 1</math>.<br />
<br />
The easiest way to prove a contraction mapping is usually to show that the spectral norm [12] of its Jacobian is less than 1 [13], as was done for this paper.<br />
<br />
==Proving the Self-Normalizing Property==<br />
<br />
===Mean and Variance Mapping Function===<br />
<br />
<math display="inline">g</math> is derived under the assumption that <math display="inline">x_i</math>'s are independent but not necessarily having the same mean and variance [[#Footnotes |(2)]]. Under this assumption (and recalling earlier notation of <math display="inline">\omega</math> and <math display="inline">\tau</math>),<br />
<br />
\begin{align}<br />
\mathrm{E}(z = \mathbf{w}^T \mathbf{x}) = \sum_{i = 1}^n w_i \mathrm{E}(x_i) = \mu \omega<br />
\end{align}<br />
<br />
\begin{align}<br />
\mathrm{Var}(z) = \mathrm{Var}(\sum_{i = 1}^n w_i x_i) = \sum_{i = 1}^n w_i^2 \mathrm{Var}(x_i) = \nu \sum_{i = 1}^n w_i^2 = \nu\tau \textrm{ .}<br />
\end{align}<br />
<br />
When the weight terms are normalized, <math display="inline">z</math> can be viewed as a weighted sum of <math display="inline">x_i</math>'s. Wide neural net layers with a large number of nodes is common, so <math display="inline">n</math> is usually large, and by the Central Limit Theorem, <math display="inline">z</math> approaches a normal distribution <math display="inline">\mathcal{N}(\mu\omega, \sqrt{\nu\tau})</math>. <br />
<br />
Using the above property, the exact form for <math display="inline">g</math> can be obtained using the definitions for mean and variance of continuous random variables: <br />
<br />
[[File:gmapping.png|600px|center]]<br />
<br />
Analytical solutions for the integrals can be obtained as follows: <br />
<br />
[[File:gintegral.png|600px|center]]<br />
<br />
The authors are interested in the fixed point <math display="inline">(\mu, \nu) = (0, 1)</math> as these are the parameters associated with the common standard normal distribution. The authors also proposed using normalized weights such that <math display="inline">\omega = \sum_{i = 1}^n = 0</math> and <math display="inline">\tau = \sum_{i = 1}^n w_i^2= 1</math> as it gives a simpler, cleaner expression for <math display="inline">\widetilde{\mu}</math> and <math display="inline">\widetilde{\nu}</math> in the calculations in the next steps. This weight scheme can be achieved in several ways, for example, by drawing from a normal distribution <math display="inline">\mathcal{N}(0, \frac{1}{n})</math> or from a uniform distribution <math display="inline">U(-\sqrt{3}, \sqrt{3})</math>.<br />
<br />
At <math display="inline">\widetilde{\mu} = \mu = 0</math>, <math display="inline">\widetilde{\nu} = \nu = 1</math>, <math display="inline">\omega = 0</math> and <math display="inline">\tau = 1</math>, the constants <math display="inline">\lambda</math> and <math display="inline">\alpha</math> from the SELU function can be solved for - <math display="inline">\lambda_{01} \approx 1.0507</math> and <math display="inline">\alpha_{01} \approx 1.6733</math>. These values are used throughout the rest of the paper whenever an expression calls for <math display="inline">\lambda</math> and <math display="inline">\alpha</math>.<br />
<br />
===Details of Moment-Mapping Integrals ===<br />
Consider the moment-mapping integrals:<br />
\begin{align}<br />
\widetilde{\mu} & = \int_{-\infty}^\infty \mathrm{selu} (z) p_N(z; \mu \omega, \sqrt{\nu \tau})dz\\<br />
\widetilde{\nu} & = \int_{-\infty}^\infty \mathrm{selu} (z)^2 p_N(z; \mu \omega, \sqrt{\nu \tau}) dz-\widetilde{\mu}^2.<br />
\end{align}<br />
<br />
The equation for <math display="inline">\widetilde{\mu}</math> can be expanded as <br />
\begin{align}<br />
\widetilde{\mu} & = \frac{\lambda}{2}\left( 2\alpha\int_{-\infty}^0 (\exp(z)-1) p_N(z; \mu \omega, \sqrt{\nu \tau})dz +2\int_{0}^\infty z p_N(z; \mu \omega, \sqrt{\nu \tau})dz \right)\\<br />
&= \frac{\lambda}{2}\left( 2 \alpha \frac{1}{\sqrt{2\pi\tau\nu}} \int_{-\infty}^0 (\exp(z)-1) \exp(\frac{-1}{2\tau \nu} (z-\mu \omega)^2 ) dz +2\frac{1}{\sqrt{2\pi\tau\nu}}\int_{0}^\infty z \exp(\frac{-1}{2\tau \nu} (z-\mu \omega)^2 dz \right)\\<br />
&= \frac{\lambda}{2}\left( 2 \alpha\frac{1}{\sqrt{2\pi\tau\nu}}\int_{-\infty}^0 \exp(z) \exp(\frac{-1}{2\tau \nu} (z-\mu \omega)^2 ) dz - 2 \alpha\frac{1}{\sqrt{2\pi\tau\nu}}\int_{-\infty}^0 \exp(\frac{-1}{2\tau \nu} (z-\mu \omega)^2 ) dz +2\frac{1}{\sqrt{2\pi\tau\nu}}\int_{0}^\infty z \exp(\frac{-1}{2\tau \nu} (z-\mu \omega)^2 dz \right)\\<br />
\end{align}<br />
<br />
The first integral can be simplified via the substituiton<br />
\begin{align}<br />
q:= \frac{1}{\sqrt{2\tau \nu}}(z-\mu \omega -\tau \nu).<br />
\end{align}<br />
While the second and third can be simplified via the substitution<br />
\begin{align}<br />
q:= \frac{1}{\sqrt{2\tau \nu}}(z-\mu \omega ).<br />
\end{align}<br />
Using the definitions of <math display="inline">\mathrm{erf}</math> and <math display="inline">\mathrm{erfc}</math> then yields the result of the previous section.<br />
<br />
===Self-Normalizing Property Under Normalized Weights===<br />
<br />
Assuming the the weights normalized with <math display="inline">\omega=0</math> and <math display="inline">\tau=1</math>, it is possible to calculate the exact value for the spectral norm [12] of <math display="inline">g</math>'s Jacobian around the fixed point <math display="inline">(\mu, \nu) = (0, 1)</math>, which turns out to be <math display="inline">0.7877</math>. Thus, at initialization, SNNs have a stable and attracting fixed point at <math display="inline">(0, 1)</math>, which means that when <math display="inline">g</math> is applied iteratively to a pair <math display="inline">(\mu_{new}, \nu_{new})</math>, it should draw the points closer to <math display="inline">(0, 1)</math>. The rate of convergence is determined by the spectral norm [12], whose value depends on <math display="inline">\mu</math>, <math display="inline">\nu</math>, <math display="inline">\omega</math> and <math display="inline">\tau</math>.<br />
<br />
[[File:paper10_fig2.png|600px|frame|none|alt=Alt text|The figure illustrates, in the scenario described above, the mapping <math display="inline">g</math> of mean and variance <math display="inline">(\mu, \nu)</math> to <math display="inline">(\mu_{new}, \nu_{new})</math>. The arrows show the direction <math display="inline">(\mu, \nu)</math> is mapped by <math display="inline">g: (\mu, \nu)\mapsto(\mu_{new}, \nu_{new})</math>. One can clearly see the fixed point mapping <math display="inline">g</math> is at <math display="inline">(0, 1)</math>.]]<br />
<br />
===Self-Normalizing Property Under Unnormalized Weights===<br />
<br />
As weights are updated during training, there is no guarantee that they would remain normalized. The authors addressed this issue through the first key theorem presented in the paper, which states that a fixed point close to (0, 1) can still be obtained if <math display="inline">\mu</math>, <math display="inline">\nu</math>, <math display="inline">\omega</math> and <math display="inline">\tau</math> are restricted to a specified range. <br />
<br />
Additionally, there is no guarantee that the mean and variance of the inputs would stay within the range given by the first theorem, which led to the development of theorems #2 and #3. These two theorems established an upper and lower bound on the variance of inputs if the variance of activations from the previous layer are above or below the range specified, respectively. This ensures that the variance would not explode or vanish after being propagated through the network.<br />
<br />
The theorems come with lengthy proofs in the supplementary materials for the paper. High-level proof sketches are presented here.<br />
<br />
====Theorem 1: Stable and Attracting Fixed Points Close to (0, 1)====<br />
<br />
'''Definition:''' We assume <math display="inline">\alpha = \alpha_{01}</math> and <math display="inline">\lambda = \lambda_{01}</math>. We restrict the range of the variables to the domain <math display="inline">\mu \in [-0.1, 0.1]</math>, <math display="inline">\omega \in [-0.1, 0.1]</math>, <math display="inline">\nu \in [0.8, 1.5]</math>, and <math display="inline">\tau \in [0.9, 1.1]</math>. For <math display="inline">\omega = 0</math> and <math display="inline">\tau = 1</math>, the mapping has the stable fixed point <math display="inline">(\mu, \nu) = (0, 1</math>. For other <math display="inline">\omega</math> and <math display="inline">\tau</math>, g has a stable and attracting fixed point depending on <math display="inline">(\omega, \tau)</math> in the <math display="inline">(\mu, \nu)</math>-domain: <math display="inline">\mu \in [-0.03106, 0.06773]</math> and <math display="inline">\nu \in [0.80009, 1.48617]</math>. All points within the <math display="inline">(\mu, \nu)</math>-domain converge when iteratively applying the mapping to this fixed point.<br />
<br />
'''Proof:''' In order to show the the mapping <math display="inline">g</math> has a stable and attracting fixed point close to <math display="inline">(0, 1)</math>, the authors again applied Banach's fixed point theorem, which states that a contraction mapping on a nonempty complete metric space that does not map outside its domain has a unique fixed point, and that all points in the <math display="inline">(\mu, \nu)</math>-domain converge to the fixed point when <math display="inline">g</math> is iteratively applied. <br />
<br />
The two requirements are proven as follows:<br />
<br />
'''1. g is a contraction mapping.'''<br />
<br />
For <math display="inline">g</math> to be a contraction mapping in <math display="inline">\Omega</math> with distance <math display="inline">||\cdot||_2</math>, there must exist a Lipschitz constant <math display="inline">M < 1</math> such that: <br />
<br />
\begin{align} <br />
\forall \mu, \nu \in \Omega: ||g(\mu) - g(\nu)||_2 \leqslant M||\mu - \nu||_2 <br />
\end{align}<br />
<br />
As stated earlier, <math display="inline">g</math> is a contraction mapping if the spectral norm [12] of the Jacobian <math display="inline">\mathcal{H}</math> [[#Footnotes | (3)]] is below one, or equivalently, if the the largest singular value of <math display="inline">\mathcal{H}</math> is less than 1.<br />
<br />
To find the singular values of <math display="inline">\mathcal{H}</math>, the authors used an explicit formula derived by Blinn [2] for <math display="inline">2\times2</math> matrices, which states that the largest singular value of the matrix is <math display="inline">\frac{1}{2}(\sqrt{(a_{11} + a_{22}) ^ 2 + (a_{21} - a{12})^2} + \sqrt{(a_{11} - a_{22}) ^ 2 + (a_{21} + a{12})^2})</math>.<br />
<br />
For <math display="inline">\mathcal{H}</math>, an expression for the largest singular value of <math display="inline">\mathcal{H}</math>, made up of the first-order partial derivatives of the mapping <math display="inline">g</math> with respect to <math display="inline">\mu</math> and <math display="inline">\nu</math>, can be derived given the analytical solutions for <math display="inline">\widetilde{\mu}</math> and <math display="inline">\widetilde{\nu}</math> (and denoted <math display="inline">S(\mu, \omega, \nu, \tau, \lambda, \alpha)</math>).<br />
<br />
From the mean value theorem, we know that for a <math display="inline">t \in [0, 1]</math>, <br />
<br />
[[File:seq.png|600px|center]]<br />
<br />
Therefore, the distance of the singular value at <math display="inline">S(\mu, \omega, \nu, \tau, \lambda_{\mathrm{01}}, \alpha_{\mathrm{01}})</math> and at <math display="inline">S(\mu + \Delta\mu, \omega + \Delta\omega, \nu + \Delta\nu, \tau \Delta\tau, \lambda_{\mathrm{01}}, \alpha_{\mathrm{01}})</math> can be bounded above by <br />
<br />
[[File:seq2.png|600px|center]]<br />
<br />
An upper bound was obtained for each partial derivative term above, mainly through algebraic reformulations and by making use of the fact that many of the functions are monotonically increasing or decreasing on the variables they depend on in <math display="inline">\Omega</math> (see pages 17 - 25 in the supplementary materials).<br />
<br />
The <math display="inline">\Delta</math> terms were then set (rather arbitrarily) to be: <math display="inline">\Delta \mu=0.0068097371</math>,<br />
<math display="inline">\Delta \omega=0.0008292885</math>, <math display="inline">\Delta \nu=0.0009580840</math>, and <math display="inline">\Delta \tau=0.0007323095</math>. Plugging in the upper bounds on the absolute values of the derivative terms for <math display="inline">S</math> and the <math display="inline">\Delta</math> terms yields<br />
<br />
\[ S(\mu + \Delta \mu,\omega + \Delta \omega,\nu + \Delta \nu,\tau + \Delta \tau,\lambda_{\rm 01},\alpha_{\rm 01}) - S(\mu,\omega,\nu,\tau,\lambda_{\rm 01},\alpha_{\rm 01}) < 0.008747 \]<br />
<br />
Next, the largest singular value is found from a computer-assisted fine grid-search [[#Footnotes | (1)]] over the domain <math display="inline">\Omega</math>, with grid lengths <math display="inline">\Delta \mu=0.0068097371</math>, <math display="inline">\Delta \omega=0.0008292885</math>, <math display="inline">\Delta \nu=0.0009580840</math>, and <math display="inline">\Delta \tau=0.0007323095</math>, which turned out to be <math display="inline">0.9912524171058772</math>. Therefore, <br />
<br />
\[ S(\mu + \Delta \mu,\omega + \Delta \omega,\nu + \Delta \nu,\tau + \Delta \tau,\lambda_{\rm 01},\alpha_{\rm 01}) \leq 0.9912524171058772 + 0.008747 < 1 \]<br />
<br />
Since the largest singular value is smaller than 1, <math display="inline>g</math> is a contraction mapping.<br />
<br />
'''2. g does not map outside its domain.'''<br />
<br />
To prove that <math display="inline">g</math> does not map outside of the domain <math display="inline">\mu \in [-0.1, 0.1]</math> and <math display="inline">\nu \in [0.8, 1.5]</math>, lower and upper bounds on <math display="inline">\widetilde{\mu}</math> and <math display="inline">\widetilde{\nu}</math> were obtained to show that they stay within <math display="inline">\Omega</math>. <br />
<br />
First, it was shown that the derivatives of <math display="inline">\widetilde{\mu}</math> and <math display="inline">\widetilde{\xi}</math> with respect to <math display="inline">\mu</math> and <math display="inline">\nu</math> are either positive or have the sign of <math display="inline">\omega</math> in <math display="inline">\Omega</math>, so the minimum and maximum points are found at the borders. In <math display="inline">\Omega</math>, it then follows that<br />
<br />
\begin{align}<br />
-0.03106 <\widetilde{\mu}(-0.1,0.1, 0.8, 0.95, \lambda_{\rm 01}, \alpha_{\rm 01}) \leq & \widetilde{\mu} \leq \widetilde{\mu}(0.1,0.1,1.5, 1.1, \lambda_{\rm 01}, \alpha_{\rm 01}) < 0.06773<br />
\end{align}<br />
<br />
and <br />
<br />
\begin{align}<br />
0.80467 <\widetilde{\xi}(-0.1,0.1, 0.8, 0.95, \lambda_{\rm 01}, \alpha_{\rm 01}) \leq & \widetilde{\xi} \leq \widetilde{\xi}(0.1,0.1,1.5, 1.1, \lambda_{\rm 01}, \alpha_{\rm 01}) < 1.48617.<br />
\end{align}<br />
<br />
Since <math display="inline">\widetilde{\nu} = \widetilde{\xi} - \widetilde{\mu}^2</math>, <br />
<br />
\begin{align}<br />
0.80009 & \leqslant \widetilde{\nu} \leqslant 1.48617<br />
\end{align}<br />
<br />
The bounds on <math display="inline">\widetilde{\mu}</math> and <math display="inline">\widetilde{\nu}</math> are narrower than those for <math display="inline">\mu</math> and <math display="inline">\nu</math> set out in <math display="inline">\Omega</math>, therefore <math display="inline">g(\Omega) \subseteq \Omega</math>.<br />
<br />
==== Theorem 2: Decreasing Variance from Above ====<br />
<br />
'''Definition:''' For <math display="inline">\lambda = \lambda_{01}</math>, <math display="inline">\alpha = \alpha_{01}</math>, and the domain <math display="inline">\Omega^+: -1 \leqslant \mu \leqslant 1, -0.1 \leqslant \omega \leqslant 0.1, 3 \leqslant \nu \leqslant 16</math>, and <math display="inline">0.8 \leqslant \tau \leqslant 1.25</math>, we have for the mapping of the variance <math display="inline">\widetilde{\nu}(\mu, \omega, \nu, \tau, \lambda, \alpha)</math> under <math display="inline">g</math>: <math display="inline">\widetilde{\nu}(\mu, \omega, \nu, \tau, \lambda, \alpha) < \nu</math>.<br />
<br />
Theorem 2 states that when <math display="inline">\nu \in [3, 16]</math>, the mapping <math display="inline">g</math> draws it to below 3 when applied across layers, thereby establishing an upper bound of <math display="inline">\nu < 3</math> on variance.<br />
<br />
'''Proof:''' The authors proved the inequality by showing that <math display="inline">g(\mu, \omega, \xi, \tau, \lambda_{01}, \alpha_{01}) = \widetilde{\xi}(\mu, \omega, \xi, \tau, \lambda_{01}, \alpha_{01}) - \nu < 0</math>, since the second moment should be greater than or equal to variance <math display="inline">\widetilde{\nu}</math>. The behavior of <math display="inline">\frac{\partial }{\partial \mu } \widetilde{\xi}(\mu, \omega, \nu, \tau, \lambda, \alpha)</math>, <math display="inline">\frac{\partial }{\partial \omega } \widetilde{\xi}(\mu, \omega, \nu, \tau, \lambda, \alpha)</math>, <math display="inline">\frac{\partial }{\partial \nu } \widetilde{\xi}(\mu, \omega, \nu, \tau, \lambda, \alpha)</math>, and <math display="inline">\frac{\partial }{\partial \tau } \widetilde{\xi}(\mu, \omega, \nu, \tau, \lambda, \alpha)</math> are used to find the bounds on <math display="inline">g(\mu, \omega, \xi, \tau, \lambda_{01}, \alpha_{01})</math> (see pages 9 - 13 in the supplementary materials). Again, the partial derivative terms were monotonic, which made it possible to find the upper bound at the board values. It was shown that the maximum value of <math display="inline">g</math> does not exceed <math display="inline">-0.0180173</math>.<br />
<br />
==== Theorem 3: Increasing Variance from Below ====<br />
<br />
'''Definition''': We consider <math display="inline">\lambda = \lambda_{01}</math>, <math display="inline">\alpha = \alpha_{01}</math>, and the domain <math display="inline">\Omega^-: -0.1 \leqslant \mu \leqslant 0.1</math> and <math display="inline">-0.1 \leqslant \omega \leqslant 0.1</math>. For the domain <math display="inline">0.02 \leqslant \nu \leqslant 0.16</math> and <math display="inline">0.8 \leqslant \tau \leqslant 1.25</math> as well as for the domain <math display="inline">0.02 \leqslant \nu \leqslant 0.24</math> and <math display="inline">0.9 \leqslant \tau \leqslant 1.25</math>, the mapping of the variance <math display="inline">\widetilde{\nu}(\mu, \omega, \nu, \tau, \lambda, \alpha)</math> increases: <math display="inline">\widetilde{\nu}(\mu, \omega, \nu, \tau, \lambda, \alpha) > \nu</math>.<br />
<br />
Theorem 3 states that the variance <math display="inline">\widetilde{\nu}</math> increases when variance is smaller than in <math display="inline">\Omega</math>. The lower bound on variance is <math display="inline">\widetilde{\nu} > 0.16</math> when <math display="inline">0.8 \leqslant \tau</math> and <math display="inline">\widetilde{\nu} > 0.24</math> when <math display="inline">0.9 \leqslant \tau</math> under the proposed mapping.<br />
<br />
'''Proof:''' According to the mean value theorem, for a <math display="inline">t \in [0, 1]</math>,<br />
<br />
[[File:th3.png|700px|center]]<br />
<br />
Similar to the proof for Theorem 2 (except we are interested in the smallest <math display="inline">\widetilde{\nu}</math> instead of the biggest), the lower bound for <math display="inline">\frac{\partial }{\partial \nu} \widetilde{\xi}(\mu,\omega,\nu+t(\nu_{\mathrm{min}}-\nu),\tau,\lambda_{\rm 01},\alpha_{\rm 01})</math> can be derived, and substituted into the relationship <math display="inline">\widetilde{\nu} = \widetilde{\xi}(\mu,\omega,\nu,\tau,\lambda_{\rm 01},\alpha_{\rm 01}) - (\widetilde{\mu}(\mu,\omega,\nu,\tau,\lambda_{\rm 01},\alpha_{\rm 01}))^2</math>. The lower bound depends on <math display="inline">\tau</math> and <math display="inline">\nu</math>, and in the <math display="inline">\Omega^{-1}</math> listed, it is slightly above <math display="inline">\nu</math>.<br />
<br />
== Implementation Details ==<br />
<br />
=== Initialization ===<br />
<br />
Since SNNs have a fixed point at zero mean and unit variance for normalized weights <math>ω = \sum_{i = 1}^n w_i = 0 </math> and <math>ω = \sum_{i = 1}^n w_i^2 = 1 </math> , the authors initialize SNNs such that these constraints are fulfilled in expectation. They draw the weights from a Gaussian distribution with <math> E(wi) = 0 </math> and variance <math> Var(w_i) = 1/n </math>. <br />
<br />
As previously explained, SNNs work best when inputs to the network are standardized, and the weights are initialized with mean of 0 and variance of <math display="inline">\frac{1}{n}</math> to help converge to the fixed point <math display="inline">(\mu, \nu) = (0, 1)</math>.<br />
<br />
=== Dropout Technique ===<br />
<br />
The authors reason that regular dropout, randomly setting activations to 0 with probability <math display="inline">1 - q</math>, is not compatible with SELUs. This is because the low variance region in SELUs is at <math display="inline">\lim_{x \rightarrow -\infty} = -\lambda \alpha</math>, not 0. Contrast this with ReLUs, which work well with dropout since they have <math display="inline">\lim_{x \rightarrow -\infty} = 0</math> as the saturation region. Therefore, a new dropout technique for SELUs was needed, termed ''alpha dropout''.<br />
<br />
With alpha dropout, activations are randomly set to <math display="inline">-\lambda\alpha = \alpha'</math>, which for this paper corresponds to the constant <math display="inline">1.7581</math>, with probability <math display="inline">1 - q</math>.<br />
<br />
The updated mean and variance of the activations are now:<br />
\[ \mathrm{E}(xd + \alpha'(1 - d)) = \mu q + \alpha'(1 - q) \] <br />
<br />
and<br />
<br />
\[ \mathrm{Var}(xd + \alpha'(1 - d)) = q((1-q)(\alpha' - \mu)^2 + \nu) \]<br />
<br />
Activations need to be transformed (e.g. scaled) after dropout to maintain the same mean and variance. In regular dropout, conserving the mean and variance correlates to scaling activations by a factor of 1/q while training. To ensure that mean and variance are unchanged after alpha dropout, the authors used an affine transformation <math display="inline">a(xd + \alpha'(1 - d)) + b</math>, and solved for the values of <math display="inline">a</math> and <math display="inline">b</math> to give <math display="inline">a = (\frac{\nu}{q((1-q)(\alpha' - \mu)^2 + \nu)})^{\frac{1}{2}}</math> and <math display="inline">b = \mu - a(q\mu + (1-q)\alpha'))</math>. As the values for <math display="inline">\mu</math> and <math display="inline">\nu</math> are set to <math display="inline">0</math> and <math display="inline">1</math> throughout the paper, these expressions can be simplified into <math display="inline">a = (q + \alpha'^2 q(1-q))^{-\frac{1}{2}}</math> and <math display="inline">b = -(q + \alpha^2 q (1-q))^{-\frac{1}{2}}((1 - q)\alpha')</math>, where <math display="inline">\alpha' \approx 1.7581</math>.<br />
<br />
Empirically, the authors found that dropout rates (1-q) of <math display="inline">0.05</math> or <math display="inline">0.10</math> worked well with SNNs.<br />
<br />
=== Optimizers ===<br />
<br />
Through experiments, the authors found that stochastic gradient descent, momentum, Adadelta and Adamax work well on SNNs. For Adam, configuration parameters <math display="inline">\beta_2 = 0.99</math> and <math display="inline">\epsilon = 0.01</math> were found to be more effective.<br />
<br />
==Experimental Results==<br />
<br />
Three sets of experiments were conducted to compare the performance of SNNs to six other FNN structures and to other machine learning algorithms, such as support vector machines and random forests. The experiments were carried out on (1) 121 UCI Machine Learning Repository datasets, (2) the Tox21 chemical compounds toxicity effects dataset (with 12,000 compounds and 270,000 features), and (3) the HTRU2 dataset of statistics on radio wave signals from pulsar candidates (with 18,000 observations and eight features). In each set of experiment, hyperparameter search was conducted on a validation set to select parameters such as the number of hidden units, number of hidden layers, learning rate, regularization parameter, and dropout rate (see pages 95 - 107 of the supplementary material for exact hyperparameters considered). Whenever models of different setups gave identical results on the validation data, preference was given to the structure with more layers, lower learning rate and higher dropout rate.<br />
<br />
The six FNN structures considered were: (1) FNNs with ReLU activations, no normalization and “Microsoft weight initialization” (MSRA) [5] to control the variance of input signals [5]; (2) FNNs with batch normalization [6], in which normalization is applied to activations of the same mini-batch; (3) FNNs with layer normalization [1], in which normalization is applied on a per layer basis for each training example; (4) FNNs with weight normalization [8], whereby each layer’s weights are normalized by learning the weight’s magnitude and direction instead of the weight vector itself; (5) highway networks, in which layers are not restricted to being sequentially connected [9]; and (6) an FNN-version of residual networks [4], with residual blocks made up of two or three densely connected layers.<br />
<br />
On the Tox21 dataset, the authors demonstrated the self-normalizing effect by comparing the distribution of neural inputs <math display="inline">z</math> at initialization and after 40 epochs of training to that of the standard normal. As Figure 3 show, the distribution of <math display="inline">z</math> remained similar to a normal distribution.<br />
<br />
[[File:snnf3.png|600px]]<br />
<br />
On all three sets of classification tasks, the authors demonstrated that SNN outperformed the other FNN counterparts on accuracy and AUC measures. <br />
<br />
SNN achieves close to the state-of-the-art results on the Tox21 dataset with an 8-layer network. The challenge requires prediction of toxic effects of 12000 chemicals based on their chemical structures. SNN with 8 layers had the best performance.<br />
<br />
[[File:tox21.png|600px]]<br />
<br />
<br />
On UCI datasets with fewer than 1,000 observations, SNNs did not outperform SVMs or random forests in terms of average rank in accuracy, but on datasets with at least 1,000 observations, SNNs showed the best overall performance (average rank of 5.8, compared to 6.1 for support vector machines and 6.6 for random forests). Through hyperparameter tuning, it was also discovered that the average depth of FNNs is 10.8 layers, more than the other FNN architectures tried.<br />
<br />
<br />
<br />
In pulsar predictions with the HTRU dataset SNNs produced a new state-of-the-art AUC by a small margin (achieving an AUC 0.98, averaged over 10 cross-validation folds, versus the previous record of 0.976).<br />
[[File:htru2.png|600px]]<br />
<br />
==Future Work==<br />
<br />
Although not the focus of this paper, the authors also briefly noted that their initial experiments with applying SELUs on relatively simple CNN structures showed promising results, which is not surprising given that ELUs, which do not have the self-normalizing property, has already been shown to work well with CNNs, demonstrating faster convergence than ReLU networks and even pushing the state-of-the-art error rates on CIFAR-100 at the time of publishing in 2015 [3].<br />
<br />
Since the paper was published, SELUs have been adopted by several researchers, not just with FNNs [https://github.com/bioinf-jku/SNNs see link], but also with CNNs, GANs, autoencoders, reinforcement learning and RNNs. In a few cases, researchers for those papers concluded that networks trained with SELUs converged faster than those trained with ReLUs, and that SELUs have the same convergence quality as batch normalization. There is potential for SELUs to be incorporated into more architectures in the future.<br />
<br />
==Critique==<br />
<br />
Overall, the authors presented a convincing case for using SELUs (along with proper initialization and alpha dropout) on FNNs. FNNs trained with SELU have more layers than those with other normalization techniques, so the work here provides a promising direction for making traditional FNNs more powerful. There are not as many well-established benchmark datasets to evaluate FNNs, but the experiments carried out, particularly on the larger Tox21 dataset, showed that SNNs can be very effective at classification tasks.<br />
<br />
The proofs provide a satisfactory explanation for why SELUs have a self-normalising property within the specified domain, but during their introduction the authors give 4 criteria that an activation function must satisfy in order to be self-normalising. Those criteria make intuitive sense, but there is a lack of firm justification which creates some confusion. For example, they state that SNNs cannot be derived from tanh units, even though <math> tanh(\lambda x) </math> satisfies all 4 criteria if <math> \lambda </math> is larger than 1. Assuming the authors did not overlook such a simple modification, there must be some additional criteria for an activation function to have a self-normalising property.<br />
<br />
The only question I have with the proofs is the lack of explanation for how the domains, <math display="inline">\Omega</math>, <math display="inline">\Omega^-</math> and <math display="inline">\Omega^+</math> are determined, which is an important consideration because they are used for deriving the upper and lower bounds on expressions needed for proving the three theorems. The ranges appear somewhat set through trial-and-error and heuristics to ensure the numbers work out (e.g. make the spectral norm [12] of <math display="inline">\mathcal{J}</math> as large as can be below 1 so as to ensure <math display="inline">g</math> is a contraction mapping), so it is not clear whether they are unique conditions, or that the parameters will remain within those prespecified ranges throughout training; and if the parameters can stray away from the ranges provided, then the issue of what will happen to the self-normalizing property was not addressed. Perhaps that is why the authors gave preference to models with a deeper structure and smaller learning rate during experiments to help the parameters stay within their domains. Further, in addition to the hyperparameters considered, it would be helpful to know the final values that went into the best-performing models, for a better understanding of what range of values work better for SNNs empirically.<br />
<br />
==Conclusion==<br />
<br />
The SNN structure proposed in this paper is built on the traditional FNN structure with a few modifications, including the use of SELUs as the activation function (with <math display="inline">\lambda \approx 1.0507</math> and <math display="inline">\alpha \approx 1.6733</math>), alpha dropout, network weights initialized with mean of zero and variance <math display="inline">\frac{1}{n}</math>, and inputs normalized to mean of zero and variance of one. It is simple to implement while being backed up by detailed theory. <br />
<br />
When properly initialized, SELUs will draw neural inputs towards a fixed point of zero mean and unit variance as the activations are propagated through the layers. The self-normalizing property is maintained even when weights deviate from their initial values during training (under mild conditions). When the variance of inputs goes beyond the prespecified range imposed, they are still bounded above and below so SNNs do not suffer from exploding and vanishing gradients. This self-normalizing property allows SNNs to be more robust to perturbations in stochastic gradient descent, so deeper structures with better prediction performance can be built. <br />
<br />
In the experiments conducted, the authors demonstrated that SNNs outperformed FNNs trained with other normalization techniques, such as batch, layer and weight normalization, and specialized architectures, such as highway or residual networks, on several classification tasks, including on the UCI Machine Learning Repository datasets. SELUs help in reducing the computation time for normalizing the network relative to RELU+BN and hence is promising. The adoption of SELUs by other researchers also lends credence to the potential for SELUs to be implemented in more neural network architectures.<br />
<br />
==References==<br />
<br />
# Ba, Kiros and Hinton. "Layer Normalization". arXiv:1607.06450. (2016).<br />
# Blinn. "Consider the Lowly 2X2 Matrix." IEEE Computer Graphics and Applications. (1996).<br />
# Clevert, Unterthiner, Hochreiter. "Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)." arXiv: 1511.07289. (2015).<br />
# He, Zhang, Ren and Sun. "Deep Residual Learning for Image Recognition." arXiv:1512.03385. (2015).<br />
# He, Zhang, Ren and Sun. "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification." arXiv:1502.01852. (2015). <br />
# Ioffe and Szegedy. "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariance Shift." arXiv:1502.03167. (2015).<br />
# Klambauer, Unterthiner, Mayr and Hochreiter. "Self-Normalizing Neural Networks." arXiv: 1706.02515. (2017).<br />
# Salimans and Kingma. "Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks." arXiv:1602.07868. (2016).<br />
# Srivastava, Greff and Schmidhuber. "Highway Networks." arXiv:1505.00387 (2015).<br />
# Unterthiner, Mayr, Klambauer and Hochreiter. "Toxicity Prediction Using Deep Learning." arXiv:1503.01445. (2015). <br />
# https://en.wikipedia.org/wiki/Central_limit_theorem <br />
# http://mathworld.wolfram.com/SpectralNorm.html <br />
# https://www.math.umd.edu/~petersd/466/fixedpoint.pdf<br />
<br />
==Online Resources==<br />
https://github.com/bioinf-jku/SNNs (GitHub repository maintained by some of the paper's authors)<br />
<br />
==Footnotes==<br />
<br />
1. Error propagation analysis: The authors performed an error analysis to quantify the potential numerical imprecisions propagated through the numerous operations performed. The potential imprecision <math display="inline">\epsilon</math> was quantified by applying the mean value theorem<br />
<br />
\[ |f(x + \Delta x - f(x)| \leqslant ||\triangledown f(x + t\Delta x|| ||\Delta x|| \textrm{ for } t \in [0, 1]\textrm{.} \] <br />
<br />
The error propagation rules, or <math display="inline">|f(x + \Delta x - f(x)|</math>, was first obtained for simple operations such as addition, subtraction, multiplication, division, square root, exponential function, error function and complementary error function. Them, the error bounds on the compound terms making up <math display="inline">\Delta (S(\mu, \omega, \nu, \tau, \lambda, \alpha)</math> were found by decomposing them into the simpler expressions. If each of the variables have a precision of <math display="inline">\epsilon</math>, then it turns out <math display="inline">S</math> has a precision better than <math display="inline">292\epsilon</math>. For a machine with a precision of <math display="inline">2^{-56}</math>, the rounding error is <math display="inline">\epsilon \approx 10^{-16}</math>, and <math display="inline">292\epsilon < 10^{-13}</math>. In addition, all computations are correct up to 3 ulps (“unit in last place”) for the hardware architectures and GNU C library used, with 1 ulp being the highest precision that can be achieved.<br />
<br />
2. Independence Assumption: The classic definition of central limit theorem requires <math display="inline">x_i</math>’s to be independent and identically distributed, which is not guaranteed to hold true in a neural network layer. However, according to the Lyapunov CLT, the <math display="inline">x_i</math>’s do not need to be identically distributed as long as the <math display="inline">(2 + \delta)</math>th moment exists for the variables and meet the Lyapunov condition for the rate of growth of the sum of the moments [11]. In addition, CLT has also shown to be valid under weak dependence under mixing conditions [11]. Therefore, the authors argue that the central limit theorem can be applied with network inputs.<br />
<br />
3. <math display="inline">\mathcal{H}</math> versus <math display="inline">\mathcal{J}</math> Jacobians: In solving for the largest singular value of the Jacobian <math display="inline">\mathcal{H}</math> for the mapping <math display="inline">g: (\mu, \nu)</math>, the authors first worked with the terms in the Jacobian <math display="inline">\mathcal{J}</math> for the mapping <math display="inline">h: (\mu, \nu) \rightarrow (\widetilde{\mu}, \widetilde{\xi})</math> instead, because the influence of <math display="inline">\widetilde{\mu}</math> on <math display="inline">\widetilde{\nu}</math> is small when <math display="inline">\widetilde{\mu}</math> is small in <math display="inline">\Omega</math> and <math display="inline">\mathcal{H}</math> can be easily expressed as terms in <math display="inline">\mathcal{J}</math>. <math display="inline">\mathcal{J}</math> was referenced in the paper, but I used <math display="inline">\mathcal{H}</math> in the summary here to avoid confusion.</div>F7xiahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Self_Normalizing_Neural_Networks&diff=36443stat946w18/Self Normalizing Neural Networks2018-04-21T03:24:15Z<p>F7xia: /* Implementation Details */</p>
<hr />
<div>==Introduction and Motivation==<br />
<br />
While neural networks have been making a lot of headway in improving benchmark results and narrowing the gap with human-level performance, success has been fairly limited to visual and sequential processing tasks through advancements in convolutional network and recurrent network structures. Most data science competitions outside of these domains are still outperformed by algorithms such as gradient boosting and random forests. The traditional (densely connected) feed-forward neural networks (FNNs) are rarely used competitively, and when they do win on rare occasions, they have very shallow network architectures with just up to four layers [10].<br />
<br />
The authors, Klambauer et al., believe that what prevents FNNs from becoming more competitively useful is the inability to train a deeper FNN structure, which would allow the network to learn more levels of abstract representations. Primarily, the difficulty arises due to the instability of gradients in very deep FNNs leading to problems like gradient vanishing/explosion. To have a deeper network, oscillations in the distribution of activations need to be kept under control so that stable gradients can be obtained during training. Several techniques are available to normalize activations, including batch normalization [6], layer normalization [1] and weight normalization [8]. These methods work well with CNNs and RNNs, but not so much with FNNs because backpropagating through normalization parameters introduces additional variance to the gradients, and regularization techniques like dropout further perturb the normalization effect. CNNs and RNNs are less sensitive to such perturbations, presumably due to their weight sharing architecture, but FNNs do not have such a property and thus suffer from high variance in training errors, which hinders learning. Furthermore, the aforementioned normalization techniques involve adding external layers to the model and can slow down computation, which may already be slow when working with very deep FNNs. <br />
<br />
Therefore, the authors were motivated to develop a new FNN implementation that can achieve the intended effect of normalization techniques that works well with stochastic gradient descent and dropout. Self-normalizing neural networks (SNNs) are based on the idea of scaled exponential linear units (SELU), a new activation function introduced in this paper, whose output distribution is proved to converge to a fixed point, thus making it possible to train deeper networks.<br />
<br />
==Notations==<br />
<br />
As the paper (primarily in the supplementary materials) comes with lengthy proofs, important notations are listed first.<br />
<br />
Consider two fully-connected layers, let <math display="inline">x</math> denote the inputs to the second layer, then <math display="inline">z = Wx</math> represents the network inputs of the second layer, and <math display="inline">y = f(z)</math> represents the activations in the second layer.<br />
<br />
Assume that all activations from a lower layer <math display="inline">x_i</math>'s, <math display="inline">1 \leqslant i \leqslant n</math>, have the same mean <math display="inline">\mu := \mathrm{E}(x_i)</math> and the same variance <math display="inline">\nu := \mathrm{Var}(x_i)</math> and that each <math display="inline">y</math> has mean <math display="inline">\widetilde{\mu} := \mathrm{E}(y)</math> and variance <math display="inline">\widetilde{\nu} := \mathrm{Var}(y)</math>, then let <math display="inline">g</math> be the set of functions that maps <math display="inline">(\mu, \nu)</math> to <math display="inline">(\widetilde{\mu}, \widetilde{\nu})</math>. <br />
<br />
For the weight vector <math display="inline">w</math>, <math display="inline">n</math> times the mean of the weight vector is <math display="inline">\omega := \sum_{i = 1}^n \omega_i</math> and <math display="inline">n</math> times the second moment is <math display="inline">\tau := \sum_{i = 1}^{n} w_i^2</math>.<br />
<br />
==Key Concepts==<br />
<br />
===Self-Normalizing Neural-Net (SNN)===<br />
<br />
''A neural network is self-normalizing if it possesses a mapping <math display="inline">g: \Omega \rightarrow \Omega</math> for each activation <math display="inline">y</math> that maps mean and variance from one layer to the next and has a stable and attracting fixed point depending on <math display="inline">(\omega, \tau)</math> in <math display="inline">\Omega</math>. Furthermore, the mean and variance remain in the domain <math display="inline">\Omega</math>, that is <math display="inline">g(\Omega) \subseteq \Omega</math>, where <math display="inline">\Omega = \{ (\mu, \nu) | \mu \in [\mu_{min}, \mu_{max}], \nu \in [\nu_{min}, \nu_{max}] \}</math>. When iteratively applying the mapping <math display="inline">g</math>, each point within <math display="inline">\Omega</math> converges to this fixed point.''<br />
<br />
In other words, in SNNs, if the inputs from an earlier layer (<math display="inline">x</math>) already have their mean and variance within a predefined interval <math display="inline">\Omega</math>, then the activations to the next layer (<math display="inline">y = f(z = Wx)</math>) should remain within those intervals. This is true across all pairs of connecting layers as the normalizing effect gets propagated through the network, hence why the term self-normalizing. When the mapping is applied iteratively, it should draw the mean and variance values closer to a fixed point within <math display="inline">\Omega</math>, the value of which depends on <math display="inline">\omega</math> and <math display="inline">\tau</math> (recall that they are from the weight vector).<br />
<br />
We will design a FNN then construct a g that takes the mean and variance of each layer to those of the next and is a contraction mapping i.e. <math>g(\mu_i, \nu_i) = (\mu_{i+1}, \nu_{i+1}) \forall i </math>. It should be noted that although the g required in the SNN definition depends on <math display="inline">(\omega, \tau)</math> of an individual layer, the FNN that we construct will have the same values of <math display="inline">(\omega, \tau)</math> for each layer. Intuitively this definition can be interpreted as saying that the mean and variance of the final layer of a sufficiently deep SNN will not change when the mean and variance of the input data change. This is because the mean and variance are passing through a contraction mapping at each layer, converging to the mapping's fixed point.<br />
<br />
The activation function that makes an SNN possible should meet the following four conditions:<br />
<br />
# It can take on both negative and positive values, so it can normalize the mean;<br />
# It has a saturation region, so it can dampen variances that are too large;<br />
# It has a slope larger than one, so it can increase variances that are too small; and<br />
# It is a continuous curve, which is necessary for the fixed point to exist (see the definition of Banach fixed point theorem to follow).<br />
<br />
Commonly used activation functions such as rectified linear units (ReLU), sigmoid, tanh, leaky ReLUs and exponential linear units (ELUs) do not meet all four criteria, therefore, a new activation function is needed.<br />
<br />
===Scaled Exponential Linear Units (SELUs)===<br />
<br />
One of the main ideas introduced in this paper is the SELU function. As the name suggests, it is closely related to ELU [3],<br />
<br />
\[ \mathrm{elu}(x) = \begin{cases} x & x > 0 \\<br />
\alpha e^x - \alpha & x \leqslant 0<br />
\end{cases} \]<br />
<br />
but further builds upon it by introducing a new scale parameter $\lambda$ and proving the exact values that $\alpha$ and $\lambda$ should take on to achieve self-normalization. SELU is defined as:<br />
<br />
\[ \mathrm{selu}(x) = \lambda \begin{cases} x & x > 0 \\<br />
\alpha e^x - \alpha & x \leqslant 0<br />
\end{cases} \]<br />
<br />
SELUs meet all four criteria listed above - it takes on positive values when <math display="inline">x > 0</math> and negative values when <math display="inline">x < 0</math>, it has a saturation region when <math display="inline">x</math> is a larger negative value, the value of <math display="inline">\lambda</math> can be set to greater than one to ensure a slope greater than one, and it is continuous at <math display="inline">x = 0</math>. <br />
<br />
Figure 1 below gives an intuition for how SELUs normalize activations across layers. As shown, a variance dampening effect occurs when inputs are negative and far away from zero, and a variance increasing effect occurs when inputs are close to zero.<br />
<br />
[[File:snnf1.png|500px]]<br />
<br />
Figure 2 below plots the progression of training error on the MNIST and CIFAR10 datasets when training with SNNs versus FNNs with batch normalization at varying model depths. As shown, FNNs that adopted the SELU activation function exhibited lower and less variable training loss compared to using batch normalization, even as the depth increased to 16 and 32 layers.<br />
<br />
[[File:snnf2.png|600px]]<br />
<br />
=== Banach Fixed Point Theorem and Contraction Mappings ===<br />
<br />
The underlying theory behind SNNs is the Banach fixed point theorem, which states the following: ''Let <math display="inline">(X, d)</math> be a non-empty complete metric space with a contraction mapping <math display="inline">f: X \rightarrow X</math>. Then <math display="inline">f</math> has a unique fixed point <math display="inline">x_f \subseteq X</math> with <math display="inline">f(x_f) = x_f</math>. Every sequence <math display="inline">x_n = f(x_{n-1})</math> with starting element <math display="inline">x_0 \subseteq X</math> converges to the fixed point: <math display="inline">x_n \underset{n \rightarrow \infty}\rightarrow x_f</math>.''<br />
<br />
A contraction mapping is a function <math display="inline">f: X \rightarrow X</math> on a metric space <math display="inline">X</math> with distance <math display="inline">d</math>, such that for all points <math display="inline">\mathbf{u}</math> and <math display="inline">\mathbf{v}</math> in <math display="inline">X</math>: <math display="inline">d(f(\mathbf{u}), f(\mathbf{v})) \leqslant \delta d(\mathbf{u}, \mathbf{v})</math>, for a <math display="inline">0 \leqslant \delta \leqslant 1</math>.<br />
<br />
The easiest way to prove a contraction mapping is usually to show that the spectral norm [12] of its Jacobian is less than 1 [13], as was done for this paper.<br />
<br />
==Proving the Self-Normalizing Property==<br />
<br />
===Mean and Variance Mapping Function===<br />
<br />
<math display="inline">g</math> is derived under the assumption that <math display="inline">x_i</math>'s are independent but not necessarily having the same mean and variance [[#Footnotes |(2)]]. Under this assumption (and recalling earlier notation of <math display="inline">\omega</math> and <math display="inline">\tau</math>),<br />
<br />
\begin{align}<br />
\mathrm{E}(z = \mathbf{w}^T \mathbf{x}) = \sum_{i = 1}^n w_i \mathrm{E}(x_i) = \mu \omega<br />
\end{align}<br />
<br />
\begin{align}<br />
\mathrm{Var}(z) = \mathrm{Var}(\sum_{i = 1}^n w_i x_i) = \sum_{i = 1}^n w_i^2 \mathrm{Var}(x_i) = \nu \sum_{i = 1}^n w_i^2 = \nu\tau \textrm{ .}<br />
\end{align}<br />
<br />
When the weight terms are normalized, <math display="inline">z</math> can be viewed as a weighted sum of <math display="inline">x_i</math>'s. Wide neural net layers with a large number of nodes is common, so <math display="inline">n</math> is usually large, and by the Central Limit Theorem, <math display="inline">z</math> approaches a normal distribution <math display="inline">\mathcal{N}(\mu\omega, \sqrt{\nu\tau})</math>. <br />
<br />
Using the above property, the exact form for <math display="inline">g</math> can be obtained using the definitions for mean and variance of continuous random variables: <br />
<br />
[[File:gmapping.png|600px|center]]<br />
<br />
Analytical solutions for the integrals can be obtained as follows: <br />
<br />
[[File:gintegral.png|600px|center]]<br />
<br />
The authors are interested in the fixed point <math display="inline">(\mu, \nu) = (0, 1)</math> as these are the parameters associated with the common standard normal distribution. The authors also proposed using normalized weights such that <math display="inline">\omega = \sum_{i = 1}^n = 0</math> and <math display="inline">\tau = \sum_{i = 1}^n w_i^2= 1</math> as it gives a simpler, cleaner expression for <math display="inline">\widetilde{\mu}</math> and <math display="inline">\widetilde{\nu}</math> in the calculations in the next steps. This weight scheme can be achieved in several ways, for example, by drawing from a normal distribution <math display="inline">\mathcal{N}(0, \frac{1}{n})</math> or from a uniform distribution <math display="inline">U(-\sqrt{3}, \sqrt{3})</math>.<br />
<br />
At <math display="inline">\widetilde{\mu} = \mu = 0</math>, <math display="inline">\widetilde{\nu} = \nu = 1</math>, <math display="inline">\omega = 0</math> and <math display="inline">\tau = 1</math>, the constants <math display="inline">\lambda</math> and <math display="inline">\alpha</math> from the SELU function can be solved for - <math display="inline">\lambda_{01} \approx 1.0507</math> and <math display="inline">\alpha_{01} \approx 1.6733</math>. These values are used throughout the rest of the paper whenever an expression calls for <math display="inline">\lambda</math> and <math display="inline">\alpha</math>.<br />
<br />
===Details of Moment-Mapping Integrals ===<br />
Consider the moment-mapping integrals:<br />
\begin{align}<br />
\widetilde{\mu} & = \int_{-\infty}^\infty \mathrm{selu} (z) p_N(z; \mu \omega, \sqrt{\nu \tau})dz\\<br />
\widetilde{\nu} & = \int_{-\infty}^\infty \mathrm{selu} (z)^2 p_N(z; \mu \omega, \sqrt{\nu \tau}) dz-\widetilde{\mu}^2.<br />
\end{align}<br />
<br />
The equation for <math display="inline">\widetilde{\mu}</math> can be expanded as <br />
\begin{align}<br />
\widetilde{\mu} & = \frac{\lambda}{2}\left( 2\alpha\int_{-\infty}^0 (\exp(z)-1) p_N(z; \mu \omega, \sqrt{\nu \tau})dz +2\int_{0}^\infty z p_N(z; \mu \omega, \sqrt{\nu \tau})dz \right)\\<br />
&= \frac{\lambda}{2}\left( 2 \alpha \frac{1}{\sqrt{2\pi\tau\nu}} \int_{-\infty}^0 (\exp(z)-1) \exp(\frac{-1}{2\tau \nu} (z-\mu \omega)^2 ) dz +2\frac{1}{\sqrt{2\pi\tau\nu}}\int_{0}^\infty z \exp(\frac{-1}{2\tau \nu} (z-\mu \omega)^2 dz \right)\\<br />
&= \frac{\lambda}{2}\left( 2 \alpha\frac{1}{\sqrt{2\pi\tau\nu}}\int_{-\infty}^0 \exp(z) \exp(\frac{-1}{2\tau \nu} (z-\mu \omega)^2 ) dz - 2 \alpha\frac{1}{\sqrt{2\pi\tau\nu}}\int_{-\infty}^0 \exp(\frac{-1}{2\tau \nu} (z-\mu \omega)^2 ) dz +2\frac{1}{\sqrt{2\pi\tau\nu}}\int_{0}^\infty z \exp(\frac{-1}{2\tau \nu} (z-\mu \omega)^2 dz \right)\\<br />
\end{align}<br />
<br />
The first integral can be simplified via the substituiton<br />
\begin{align}<br />
q:= \frac{1}{\sqrt{2\tau \nu}}(z-\mu \omega -\tau \nu).<br />
\end{align}<br />
While the second and third can be simplified via the substitution<br />
\begin{align}<br />
q:= \frac{1}{\sqrt{2\tau \nu}}(z-\mu \omega ).<br />
\end{align}<br />
Using the definitions of <math display="inline">\mathrm{erf}</math> and <math display="inline">\mathrm{erfc}</math> then yields the result of the previous section.<br />
<br />
===Self-Normalizing Property Under Normalized Weights===<br />
<br />
Assuming the the weights normalized with <math display="inline">\omega=0</math> and <math display="inline">\tau=1</math>, it is possible to calculate the exact value for the spectral norm [12] of <math display="inline">g</math>'s Jacobian around the fixed point <math display="inline">(\mu, \nu) = (0, 1)</math>, which turns out to be <math display="inline">0.7877</math>. Thus, at initialization, SNNs have a stable and attracting fixed point at <math display="inline">(0, 1)</math>, which means that when <math display="inline">g</math> is applied iteratively to a pair <math display="inline">(\mu_{new}, \nu_{new})</math>, it should draw the points closer to <math display="inline">(0, 1)</math>. The rate of convergence is determined by the spectral norm [12], whose value depends on <math display="inline">\mu</math>, <math display="inline">\nu</math>, <math display="inline">\omega</math> and <math display="inline">\tau</math>.<br />
<br />
[[File:paper10_fig2.png|600px|frame|none|alt=Alt text|The figure illustrates, in the scenario described above, the mapping <math display="inline">g</math> of mean and variance <math display="inline">(\mu, \nu)</math> to <math display="inline">(\mu_{new}, \nu_{new})</math>. The arrows show the direction <math display="inline">(\mu, \nu)</math> is mapped by <math display="inline">g: (\mu, \nu)\mapsto(\mu_{new}, \nu_{new})</math>. One can clearly see the fixed point mapping <math display="inline">g</math> is at <math display="inline">(0, 1)</math>.]]<br />
<br />
===Self-Normalizing Property Under Unnormalized Weights===<br />
<br />
As weights are updated during training, there is no guarantee that they would remain normalized. The authors addressed this issue through the first key theorem presented in the paper, which states that a fixed point close to (0, 1) can still be obtained if <math display="inline">\mu</math>, <math display="inline">\nu</math>, <math display="inline">\omega</math> and <math display="inline">\tau</math> are restricted to a specified range. <br />
<br />
Additionally, there is no guarantee that the mean and variance of the inputs would stay within the range given by the first theorem, which led to the development of theorems #2 and #3. These two theorems established an upper and lower bound on the variance of inputs if the variance of activations from the previous layer are above or below the range specified, respectively. This ensures that the variance would not explode or vanish after being propagated through the network.<br />
<br />
The theorems come with lengthy proofs in the supplementary materials for the paper. High-level proof sketches are presented here.<br />
<br />
====Theorem 1: Stable and Attracting Fixed Points Close to (0, 1)====<br />
<br />
'''Definition:''' We assume <math display="inline">\alpha = \alpha_{01}</math> and <math display="inline">\lambda = \lambda_{01}</math>. We restrict the range of the variables to the domain <math display="inline">\mu \in [-0.1, 0.1]</math>, <math display="inline">\omega \in [-0.1, 0.1]</math>, <math display="inline">\nu \in [0.8, 1.5]</math>, and <math display="inline">\tau \in [0.9, 1.1]</math>. For <math display="inline">\omega = 0</math> and <math display="inline">\tau = 1</math>, the mapping has the stable fixed point <math display="inline">(\mu, \nu) = (0, 1</math>. For other <math display="inline">\omega</math> and <math display="inline">\tau</math>, g has a stable and attracting fixed point depending on <math display="inline">(\omega, \tau)</math> in the <math display="inline">(\mu, \nu)</math>-domain: <math display="inline">\mu \in [-0.03106, 0.06773]</math> and <math display="inline">\nu \in [0.80009, 1.48617]</math>. All points within the <math display="inline">(\mu, \nu)</math>-domain converge when iteratively applying the mapping to this fixed point.<br />
<br />
'''Proof:''' In order to show the the mapping <math display="inline">g</math> has a stable and attracting fixed point close to <math display="inline">(0, 1)</math>, the authors again applied Banach's fixed point theorem, which states that a contraction mapping on a nonempty complete metric space that does not map outside its domain has a unique fixed point, and that all points in the <math display="inline">(\mu, \nu)</math>-domain converge to the fixed point when <math display="inline">g</math> is iteratively applied. <br />
<br />
The two requirements are proven as follows:<br />
<br />
'''1. g is a contraction mapping.'''<br />
<br />
For <math display="inline">g</math> to be a contraction mapping in <math display="inline">\Omega</math> with distance <math display="inline">||\cdot||_2</math>, there must exist a Lipschitz constant <math display="inline">M < 1</math> such that: <br />
<br />
\begin{align} <br />
\forall \mu, \nu \in \Omega: ||g(\mu) - g(\nu)||_2 \leqslant M||\mu - \nu||_2 <br />
\end{align}<br />
<br />
As stated earlier, <math display="inline">g</math> is a contraction mapping if the spectral norm [12] of the Jacobian <math display="inline">\mathcal{H}</math> [[#Footnotes | (3)]] is below one, or equivalently, if the the largest singular value of <math display="inline">\mathcal{H}</math> is less than 1.<br />
<br />
To find the singular values of <math display="inline">\mathcal{H}</math>, the authors used an explicit formula derived by Blinn [2] for <math display="inline">2\times2</math> matrices, which states that the largest singular value of the matrix is <math display="inline">\frac{1}{2}(\sqrt{(a_{11} + a_{22}) ^ 2 + (a_{21} - a{12})^2} + \sqrt{(a_{11} - a_{22}) ^ 2 + (a_{21} + a{12})^2})</math>.<br />
<br />
For <math display="inline">\mathcal{H}</math>, an expression for the largest singular value of <math display="inline">\mathcal{H}</math>, made up of the first-order partial derivatives of the mapping <math display="inline">g</math> with respect to <math display="inline">\mu</math> and <math display="inline">\nu</math>, can be derived given the analytical solutions for <math display="inline">\widetilde{\mu}</math> and <math display="inline">\widetilde{\nu}</math> (and denoted <math display="inline">S(\mu, \omega, \nu, \tau, \lambda, \alpha)</math>).<br />
<br />
From the mean value theorem, we know that for a <math display="inline">t \in [0, 1]</math>, <br />
<br />
[[File:seq.png|600px|center]]<br />
<br />
Therefore, the distance of the singular value at <math display="inline">S(\mu, \omega, \nu, \tau, \lambda_{\mathrm{01}}, \alpha_{\mathrm{01}})</math> and at <math display="inline">S(\mu + \Delta\mu, \omega + \Delta\omega, \nu + \Delta\nu, \tau \Delta\tau, \lambda_{\mathrm{01}}, \alpha_{\mathrm{01}})</math> can be bounded above by <br />
<br />
[[File:seq2.png|600px|center]]<br />
<br />
An upper bound was obtained for each partial derivative term above, mainly through algebraic reformulations and by making use of the fact that many of the functions are monotonically increasing or decreasing on the variables they depend on in <math display="inline">\Omega</math> (see pages 17 - 25 in the supplementary materials).<br />
<br />
The <math display="inline">\Delta</math> terms were then set (rather arbitrarily) to be: <math display="inline">\Delta \mu=0.0068097371</math>,<br />
<math display="inline">\Delta \omega=0.0008292885</math>, <math display="inline">\Delta \nu=0.0009580840</math>, and <math display="inline">\Delta \tau=0.0007323095</math>. Plugging in the upper bounds on the absolute values of the derivative terms for <math display="inline">S</math> and the <math display="inline">\Delta</math> terms yields<br />
<br />
\[ S(\mu + \Delta \mu,\omega + \Delta \omega,\nu + \Delta \nu,\tau + \Delta \tau,\lambda_{\rm 01},\alpha_{\rm 01}) - S(\mu,\omega,\nu,\tau,\lambda_{\rm 01},\alpha_{\rm 01}) < 0.008747 \]<br />
<br />
Next, the largest singular value is found from a computer-assisted fine grid-search [[#Footnotes | (1)]] over the domain <math display="inline">\Omega</math>, with grid lengths <math display="inline">\Delta \mu=0.0068097371</math>, <math display="inline">\Delta \omega=0.0008292885</math>, <math display="inline">\Delta \nu=0.0009580840</math>, and <math display="inline">\Delta \tau=0.0007323095</math>, which turned out to be <math display="inline">0.9912524171058772</math>. Therefore, <br />
<br />
\[ S(\mu + \Delta \mu,\omega + \Delta \omega,\nu + \Delta \nu,\tau + \Delta \tau,\lambda_{\rm 01},\alpha_{\rm 01}) \leq 0.9912524171058772 + 0.008747 < 1 \]<br />
<br />
Since the largest singular value is smaller than 1, <math display="inline>g</math> is a contraction mapping.<br />
<br />
'''2. g does not map outside its domain.'''<br />
<br />
To prove that <math display="inline">g</math> does not map outside of the domain <math display="inline">\mu \in [-0.1, 0.1]</math> and <math display="inline">\nu \in [0.8, 1.5]</math>, lower and upper bounds on <math display="inline">\widetilde{\mu}</math> and <math display="inline">\widetilde{\nu}</math> were obtained to show that they stay within <math display="inline">\Omega</math>. <br />
<br />
First, it was shown that the derivatives of <math display="inline">\widetilde{\mu}</math> and <math display="inline">\widetilde{\xi}</math> with respect to <math display="inline">\mu</math> and <math display="inline">\nu</math> are either positive or have the sign of <math display="inline">\omega</math> in <math display="inline">\Omega</math>, so the minimum and maximum points are found at the borders. In <math display="inline">\Omega</math>, it then follows that<br />
<br />
\begin{align}<br />
-0.03106 <\widetilde{\mu}(-0.1,0.1, 0.8, 0.95, \lambda_{\rm 01}, \alpha_{\rm 01}) \leq & \widetilde{\mu} \leq \widetilde{\mu}(0.1,0.1,1.5, 1.1, \lambda_{\rm 01}, \alpha_{\rm 01}) < 0.06773<br />
\end{align}<br />
<br />
and <br />
<br />
\begin{align}<br />
0.80467 <\widetilde{\xi}(-0.1,0.1, 0.8, 0.95, \lambda_{\rm 01}, \alpha_{\rm 01}) \leq & \widetilde{\xi} \leq \widetilde{\xi}(0.1,0.1,1.5, 1.1, \lambda_{\rm 01}, \alpha_{\rm 01}) < 1.48617.<br />
\end{align}<br />
<br />
Since <math display="inline">\widetilde{\nu} = \widetilde{\xi} - \widetilde{\mu}^2</math>, <br />
<br />
\begin{align}<br />
0.80009 & \leqslant \widetilde{\nu} \leqslant 1.48617<br />
\end{align}<br />
<br />
The bounds on <math display="inline">\widetilde{\mu}</math> and <math display="inline">\widetilde{\nu}</math> are narrower than those for <math display="inline">\mu</math> and <math display="inline">\nu</math> set out in <math display="inline">\Omega</math>, therefore <math display="inline">g(\Omega) \subseteq \Omega</math>.<br />
<br />
==== Theorem 2: Decreasing Variance from Above ====<br />
<br />
'''Definition:''' For <math display="inline">\lambda = \lambda_{01}</math>, <math display="inline">\alpha = \alpha_{01}</math>, and the domain <math display="inline">\Omega^+: -1 \leqslant \mu \leqslant 1, -0.1 \leqslant \omega \leqslant 0.1, 3 \leqslant \nu \leqslant 16</math>, and <math display="inline">0.8 \leqslant \tau \leqslant 1.25</math>, we have for the mapping of the variance <math display="inline">\widetilde{\nu}(\mu, \omega, \nu, \tau, \lambda, \alpha)</math> under <math display="inline">g</math>: <math display="inline">\widetilde{\nu}(\mu, \omega, \nu, \tau, \lambda, \alpha) < \nu</math>.<br />
<br />
Theorem 2 states that when <math display="inline">\nu \in [3, 16]</math>, the mapping <math display="inline">g</math> draws it to below 3 when applied across layers, thereby establishing an upper bound of <math display="inline">\nu < 3</math> on variance.<br />
<br />
'''Proof:''' The authors proved the inequality by showing that <math display="inline">g(\mu, \omega, \xi, \tau, \lambda_{01}, \alpha_{01}) = \widetilde{\xi}(\mu, \omega, \xi, \tau, \lambda_{01}, \alpha_{01}) - \nu < 0</math>, since the second moment should be greater than or equal to variance <math display="inline">\widetilde{\nu}</math>. The behavior of <math display="inline">\frac{\partial }{\partial \mu } \widetilde{\xi}(\mu, \omega, \nu, \tau, \lambda, \alpha)</math>, <math display="inline">\frac{\partial }{\partial \omega } \widetilde{\xi}(\mu, \omega, \nu, \tau, \lambda, \alpha)</math>, <math display="inline">\frac{\partial }{\partial \nu } \widetilde{\xi}(\mu, \omega, \nu, \tau, \lambda, \alpha)</math>, and <math display="inline">\frac{\partial }{\partial \tau } \widetilde{\xi}(\mu, \omega, \nu, \tau, \lambda, \alpha)</math> are used to find the bounds on <math display="inline">g(\mu, \omega, \xi, \tau, \lambda_{01}, \alpha_{01})</math> (see pages 9 - 13 in the supplementary materials). Again, the partial derivative terms were monotonic, which made it possible to find the upper bound at the board values. It was shown that the maximum value of <math display="inline">g</math> does not exceed <math display="inline">-0.0180173</math>.<br />
<br />
==== Theorem 3: Increasing Variance from Below ====<br />
<br />
'''Definition''': We consider <math display="inline">\lambda = \lambda_{01}</math>, <math display="inline">\alpha = \alpha_{01}</math>, and the domain <math display="inline">\Omega^-: -0.1 \leqslant \mu \leqslant 0.1</math> and <math display="inline">-0.1 \leqslant \omega \leqslant 0.1</math>. For the domain <math display="inline">0.02 \leqslant \nu \leqslant 0.16</math> and <math display="inline">0.8 \leqslant \tau \leqslant 1.25</math> as well as for the domain <math display="inline">0.02 \leqslant \nu \leqslant 0.24</math> and <math display="inline">0.9 \leqslant \tau \leqslant 1.25</math>, the mapping of the variance <math display="inline">\widetilde{\nu}(\mu, \omega, \nu, \tau, \lambda, \alpha)</math> increases: <math display="inline">\widetilde{\nu}(\mu, \omega, \nu, \tau, \lambda, \alpha) > \nu</math>.<br />
<br />
Theorem 3 states that the variance <math display="inline">\widetilde{\nu}</math> increases when variance is smaller than in <math display="inline">\Omega</math>. The lower bound on variance is <math display="inline">\widetilde{\nu} > 0.16</math> when <math display="inline">0.8 \leqslant \tau</math> and <math display="inline">\widetilde{\nu} > 0.24</math> when <math display="inline">0.9 \leqslant \tau</math> under the proposed mapping.<br />
<br />
'''Proof:''' According to the mean value theorem, for a <math display="inline">t \in [0, 1]</math>,<br />
<br />
[[File:th3.png|700px|center]]<br />
<br />
Similar to the proof for Theorem 2 (except we are interested in the smallest <math display="inline">\widetilde{\nu}</math> instead of the biggest), the lower bound for <math display="inline">\frac{\partial }{\partial \nu} \widetilde{\xi}(\mu,\omega,\nu+t(\nu_{\mathrm{min}}-\nu),\tau,\lambda_{\rm 01},\alpha_{\rm 01})</math> can be derived, and substituted into the relationship <math display="inline">\widetilde{\nu} = \widetilde{\xi}(\mu,\omega,\nu,\tau,\lambda_{\rm 01},\alpha_{\rm 01}) - (\widetilde{\mu}(\mu,\omega,\nu,\tau,\lambda_{\rm 01},\alpha_{\rm 01}))^2</math>. The lower bound depends on <math display="inline">\tau</math> and <math display="inline">\nu</math>, and in the <math display="inline">\Omega^{-1}</math> listed, it is slightly above <math display="inline">\nu</math>.<br />
<br />
== Implementation Details ==<br />
<br />
=== Initialization ===<br />
<br />
Since SNNs have a fixed point at zero mean and unit variance for normalized weights <math>ω = \sum_{i = 1}^n w_i = 0 </math> and <math>ω = \sum_{i = 1}^n w_i^2 = 1 </math> (see above), we initialize SNNs such that these constraints are fulfilled in expectation. We draw the weights from a Gaussian distribution with E(wi) = 0 and variance Var(wi) = 1/n. Uniform and truncated Gaussian distributions with these moments led to networks with similar behavior. The “MSRA initialization” is similar since it uses zero mean and variance 2/n to initialize the weights [14]. The additional factor 2 counters the effect of rectified linear units.<br />
<br />
<br />
As previously explained, SNNs work best when inputs to the network are standardized, and the weights are initialized with mean of 0 and variance of <math display="inline">\frac{1}{n}</math> to help converge to the fixed point <math display="inline">(\mu, \nu) = (0, 1)</math>.<br />
<br />
=== Dropout Technique ===<br />
<br />
The authors reason that regular dropout, randomly setting activations to 0 with probability <math display="inline">1 - q</math>, is not compatible with SELUs. This is because the low variance region in SELUs is at <math display="inline">\lim_{x \rightarrow -\infty} = -\lambda \alpha</math>, not 0. Contrast this with ReLUs, which work well with dropout since they have <math display="inline">\lim_{x \rightarrow -\infty} = 0</math> as the saturation region. Therefore, a new dropout technique for SELUs was needed, termed ''alpha dropout''.<br />
<br />
With alpha dropout, activations are randomly set to <math display="inline">-\lambda\alpha = \alpha'</math>, which for this paper corresponds to the constant <math display="inline">1.7581</math>, with probability <math display="inline">1 - q</math>.<br />
<br />
The updated mean and variance of the activations are now:<br />
\[ \mathrm{E}(xd + \alpha'(1 - d)) = \mu q + \alpha'(1 - q) \] <br />
<br />
and<br />
<br />
\[ \mathrm{Var}(xd + \alpha'(1 - d)) = q((1-q)(\alpha' - \mu)^2 + \nu) \]<br />
<br />
Activations need to be transformed (e.g. scaled) after dropout to maintain the same mean and variance. In regular dropout, conserving the mean and variance correlates to scaling activations by a factor of 1/q while training. To ensure that mean and variance are unchanged after alpha dropout, the authors used an affine transformation <math display="inline">a(xd + \alpha'(1 - d)) + b</math>, and solved for the values of <math display="inline">a</math> and <math display="inline">b</math> to give <math display="inline">a = (\frac{\nu}{q((1-q)(\alpha' - \mu)^2 + \nu)})^{\frac{1}{2}}</math> and <math display="inline">b = \mu - a(q\mu + (1-q)\alpha'))</math>. As the values for <math display="inline">\mu</math> and <math display="inline">\nu</math> are set to <math display="inline">0</math> and <math display="inline">1</math> throughout the paper, these expressions can be simplified into <math display="inline">a = (q + \alpha'^2 q(1-q))^{-\frac{1}{2}}</math> and <math display="inline">b = -(q + \alpha^2 q (1-q))^{-\frac{1}{2}}((1 - q)\alpha')</math>, where <math display="inline">\alpha' \approx 1.7581</math>.<br />
<br />
Empirically, the authors found that dropout rates (1-q) of <math display="inline">0.05</math> or <math display="inline">0.10</math> worked well with SNNs.<br />
<br />
=== Optimizers ===<br />
<br />
Through experiments, the authors found that stochastic gradient descent, momentum, Adadelta and Adamax work well on SNNs. For Adam, configuration parameters <math display="inline">\beta_2 = 0.99</math> and <math display="inline">\epsilon = 0.01</math> were found to be more effective.<br />
<br />
==Experimental Results==<br />
<br />
Three sets of experiments were conducted to compare the performance of SNNs to six other FNN structures and to other machine learning algorithms, such as support vector machines and random forests. The experiments were carried out on (1) 121 UCI Machine Learning Repository datasets, (2) the Tox21 chemical compounds toxicity effects dataset (with 12,000 compounds and 270,000 features), and (3) the HTRU2 dataset of statistics on radio wave signals from pulsar candidates (with 18,000 observations and eight features). In each set of experiment, hyperparameter search was conducted on a validation set to select parameters such as the number of hidden units, number of hidden layers, learning rate, regularization parameter, and dropout rate (see pages 95 - 107 of the supplementary material for exact hyperparameters considered). Whenever models of different setups gave identical results on the validation data, preference was given to the structure with more layers, lower learning rate and higher dropout rate.<br />
<br />
The six FNN structures considered were: (1) FNNs with ReLU activations, no normalization and “Microsoft weight initialization” (MSRA) [5] to control the variance of input signals [5]; (2) FNNs with batch normalization [6], in which normalization is applied to activations of the same mini-batch; (3) FNNs with layer normalization [1], in which normalization is applied on a per layer basis for each training example; (4) FNNs with weight normalization [8], whereby each layer’s weights are normalized by learning the weight’s magnitude and direction instead of the weight vector itself; (5) highway networks, in which layers are not restricted to being sequentially connected [9]; and (6) an FNN-version of residual networks [4], with residual blocks made up of two or three densely connected layers.<br />
<br />
On the Tox21 dataset, the authors demonstrated the self-normalizing effect by comparing the distribution of neural inputs <math display="inline">z</math> at initialization and after 40 epochs of training to that of the standard normal. As Figure 3 show, the distribution of <math display="inline">z</math> remained similar to a normal distribution.<br />
<br />
[[File:snnf3.png|600px]]<br />
<br />
On all three sets of classification tasks, the authors demonstrated that SNN outperformed the other FNN counterparts on accuracy and AUC measures. <br />
<br />
SNN achieves close to the state-of-the-art results on the Tox21 dataset with an 8-layer network. The challenge requires prediction of toxic effects of 12000 chemicals based on their chemical structures. SNN with 8 layers had the best performance.<br />
<br />
[[File:tox21.png|600px]]<br />
<br />
<br />
On UCI datasets with fewer than 1,000 observations, SNNs did not outperform SVMs or random forests in terms of average rank in accuracy, but on datasets with at least 1,000 observations, SNNs showed the best overall performance (average rank of 5.8, compared to 6.1 for support vector machines and 6.6 for random forests). Through hyperparameter tuning, it was also discovered that the average depth of FNNs is 10.8 layers, more than the other FNN architectures tried.<br />
<br />
<br />
<br />
In pulsar predictions with the HTRU dataset SNNs produced a new state-of-the-art AUC by a small margin (achieving an AUC 0.98, averaged over 10 cross-validation folds, versus the previous record of 0.976).<br />
[[File:htru2.png|600px]]<br />
<br />
==Future Work==<br />
<br />
Although not the focus of this paper, the authors also briefly noted that their initial experiments with applying SELUs on relatively simple CNN structures showed promising results, which is not surprising given that ELUs, which do not have the self-normalizing property, has already been shown to work well with CNNs, demonstrating faster convergence than ReLU networks and even pushing the state-of-the-art error rates on CIFAR-100 at the time of publishing in 2015 [3].<br />
<br />
Since the paper was published, SELUs have been adopted by several researchers, not just with FNNs [https://github.com/bioinf-jku/SNNs see link], but also with CNNs, GANs, autoencoders, reinforcement learning and RNNs. In a few cases, researchers for those papers concluded that networks trained with SELUs converged faster than those trained with ReLUs, and that SELUs have the same convergence quality as batch normalization. There is potential for SELUs to be incorporated into more architectures in the future.<br />
<br />
==Critique==<br />
<br />
Overall, the authors presented a convincing case for using SELUs (along with proper initialization and alpha dropout) on FNNs. FNNs trained with SELU have more layers than those with other normalization techniques, so the work here provides a promising direction for making traditional FNNs more powerful. There are not as many well-established benchmark datasets to evaluate FNNs, but the experiments carried out, particularly on the larger Tox21 dataset, showed that SNNs can be very effective at classification tasks.<br />
<br />
The proofs provide a satisfactory explanation for why SELUs have a self-normalising property within the specified domain, but during their introduction the authors give 4 criteria that an activation function must satisfy in order to be self-normalising. Those criteria make intuitive sense, but there is a lack of firm justification which creates some confusion. For example, they state that SNNs cannot be derived from tanh units, even though <math> tanh(\lambda x) </math> satisfies all 4 criteria if <math> \lambda </math> is larger than 1. Assuming the authors did not overlook such a simple modification, there must be some additional criteria for an activation function to have a self-normalising property.<br />
<br />
The only question I have with the proofs is the lack of explanation for how the domains, <math display="inline">\Omega</math>, <math display="inline">\Omega^-</math> and <math display="inline">\Omega^+</math> are determined, which is an important consideration because they are used for deriving the upper and lower bounds on expressions needed for proving the three theorems. The ranges appear somewhat set through trial-and-error and heuristics to ensure the numbers work out (e.g. make the spectral norm [12] of <math display="inline">\mathcal{J}</math> as large as can be below 1 so as to ensure <math display="inline">g</math> is a contraction mapping), so it is not clear whether they are unique conditions, or that the parameters will remain within those prespecified ranges throughout training; and if the parameters can stray away from the ranges provided, then the issue of what will happen to the self-normalizing property was not addressed. Perhaps that is why the authors gave preference to models with a deeper structure and smaller learning rate during experiments to help the parameters stay within their domains. Further, in addition to the hyperparameters considered, it would be helpful to know the final values that went into the best-performing models, for a better understanding of what range of values work better for SNNs empirically.<br />
<br />
==Conclusion==<br />
<br />
The SNN structure proposed in this paper is built on the traditional FNN structure with a few modifications, including the use of SELUs as the activation function (with <math display="inline">\lambda \approx 1.0507</math> and <math display="inline">\alpha \approx 1.6733</math>), alpha dropout, network weights initialized with mean of zero and variance <math display="inline">\frac{1}{n}</math>, and inputs normalized to mean of zero and variance of one. It is simple to implement while being backed up by detailed theory. <br />
<br />
When properly initialized, SELUs will draw neural inputs towards a fixed point of zero mean and unit variance as the activations are propagated through the layers. The self-normalizing property is maintained even when weights deviate from their initial values during training (under mild conditions). When the variance of inputs goes beyond the prespecified range imposed, they are still bounded above and below so SNNs do not suffer from exploding and vanishing gradients. This self-normalizing property allows SNNs to be more robust to perturbations in stochastic gradient descent, so deeper structures with better prediction performance can be built. <br />
<br />
In the experiments conducted, the authors demonstrated that SNNs outperformed FNNs trained with other normalization techniques, such as batch, layer and weight normalization, and specialized architectures, such as highway or residual networks, on several classification tasks, including on the UCI Machine Learning Repository datasets. SELUs help in reducing the computation time for normalizing the network relative to RELU+BN and hence is promising. The adoption of SELUs by other researchers also lends credence to the potential for SELUs to be implemented in more neural network architectures.<br />
<br />
==References==<br />
<br />
# Ba, Kiros and Hinton. "Layer Normalization". arXiv:1607.06450. (2016).<br />
# Blinn. "Consider the Lowly 2X2 Matrix." IEEE Computer Graphics and Applications. (1996).<br />
# Clevert, Unterthiner, Hochreiter. "Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)." arXiv: 1511.07289. (2015).<br />
# He, Zhang, Ren and Sun. "Deep Residual Learning for Image Recognition." arXiv:1512.03385. (2015).<br />
# He, Zhang, Ren and Sun. "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification." arXiv:1502.01852. (2015). <br />
# Ioffe and Szegedy. "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariance Shift." arXiv:1502.03167. (2015).<br />
# Klambauer, Unterthiner, Mayr and Hochreiter. "Self-Normalizing Neural Networks." arXiv: 1706.02515. (2017).<br />
# Salimans and Kingma. "Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks." arXiv:1602.07868. (2016).<br />
# Srivastava, Greff and Schmidhuber. "Highway Networks." arXiv:1505.00387 (2015).<br />
# Unterthiner, Mayr, Klambauer and Hochreiter. "Toxicity Prediction Using Deep Learning." arXiv:1503.01445. (2015). <br />
# https://en.wikipedia.org/wiki/Central_limit_theorem <br />
# http://mathworld.wolfram.com/SpectralNorm.html <br />
# https://www.math.umd.edu/~petersd/466/fixedpoint.pdf<br />
<br />
==Online Resources==<br />
https://github.com/bioinf-jku/SNNs (GitHub repository maintained by some of the paper's authors)<br />
<br />
==Footnotes==<br />
<br />
1. Error propagation analysis: The authors performed an error analysis to quantify the potential numerical imprecisions propagated through the numerous operations performed. The potential imprecision <math display="inline">\epsilon</math> was quantified by applying the mean value theorem<br />
<br />
\[ |f(x + \Delta x - f(x)| \leqslant ||\triangledown f(x + t\Delta x|| ||\Delta x|| \textrm{ for } t \in [0, 1]\textrm{.} \] <br />
<br />
The error propagation rules, or <math display="inline">|f(x + \Delta x - f(x)|</math>, was first obtained for simple operations such as addition, subtraction, multiplication, division, square root, exponential function, error function and complementary error function. Them, the error bounds on the compound terms making up <math display="inline">\Delta (S(\mu, \omega, \nu, \tau, \lambda, \alpha)</math> were found by decomposing them into the simpler expressions. If each of the variables have a precision of <math display="inline">\epsilon</math>, then it turns out <math display="inline">S</math> has a precision better than <math display="inline">292\epsilon</math>. For a machine with a precision of <math display="inline">2^{-56}</math>, the rounding error is <math display="inline">\epsilon \approx 10^{-16}</math>, and <math display="inline">292\epsilon < 10^{-13}</math>. In addition, all computations are correct up to 3 ulps (“unit in last place”) for the hardware architectures and GNU C library used, with 1 ulp being the highest precision that can be achieved.<br />
<br />
2. Independence Assumption: The classic definition of central limit theorem requires <math display="inline">x_i</math>’s to be independent and identically distributed, which is not guaranteed to hold true in a neural network layer. However, according to the Lyapunov CLT, the <math display="inline">x_i</math>’s do not need to be identically distributed as long as the <math display="inline">(2 + \delta)</math>th moment exists for the variables and meet the Lyapunov condition for the rate of growth of the sum of the moments [11]. In addition, CLT has also shown to be valid under weak dependence under mixing conditions [11]. Therefore, the authors argue that the central limit theorem can be applied with network inputs.<br />
<br />
3. <math display="inline">\mathcal{H}</math> versus <math display="inline">\mathcal{J}</math> Jacobians: In solving for the largest singular value of the Jacobian <math display="inline">\mathcal{H}</math> for the mapping <math display="inline">g: (\mu, \nu)</math>, the authors first worked with the terms in the Jacobian <math display="inline">\mathcal{J}</math> for the mapping <math display="inline">h: (\mu, \nu) \rightarrow (\widetilde{\mu}, \widetilde{\xi})</math> instead, because the influence of <math display="inline">\widetilde{\mu}</math> on <math display="inline">\widetilde{\nu}</math> is small when <math display="inline">\widetilde{\mu}</math> is small in <math display="inline">\Omega</math> and <math display="inline">\mathcal{H}</math> can be easily expressed as terms in <math display="inline">\mathcal{J}</math>. <math display="inline">\mathcal{J}</math> was referenced in the paper, but I used <math display="inline">\mathcal{H}</math> in the summary here to avoid confusion.</div>F7xiahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Self_Normalizing_Neural_Networks&diff=36442stat946w18/Self Normalizing Neural Networks2018-04-21T03:23:38Z<p>F7xia: /* Initialization */</p>
<hr />
<div>==Introduction and Motivation==<br />
<br />
While neural networks have been making a lot of headway in improving benchmark results and narrowing the gap with human-level performance, success has been fairly limited to visual and sequential processing tasks through advancements in convolutional network and recurrent network structures. Most data science competitions outside of these domains are still outperformed by algorithms such as gradient boosting and random forests. The traditional (densely connected) feed-forward neural networks (FNNs) are rarely used competitively, and when they do win on rare occasions, they have very shallow network architectures with just up to four layers [10].<br />
<br />
The authors, Klambauer et al., believe that what prevents FNNs from becoming more competitively useful is the inability to train a deeper FNN structure, which would allow the network to learn more levels of abstract representations. Primarily, the difficulty arises due to the instability of gradients in very deep FNNs leading to problems like gradient vanishing/explosion. To have a deeper network, oscillations in the distribution of activations need to be kept under control so that stable gradients can be obtained during training. Several techniques are available to normalize activations, including batch normalization [6], layer normalization [1] and weight normalization [8]. These methods work well with CNNs and RNNs, but not so much with FNNs because backpropagating through normalization parameters introduces additional variance to the gradients, and regularization techniques like dropout further perturb the normalization effect. CNNs and RNNs are less sensitive to such perturbations, presumably due to their weight sharing architecture, but FNNs do not have such a property and thus suffer from high variance in training errors, which hinders learning. Furthermore, the aforementioned normalization techniques involve adding external layers to the model and can slow down computation, which may already be slow when working with very deep FNNs. <br />
<br />
Therefore, the authors were motivated to develop a new FNN implementation that can achieve the intended effect of normalization techniques that works well with stochastic gradient descent and dropout. Self-normalizing neural networks (SNNs) are based on the idea of scaled exponential linear units (SELU), a new activation function introduced in this paper, whose output distribution is proved to converge to a fixed point, thus making it possible to train deeper networks.<br />
<br />
==Notations==<br />
<br />
As the paper (primarily in the supplementary materials) comes with lengthy proofs, important notations are listed first.<br />
<br />
Consider two fully-connected layers, let <math display="inline">x</math> denote the inputs to the second layer, then <math display="inline">z = Wx</math> represents the network inputs of the second layer, and <math display="inline">y = f(z)</math> represents the activations in the second layer.<br />
<br />
Assume that all activations from a lower layer <math display="inline">x_i</math>'s, <math display="inline">1 \leqslant i \leqslant n</math>, have the same mean <math display="inline">\mu := \mathrm{E}(x_i)</math> and the same variance <math display="inline">\nu := \mathrm{Var}(x_i)</math> and that each <math display="inline">y</math> has mean <math display="inline">\widetilde{\mu} := \mathrm{E}(y)</math> and variance <math display="inline">\widetilde{\nu} := \mathrm{Var}(y)</math>, then let <math display="inline">g</math> be the set of functions that maps <math display="inline">(\mu, \nu)</math> to <math display="inline">(\widetilde{\mu}, \widetilde{\nu})</math>. <br />
<br />
For the weight vector <math display="inline">w</math>, <math display="inline">n</math> times the mean of the weight vector is <math display="inline">\omega := \sum_{i = 1}^n \omega_i</math> and <math display="inline">n</math> times the second moment is <math display="inline">\tau := \sum_{i = 1}^{n} w_i^2</math>.<br />
<br />
==Key Concepts==<br />
<br />
===Self-Normalizing Neural-Net (SNN)===<br />
<br />
''A neural network is self-normalizing if it possesses a mapping <math display="inline">g: \Omega \rightarrow \Omega</math> for each activation <math display="inline">y</math> that maps mean and variance from one layer to the next and has a stable and attracting fixed point depending on <math display="inline">(\omega, \tau)</math> in <math display="inline">\Omega</math>. Furthermore, the mean and variance remain in the domain <math display="inline">\Omega</math>, that is <math display="inline">g(\Omega) \subseteq \Omega</math>, where <math display="inline">\Omega = \{ (\mu, \nu) | \mu \in [\mu_{min}, \mu_{max}], \nu \in [\nu_{min}, \nu_{max}] \}</math>. When iteratively applying the mapping <math display="inline">g</math>, each point within <math display="inline">\Omega</math> converges to this fixed point.''<br />
<br />
In other words, in SNNs, if the inputs from an earlier layer (<math display="inline">x</math>) already have their mean and variance within a predefined interval <math display="inline">\Omega</math>, then the activations to the next layer (<math display="inline">y = f(z = Wx)</math>) should remain within those intervals. This is true across all pairs of connecting layers as the normalizing effect gets propagated through the network, hence why the term self-normalizing. When the mapping is applied iteratively, it should draw the mean and variance values closer to a fixed point within <math display="inline">\Omega</math>, the value of which depends on <math display="inline">\omega</math> and <math display="inline">\tau</math> (recall that they are from the weight vector).<br />
<br />
We will design a FNN then construct a g that takes the mean and variance of each layer to those of the next and is a contraction mapping i.e. <math>g(\mu_i, \nu_i) = (\mu_{i+1}, \nu_{i+1}) \forall i </math>. It should be noted that although the g required in the SNN definition depends on <math display="inline">(\omega, \tau)</math> of an individual layer, the FNN that we construct will have the same values of <math display="inline">(\omega, \tau)</math> for each layer. Intuitively this definition can be interpreted as saying that the mean and variance of the final layer of a sufficiently deep SNN will not change when the mean and variance of the input data change. This is because the mean and variance are passing through a contraction mapping at each layer, converging to the mapping's fixed point.<br />
<br />
The activation function that makes an SNN possible should meet the following four conditions:<br />
<br />
# It can take on both negative and positive values, so it can normalize the mean;<br />
# It has a saturation region, so it can dampen variances that are too large;<br />
# It has a slope larger than one, so it can increase variances that are too small; and<br />
# It is a continuous curve, which is necessary for the fixed point to exist (see the definition of Banach fixed point theorem to follow).<br />
<br />
Commonly used activation functions such as rectified linear units (ReLU), sigmoid, tanh, leaky ReLUs and exponential linear units (ELUs) do not meet all four criteria, therefore, a new activation function is needed.<br />
<br />
===Scaled Exponential Linear Units (SELUs)===<br />
<br />
One of the main ideas introduced in this paper is the SELU function. As the name suggests, it is closely related to ELU [3],<br />
<br />
\[ \mathrm{elu}(x) = \begin{cases} x & x > 0 \\<br />
\alpha e^x - \alpha & x \leqslant 0<br />
\end{cases} \]<br />
<br />
but further builds upon it by introducing a new scale parameter $\lambda$ and proving the exact values that $\alpha$ and $\lambda$ should take on to achieve self-normalization. SELU is defined as:<br />
<br />
\[ \mathrm{selu}(x) = \lambda \begin{cases} x & x > 0 \\<br />
\alpha e^x - \alpha & x \leqslant 0<br />
\end{cases} \]<br />
<br />
SELUs meet all four criteria listed above - it takes on positive values when <math display="inline">x > 0</math> and negative values when <math display="inline">x < 0</math>, it has a saturation region when <math display="inline">x</math> is a larger negative value, the value of <math display="inline">\lambda</math> can be set to greater than one to ensure a slope greater than one, and it is continuous at <math display="inline">x = 0</math>. <br />
<br />
Figure 1 below gives an intuition for how SELUs normalize activations across layers. As shown, a variance dampening effect occurs when inputs are negative and far away from zero, and a variance increasing effect occurs when inputs are close to zero.<br />
<br />
[[File:snnf1.png|500px]]<br />
<br />
Figure 2 below plots the progression of training error on the MNIST and CIFAR10 datasets when training with SNNs versus FNNs with batch normalization at varying model depths. As shown, FNNs that adopted the SELU activation function exhibited lower and less variable training loss compared to using batch normalization, even as the depth increased to 16 and 32 layers.<br />
<br />
[[File:snnf2.png|600px]]<br />
<br />
=== Banach Fixed Point Theorem and Contraction Mappings ===<br />
<br />
The underlying theory behind SNNs is the Banach fixed point theorem, which states the following: ''Let <math display="inline">(X, d)</math> be a non-empty complete metric space with a contraction mapping <math display="inline">f: X \rightarrow X</math>. Then <math display="inline">f</math> has a unique fixed point <math display="inline">x_f \subseteq X</math> with <math display="inline">f(x_f) = x_f</math>. Every sequence <math display="inline">x_n = f(x_{n-1})</math> with starting element <math display="inline">x_0 \subseteq X</math> converges to the fixed point: <math display="inline">x_n \underset{n \rightarrow \infty}\rightarrow x_f</math>.''<br />
<br />
A contraction mapping is a function <math display="inline">f: X \rightarrow X</math> on a metric space <math display="inline">X</math> with distance <math display="inline">d</math>, such that for all points <math display="inline">\mathbf{u}</math> and <math display="inline">\mathbf{v}</math> in <math display="inline">X</math>: <math display="inline">d(f(\mathbf{u}), f(\mathbf{v})) \leqslant \delta d(\mathbf{u}, \mathbf{v})</math>, for a <math display="inline">0 \leqslant \delta \leqslant 1</math>.<br />
<br />
The easiest way to prove a contraction mapping is usually to show that the spectral norm [12] of its Jacobian is less than 1 [13], as was done for this paper.<br />
<br />
==Proving the Self-Normalizing Property==<br />
<br />
===Mean and Variance Mapping Function===<br />
<br />
<math display="inline">g</math> is derived under the assumption that <math display="inline">x_i</math>'s are independent but not necessarily having the same mean and variance [[#Footnotes |(2)]]. Under this assumption (and recalling earlier notation of <math display="inline">\omega</math> and <math display="inline">\tau</math>),<br />
<br />
\begin{align}<br />
\mathrm{E}(z = \mathbf{w}^T \mathbf{x}) = \sum_{i = 1}^n w_i \mathrm{E}(x_i) = \mu \omega<br />
\end{align}<br />
<br />
\begin{align}<br />
\mathrm{Var}(z) = \mathrm{Var}(\sum_{i = 1}^n w_i x_i) = \sum_{i = 1}^n w_i^2 \mathrm{Var}(x_i) = \nu \sum_{i = 1}^n w_i^2 = \nu\tau \textrm{ .}<br />
\end{align}<br />
<br />
When the weight terms are normalized, <math display="inline">z</math> can be viewed as a weighted sum of <math display="inline">x_i</math>'s. Wide neural net layers with a large number of nodes is common, so <math display="inline">n</math> is usually large, and by the Central Limit Theorem, <math display="inline">z</math> approaches a normal distribution <math display="inline">\mathcal{N}(\mu\omega, \sqrt{\nu\tau})</math>. <br />
<br />
Using the above property, the exact form for <math display="inline">g</math> can be obtained using the definitions for mean and variance of continuous random variables: <br />
<br />
[[File:gmapping.png|600px|center]]<br />
<br />
Analytical solutions for the integrals can be obtained as follows: <br />
<br />
[[File:gintegral.png|600px|center]]<br />
<br />
The authors are interested in the fixed point <math display="inline">(\mu, \nu) = (0, 1)</math> as these are the parameters associated with the common standard normal distribution. The authors also proposed using normalized weights such that <math display="inline">\omega = \sum_{i = 1}^n = 0</math> and <math display="inline">\tau = \sum_{i = 1}^n w_i^2= 1</math> as it gives a simpler, cleaner expression for <math display="inline">\widetilde{\mu}</math> and <math display="inline">\widetilde{\nu}</math> in the calculations in the next steps. This weight scheme can be achieved in several ways, for example, by drawing from a normal distribution <math display="inline">\mathcal{N}(0, \frac{1}{n})</math> or from a uniform distribution <math display="inline">U(-\sqrt{3}, \sqrt{3})</math>.<br />
<br />
At <math display="inline">\widetilde{\mu} = \mu = 0</math>, <math display="inline">\widetilde{\nu} = \nu = 1</math>, <math display="inline">\omega = 0</math> and <math display="inline">\tau = 1</math>, the constants <math display="inline">\lambda</math> and <math display="inline">\alpha</math> from the SELU function can be solved for - <math display="inline">\lambda_{01} \approx 1.0507</math> and <math display="inline">\alpha_{01} \approx 1.6733</math>. These values are used throughout the rest of the paper whenever an expression calls for <math display="inline">\lambda</math> and <math display="inline">\alpha</math>.<br />
<br />
===Details of Moment-Mapping Integrals ===<br />
Consider the moment-mapping integrals:<br />
\begin{align}<br />
\widetilde{\mu} & = \int_{-\infty}^\infty \mathrm{selu} (z) p_N(z; \mu \omega, \sqrt{\nu \tau})dz\\<br />
\widetilde{\nu} & = \int_{-\infty}^\infty \mathrm{selu} (z)^2 p_N(z; \mu \omega, \sqrt{\nu \tau}) dz-\widetilde{\mu}^2.<br />
\end{align}<br />
<br />
The equation for <math display="inline">\widetilde{\mu}</math> can be expanded as <br />
\begin{align}<br />
\widetilde{\mu} & = \frac{\lambda}{2}\left( 2\alpha\int_{-\infty}^0 (\exp(z)-1) p_N(z; \mu \omega, \sqrt{\nu \tau})dz +2\int_{0}^\infty z p_N(z; \mu \omega, \sqrt{\nu \tau})dz \right)\\<br />
&= \frac{\lambda}{2}\left( 2 \alpha \frac{1}{\sqrt{2\pi\tau\nu}} \int_{-\infty}^0 (\exp(z)-1) \exp(\frac{-1}{2\tau \nu} (z-\mu \omega)^2 ) dz +2\frac{1}{\sqrt{2\pi\tau\nu}}\int_{0}^\infty z \exp(\frac{-1}{2\tau \nu} (z-\mu \omega)^2 dz \right)\\<br />
&= \frac{\lambda}{2}\left( 2 \alpha\frac{1}{\sqrt{2\pi\tau\nu}}\int_{-\infty}^0 \exp(z) \exp(\frac{-1}{2\tau \nu} (z-\mu \omega)^2 ) dz - 2 \alpha\frac{1}{\sqrt{2\pi\tau\nu}}\int_{-\infty}^0 \exp(\frac{-1}{2\tau \nu} (z-\mu \omega)^2 ) dz +2\frac{1}{\sqrt{2\pi\tau\nu}}\int_{0}^\infty z \exp(\frac{-1}{2\tau \nu} (z-\mu \omega)^2 dz \right)\\<br />
\end{align}<br />
<br />
The first integral can be simplified via the substituiton<br />
\begin{align}<br />
q:= \frac{1}{\sqrt{2\tau \nu}}(z-\mu \omega -\tau \nu).<br />
\end{align}<br />
While the second and third can be simplified via the substitution<br />
\begin{align}<br />
q:= \frac{1}{\sqrt{2\tau \nu}}(z-\mu \omega ).<br />
\end{align}<br />
Using the definitions of <math display="inline">\mathrm{erf}</math> and <math display="inline">\mathrm{erfc}</math> then yields the result of the previous section.<br />
<br />
===Self-Normalizing Property Under Normalized Weights===<br />
<br />
Assuming the the weights normalized with <math display="inline">\omega=0</math> and <math display="inline">\tau=1</math>, it is possible to calculate the exact value for the spectral norm [12] of <math display="inline">g</math>'s Jacobian around the fixed point <math display="inline">(\mu, \nu) = (0, 1)</math>, which turns out to be <math display="inline">0.7877</math>. Thus, at initialization, SNNs have a stable and attracting fixed point at <math display="inline">(0, 1)</math>, which means that when <math display="inline">g</math> is applied iteratively to a pair <math display="inline">(\mu_{new}, \nu_{new})</math>, it should draw the points closer to <math display="inline">(0, 1)</math>. The rate of convergence is determined by the spectral norm [12], whose value depends on <math display="inline">\mu</math>, <math display="inline">\nu</math>, <math display="inline">\omega</math> and <math display="inline">\tau</math>.<br />
<br />
[[File:paper10_fig2.png|600px|frame|none|alt=Alt text|The figure illustrates, in the scenario described above, the mapping <math display="inline">g</math> of mean and variance <math display="inline">(\mu, \nu)</math> to <math display="inline">(\mu_{new}, \nu_{new})</math>. The arrows show the direction <math display="inline">(\mu, \nu)</math> is mapped by <math display="inline">g: (\mu, \nu)\mapsto(\mu_{new}, \nu_{new})</math>. One can clearly see the fixed point mapping <math display="inline">g</math> is at <math display="inline">(0, 1)</math>.]]<br />
<br />
===Self-Normalizing Property Under Unnormalized Weights===<br />
<br />
As weights are updated during training, there is no guarantee that they would remain normalized. The authors addressed this issue through the first key theorem presented in the paper, which states that a fixed point close to (0, 1) can still be obtained if <math display="inline">\mu</math>, <math display="inline">\nu</math>, <math display="inline">\omega</math> and <math display="inline">\tau</math> are restricted to a specified range. <br />
<br />
Additionally, there is no guarantee that the mean and variance of the inputs would stay within the range given by the first theorem, which led to the development of theorems #2 and #3. These two theorems established an upper and lower bound on the variance of inputs if the variance of activations from the previous layer are above or below the range specified, respectively. This ensures that the variance would not explode or vanish after being propagated through the network.<br />
<br />
The theorems come with lengthy proofs in the supplementary materials for the paper. High-level proof sketches are presented here.<br />
<br />
====Theorem 1: Stable and Attracting Fixed Points Close to (0, 1)====<br />
<br />
'''Definition:''' We assume <math display="inline">\alpha = \alpha_{01}</math> and <math display="inline">\lambda = \lambda_{01}</math>. We restrict the range of the variables to the domain <math display="inline">\mu \in [-0.1, 0.1]</math>, <math display="inline">\omega \in [-0.1, 0.1]</math>, <math display="inline">\nu \in [0.8, 1.5]</math>, and <math display="inline">\tau \in [0.9, 1.1]</math>. For <math display="inline">\omega = 0</math> and <math display="inline">\tau = 1</math>, the mapping has the stable fixed point <math display="inline">(\mu, \nu) = (0, 1</math>. For other <math display="inline">\omega</math> and <math display="inline">\tau</math>, g has a stable and attracting fixed point depending on <math display="inline">(\omega, \tau)</math> in the <math display="inline">(\mu, \nu)</math>-domain: <math display="inline">\mu \in [-0.03106, 0.06773]</math> and <math display="inline">\nu \in [0.80009, 1.48617]</math>. All points within the <math display="inline">(\mu, \nu)</math>-domain converge when iteratively applying the mapping to this fixed point.<br />
<br />
'''Proof:''' In order to show the the mapping <math display="inline">g</math> has a stable and attracting fixed point close to <math display="inline">(0, 1)</math>, the authors again applied Banach's fixed point theorem, which states that a contraction mapping on a nonempty complete metric space that does not map outside its domain has a unique fixed point, and that all points in the <math display="inline">(\mu, \nu)</math>-domain converge to the fixed point when <math display="inline">g</math> is iteratively applied. <br />
<br />
The two requirements are proven as follows:<br />
<br />
'''1. g is a contraction mapping.'''<br />
<br />
For <math display="inline">g</math> to be a contraction mapping in <math display="inline">\Omega</math> with distance <math display="inline">||\cdot||_2</math>, there must exist a Lipschitz constant <math display="inline">M < 1</math> such that: <br />
<br />
\begin{align} <br />
\forall \mu, \nu \in \Omega: ||g(\mu) - g(\nu)||_2 \leqslant M||\mu - \nu||_2 <br />
\end{align}<br />
<br />
As stated earlier, <math display="inline">g</math> is a contraction mapping if the spectral norm [12] of the Jacobian <math display="inline">\mathcal{H}</math> [[#Footnotes | (3)]] is below one, or equivalently, if the the largest singular value of <math display="inline">\mathcal{H}</math> is less than 1.<br />
<br />
To find the singular values of <math display="inline">\mathcal{H}</math>, the authors used an explicit formula derived by Blinn [2] for <math display="inline">2\times2</math> matrices, which states that the largest singular value of the matrix is <math display="inline">\frac{1}{2}(\sqrt{(a_{11} + a_{22}) ^ 2 + (a_{21} - a{12})^2} + \sqrt{(a_{11} - a_{22}) ^ 2 + (a_{21} + a{12})^2})</math>.<br />
<br />
For <math display="inline">\mathcal{H}</math>, an expression for the largest singular value of <math display="inline">\mathcal{H}</math>, made up of the first-order partial derivatives of the mapping <math display="inline">g</math> with respect to <math display="inline">\mu</math> and <math display="inline">\nu</math>, can be derived given the analytical solutions for <math display="inline">\widetilde{\mu}</math> and <math display="inline">\widetilde{\nu}</math> (and denoted <math display="inline">S(\mu, \omega, \nu, \tau, \lambda, \alpha)</math>).<br />
<br />
From the mean value theorem, we know that for a <math display="inline">t \in [0, 1]</math>, <br />
<br />
[[File:seq.png|600px|center]]<br />
<br />
Therefore, the distance of the singular value at <math display="inline">S(\mu, \omega, \nu, \tau, \lambda_{\mathrm{01}}, \alpha_{\mathrm{01}})</math> and at <math display="inline">S(\mu + \Delta\mu, \omega + \Delta\omega, \nu + \Delta\nu, \tau \Delta\tau, \lambda_{\mathrm{01}}, \alpha_{\mathrm{01}})</math> can be bounded above by <br />
<br />
[[File:seq2.png|600px|center]]<br />
<br />
An upper bound was obtained for each partial derivative term above, mainly through algebraic reformulations and by making use of the fact that many of the functions are monotonically increasing or decreasing on the variables they depend on in <math display="inline">\Omega</math> (see pages 17 - 25 in the supplementary materials).<br />
<br />
The <math display="inline">\Delta</math> terms were then set (rather arbitrarily) to be: <math display="inline">\Delta \mu=0.0068097371</math>,<br />
<math display="inline">\Delta \omega=0.0008292885</math>, <math display="inline">\Delta \nu=0.0009580840</math>, and <math display="inline">\Delta \tau=0.0007323095</math>. Plugging in the upper bounds on the absolute values of the derivative terms for <math display="inline">S</math> and the <math display="inline">\Delta</math> terms yields<br />
<br />
\[ S(\mu + \Delta \mu,\omega + \Delta \omega,\nu + \Delta \nu,\tau + \Delta \tau,\lambda_{\rm 01},\alpha_{\rm 01}) - S(\mu,\omega,\nu,\tau,\lambda_{\rm 01},\alpha_{\rm 01}) < 0.008747 \]<br />
<br />
Next, the largest singular value is found from a computer-assisted fine grid-search [[#Footnotes | (1)]] over the domain <math display="inline">\Omega</math>, with grid lengths <math display="inline">\Delta \mu=0.0068097371</math>, <math display="inline">\Delta \omega=0.0008292885</math>, <math display="inline">\Delta \nu=0.0009580840</math>, and <math display="inline">\Delta \tau=0.0007323095</math>, which turned out to be <math display="inline">0.9912524171058772</math>. Therefore, <br />
<br />
\[ S(\mu + \Delta \mu,\omega + \Delta \omega,\nu + \Delta \nu,\tau + \Delta \tau,\lambda_{\rm 01},\alpha_{\rm 01}) \leq 0.9912524171058772 + 0.008747 < 1 \]<br />
<br />
Since the largest singular value is smaller than 1, <math display="inline>g</math> is a contraction mapping.<br />
<br />
'''2. g does not map outside its domain.'''<br />
<br />
To prove that <math display="inline">g</math> does not map outside of the domain <math display="inline">\mu \in [-0.1, 0.1]</math> and <math display="inline">\nu \in [0.8, 1.5]</math>, lower and upper bounds on <math display="inline">\widetilde{\mu}</math> and <math display="inline">\widetilde{\nu}</math> were obtained to show that they stay within <math display="inline">\Omega</math>. <br />
<br />
First, it was shown that the derivatives of <math display="inline">\widetilde{\mu}</math> and <math display="inline">\widetilde{\xi}</math> with respect to <math display="inline">\mu</math> and <math display="inline">\nu</math> are either positive or have the sign of <math display="inline">\omega</math> in <math display="inline">\Omega</math>, so the minimum and maximum points are found at the borders. In <math display="inline">\Omega</math>, it then follows that<br />
<br />
\begin{align}<br />
-0.03106 <\widetilde{\mu}(-0.1,0.1, 0.8, 0.95, \lambda_{\rm 01}, \alpha_{\rm 01}) \leq & \widetilde{\mu} \leq \widetilde{\mu}(0.1,0.1,1.5, 1.1, \lambda_{\rm 01}, \alpha_{\rm 01}) < 0.06773<br />
\end{align}<br />
<br />
and <br />
<br />
\begin{align}<br />
0.80467 <\widetilde{\xi}(-0.1,0.1, 0.8, 0.95, \lambda_{\rm 01}, \alpha_{\rm 01}) \leq & \widetilde{\xi} \leq \widetilde{\xi}(0.1,0.1,1.5, 1.1, \lambda_{\rm 01}, \alpha_{\rm 01}) < 1.48617.<br />
\end{align}<br />
<br />
Since <math display="inline">\widetilde{\nu} = \widetilde{\xi} - \widetilde{\mu}^2</math>, <br />
<br />
\begin{align}<br />
0.80009 & \leqslant \widetilde{\nu} \leqslant 1.48617<br />
\end{align}<br />
<br />
The bounds on <math display="inline">\widetilde{\mu}</math> and <math display="inline">\widetilde{\nu}</math> are narrower than those for <math display="inline">\mu</math> and <math display="inline">\nu</math> set out in <math display="inline">\Omega</math>, therefore <math display="inline">g(\Omega) \subseteq \Omega</math>.<br />
<br />
==== Theorem 2: Decreasing Variance from Above ====<br />
<br />
'''Definition:''' For <math display="inline">\lambda = \lambda_{01}</math>, <math display="inline">\alpha = \alpha_{01}</math>, and the domain <math display="inline">\Omega^+: -1 \leqslant \mu \leqslant 1, -0.1 \leqslant \omega \leqslant 0.1, 3 \leqslant \nu \leqslant 16</math>, and <math display="inline">0.8 \leqslant \tau \leqslant 1.25</math>, we have for the mapping of the variance <math display="inline">\widetilde{\nu}(\mu, \omega, \nu, \tau, \lambda, \alpha)</math> under <math display="inline">g</math>: <math display="inline">\widetilde{\nu}(\mu, \omega, \nu, \tau, \lambda, \alpha) < \nu</math>.<br />
<br />
Theorem 2 states that when <math display="inline">\nu \in [3, 16]</math>, the mapping <math display="inline">g</math> draws it to below 3 when applied across layers, thereby establishing an upper bound of <math display="inline">\nu < 3</math> on variance.<br />
<br />
'''Proof:''' The authors proved the inequality by showing that <math display="inline">g(\mu, \omega, \xi, \tau, \lambda_{01}, \alpha_{01}) = \widetilde{\xi}(\mu, \omega, \xi, \tau, \lambda_{01}, \alpha_{01}) - \nu < 0</math>, since the second moment should be greater than or equal to variance <math display="inline">\widetilde{\nu}</math>. The behavior of <math display="inline">\frac{\partial }{\partial \mu } \widetilde{\xi}(\mu, \omega, \nu, \tau, \lambda, \alpha)</math>, <math display="inline">\frac{\partial }{\partial \omega } \widetilde{\xi}(\mu, \omega, \nu, \tau, \lambda, \alpha)</math>, <math display="inline">\frac{\partial }{\partial \nu } \widetilde{\xi}(\mu, \omega, \nu, \tau, \lambda, \alpha)</math>, and <math display="inline">\frac{\partial }{\partial \tau } \widetilde{\xi}(\mu, \omega, \nu, \tau, \lambda, \alpha)</math> are used to find the bounds on <math display="inline">g(\mu, \omega, \xi, \tau, \lambda_{01}, \alpha_{01})</math> (see pages 9 - 13 in the supplementary materials). Again, the partial derivative terms were monotonic, which made it possible to find the upper bound at the board values. It was shown that the maximum value of <math display="inline">g</math> does not exceed <math display="inline">-0.0180173</math>.<br />
<br />
==== Theorem 3: Increasing Variance from Below ====<br />
<br />
'''Definition''': We consider <math display="inline">\lambda = \lambda_{01}</math>, <math display="inline">\alpha = \alpha_{01}</math>, and the domain <math display="inline">\Omega^-: -0.1 \leqslant \mu \leqslant 0.1</math> and <math display="inline">-0.1 \leqslant \omega \leqslant 0.1</math>. For the domain <math display="inline">0.02 \leqslant \nu \leqslant 0.16</math> and <math display="inline">0.8 \leqslant \tau \leqslant 1.25</math> as well as for the domain <math display="inline">0.02 \leqslant \nu \leqslant 0.24</math> and <math display="inline">0.9 \leqslant \tau \leqslant 1.25</math>, the mapping of the variance <math display="inline">\widetilde{\nu}(\mu, \omega, \nu, \tau, \lambda, \alpha)</math> increases: <math display="inline">\widetilde{\nu}(\mu, \omega, \nu, \tau, \lambda, \alpha) > \nu</math>.<br />
<br />
Theorem 3 states that the variance <math display="inline">\widetilde{\nu}</math> increases when variance is smaller than in <math display="inline">\Omega</math>. The lower bound on variance is <math display="inline">\widetilde{\nu} > 0.16</math> when <math display="inline">0.8 \leqslant \tau</math> and <math display="inline">\widetilde{\nu} > 0.24</math> when <math display="inline">0.9 \leqslant \tau</math> under the proposed mapping.<br />
<br />
'''Proof:''' According to the mean value theorem, for a <math display="inline">t \in [0, 1]</math>,<br />
<br />
[[File:th3.png|700px|center]]<br />
<br />
Similar to the proof for Theorem 2 (except we are interested in the smallest <math display="inline">\widetilde{\nu}</math> instead of the biggest), the lower bound for <math display="inline">\frac{\partial }{\partial \nu} \widetilde{\xi}(\mu,\omega,\nu+t(\nu_{\mathrm{min}}-\nu),\tau,\lambda_{\rm 01},\alpha_{\rm 01})</math> can be derived, and substituted into the relationship <math display="inline">\widetilde{\nu} = \widetilde{\xi}(\mu,\omega,\nu,\tau,\lambda_{\rm 01},\alpha_{\rm 01}) - (\widetilde{\mu}(\mu,\omega,\nu,\tau,\lambda_{\rm 01},\alpha_{\rm 01}))^2</math>. The lower bound depends on <math display="inline">\tau</math> and <math display="inline">\nu</math>, and in the <math display="inline">\Omega^{-1}</math> listed, it is slightly above <math display="inline">\nu</math>.<br />
<br />
== Implementation Details ==<br />
<br />
=== Initialization ===<br />
<br />
Since SNNs have a fixed point at zero mean and unit variance for normalized weights <math>ω = \sum_{i = 1}^n w_i = 0 </math>and τ = ni=1 wi2 = 1 (see above), we initialize SNNs such that these constraints are fulfilled in expectation. We draw the weights from a Gaussian distribution with E(wi) = 0 and variance Var(wi) = 1/n. Uniform and truncated Gaussian distributions with these moments led to networks with similar behavior. The “MSRA initialization” is similar since it uses zero mean and variance 2/n to initialize the weights [14]. The additional factor 2 counters the effect of rectified linear units.<br />
<br />
<br />
As previously explained, SNNs work best when inputs to the network are standardized, and the weights are initialized with mean of 0 and variance of <math display="inline">\frac{1}{n}</math> to help converge to the fixed point <math display="inline">(\mu, \nu) = (0, 1)</math>.<br />
<br />
=== Dropout Technique ===<br />
<br />
The authors reason that regular dropout, randomly setting activations to 0 with probability <math display="inline">1 - q</math>, is not compatible with SELUs. This is because the low variance region in SELUs is at <math display="inline">\lim_{x \rightarrow -\infty} = -\lambda \alpha</math>, not 0. Contrast this with ReLUs, which work well with dropout since they have <math display="inline">\lim_{x \rightarrow -\infty} = 0</math> as the saturation region. Therefore, a new dropout technique for SELUs was needed, termed ''alpha dropout''.<br />
<br />
With alpha dropout, activations are randomly set to <math display="inline">-\lambda\alpha = \alpha'</math>, which for this paper corresponds to the constant <math display="inline">1.7581</math>, with probability <math display="inline">1 - q</math>.<br />
<br />
The updated mean and variance of the activations are now:<br />
\[ \mathrm{E}(xd + \alpha'(1 - d)) = \mu q + \alpha'(1 - q) \] <br />
<br />
and<br />
<br />
\[ \mathrm{Var}(xd + \alpha'(1 - d)) = q((1-q)(\alpha' - \mu)^2 + \nu) \]<br />
<br />
Activations need to be transformed (e.g. scaled) after dropout to maintain the same mean and variance. In regular dropout, conserving the mean and variance correlates to scaling activations by a factor of 1/q while training. To ensure that mean and variance are unchanged after alpha dropout, the authors used an affine transformation <math display="inline">a(xd + \alpha'(1 - d)) + b</math>, and solved for the values of <math display="inline">a</math> and <math display="inline">b</math> to give <math display="inline">a = (\frac{\nu}{q((1-q)(\alpha' - \mu)^2 + \nu)})^{\frac{1}{2}}</math> and <math display="inline">b = \mu - a(q\mu + (1-q)\alpha'))</math>. As the values for <math display="inline">\mu</math> and <math display="inline">\nu</math> are set to <math display="inline">0</math> and <math display="inline">1</math> throughout the paper, these expressions can be simplified into <math display="inline">a = (q + \alpha'^2 q(1-q))^{-\frac{1}{2}}</math> and <math display="inline">b = -(q + \alpha^2 q (1-q))^{-\frac{1}{2}}((1 - q)\alpha')</math>, where <math display="inline">\alpha' \approx 1.7581</math>.<br />
<br />
Empirically, the authors found that dropout rates (1-q) of <math display="inline">0.05</math> or <math display="inline">0.10</math> worked well with SNNs.<br />
<br />
=== Optimizers ===<br />
<br />
Through experiments, the authors found that stochastic gradient descent, momentum, Adadelta and Adamax work well on SNNs. For Adam, configuration parameters <math display="inline">\beta_2 = 0.99</math> and <math display="inline">\epsilon = 0.01</math> were found to be more effective.<br />
<br />
==Experimental Results==<br />
<br />
Three sets of experiments were conducted to compare the performance of SNNs to six other FNN structures and to other machine learning algorithms, such as support vector machines and random forests. The experiments were carried out on (1) 121 UCI Machine Learning Repository datasets, (2) the Tox21 chemical compounds toxicity effects dataset (with 12,000 compounds and 270,000 features), and (3) the HTRU2 dataset of statistics on radio wave signals from pulsar candidates (with 18,000 observations and eight features). In each set of experiment, hyperparameter search was conducted on a validation set to select parameters such as the number of hidden units, number of hidden layers, learning rate, regularization parameter, and dropout rate (see pages 95 - 107 of the supplementary material for exact hyperparameters considered). Whenever models of different setups gave identical results on the validation data, preference was given to the structure with more layers, lower learning rate and higher dropout rate.<br />
<br />
The six FNN structures considered were: (1) FNNs with ReLU activations, no normalization and “Microsoft weight initialization” (MSRA) [5] to control the variance of input signals [5]; (2) FNNs with batch normalization [6], in which normalization is applied to activations of the same mini-batch; (3) FNNs with layer normalization [1], in which normalization is applied on a per layer basis for each training example; (4) FNNs with weight normalization [8], whereby each layer’s weights are normalized by learning the weight’s magnitude and direction instead of the weight vector itself; (5) highway networks, in which layers are not restricted to being sequentially connected [9]; and (6) an FNN-version of residual networks [4], with residual blocks made up of two or three densely connected layers.<br />
<br />
On the Tox21 dataset, the authors demonstrated the self-normalizing effect by comparing the distribution of neural inputs <math display="inline">z</math> at initialization and after 40 epochs of training to that of the standard normal. As Figure 3 show, the distribution of <math display="inline">z</math> remained similar to a normal distribution.<br />
<br />
[[File:snnf3.png|600px]]<br />
<br />
On all three sets of classification tasks, the authors demonstrated that SNN outperformed the other FNN counterparts on accuracy and AUC measures. <br />
<br />
SNN achieves close to the state-of-the-art results on the Tox21 dataset with an 8-layer network. The challenge requires prediction of toxic effects of 12000 chemicals based on their chemical structures. SNN with 8 layers had the best performance.<br />
<br />
[[File:tox21.png|600px]]<br />
<br />
<br />
On UCI datasets with fewer than 1,000 observations, SNNs did not outperform SVMs or random forests in terms of average rank in accuracy, but on datasets with at least 1,000 observations, SNNs showed the best overall performance (average rank of 5.8, compared to 6.1 for support vector machines and 6.6 for random forests). Through hyperparameter tuning, it was also discovered that the average depth of FNNs is 10.8 layers, more than the other FNN architectures tried.<br />
<br />
<br />
<br />
In pulsar predictions with the HTRU dataset SNNs produced a new state-of-the-art AUC by a small margin (achieving an AUC 0.98, averaged over 10 cross-validation folds, versus the previous record of 0.976).<br />
[[File:htru2.png|600px]]<br />
<br />
==Future Work==<br />
<br />
Although not the focus of this paper, the authors also briefly noted that their initial experiments with applying SELUs on relatively simple CNN structures showed promising results, which is not surprising given that ELUs, which do not have the self-normalizing property, has already been shown to work well with CNNs, demonstrating faster convergence than ReLU networks and even pushing the state-of-the-art error rates on CIFAR-100 at the time of publishing in 2015 [3].<br />
<br />
Since the paper was published, SELUs have been adopted by several researchers, not just with FNNs [https://github.com/bioinf-jku/SNNs see link], but also with CNNs, GANs, autoencoders, reinforcement learning and RNNs. In a few cases, researchers for those papers concluded that networks trained with SELUs converged faster than those trained with ReLUs, and that SELUs have the same convergence quality as batch normalization. There is potential for SELUs to be incorporated into more architectures in the future.<br />
<br />
==Critique==<br />
<br />
Overall, the authors presented a convincing case for using SELUs (along with proper initialization and alpha dropout) on FNNs. FNNs trained with SELU have more layers than those with other normalization techniques, so the work here provides a promising direction for making traditional FNNs more powerful. There are not as many well-established benchmark datasets to evaluate FNNs, but the experiments carried out, particularly on the larger Tox21 dataset, showed that SNNs can be very effective at classification tasks.<br />
<br />
The proofs provide a satisfactory explanation for why SELUs have a self-normalising property within the specified domain, but during their introduction the authors give 4 criteria that an activation function must satisfy in order to be self-normalising. Those criteria make intuitive sense, but there is a lack of firm justification which creates some confusion. For example, they state that SNNs cannot be derived from tanh units, even though <math> tanh(\lambda x) </math> satisfies all 4 criteria if <math> \lambda </math> is larger than 1. Assuming the authors did not overlook such a simple modification, there must be some additional criteria for an activation function to have a self-normalising property.<br />
<br />
The only question I have with the proofs is the lack of explanation for how the domains, <math display="inline">\Omega</math>, <math display="inline">\Omega^-</math> and <math display="inline">\Omega^+</math> are determined, which is an important consideration because they are used for deriving the upper and lower bounds on expressions needed for proving the three theorems. The ranges appear somewhat set through trial-and-error and heuristics to ensure the numbers work out (e.g. make the spectral norm [12] of <math display="inline">\mathcal{J}</math> as large as can be below 1 so as to ensure <math display="inline">g</math> is a contraction mapping), so it is not clear whether they are unique conditions, or that the parameters will remain within those prespecified ranges throughout training; and if the parameters can stray away from the ranges provided, then the issue of what will happen to the self-normalizing property was not addressed. Perhaps that is why the authors gave preference to models with a deeper structure and smaller learning rate during experiments to help the parameters stay within their domains. Further, in addition to the hyperparameters considered, it would be helpful to know the final values that went into the best-performing models, for a better understanding of what range of values work better for SNNs empirically.<br />
<br />
==Conclusion==<br />
<br />
The SNN structure proposed in this paper is built on the traditional FNN structure with a few modifications, including the use of SELUs as the activation function (with <math display="inline">\lambda \approx 1.0507</math> and <math display="inline">\alpha \approx 1.6733</math>), alpha dropout, network weights initialized with mean of zero and variance <math display="inline">\frac{1}{n}</math>, and inputs normalized to mean of zero and variance of one. It is simple to implement while being backed up by detailed theory. <br />
<br />
When properly initialized, SELUs will draw neural inputs towards a fixed point of zero mean and unit variance as the activations are propagated through the layers. The self-normalizing property is maintained even when weights deviate from their initial values during training (under mild conditions). When the variance of inputs goes beyond the prespecified range imposed, they are still bounded above and below so SNNs do not suffer from exploding and vanishing gradients. This self-normalizing property allows SNNs to be more robust to perturbations in stochastic gradient descent, so deeper structures with better prediction performance can be built. <br />
<br />
In the experiments conducted, the authors demonstrated that SNNs outperformed FNNs trained with other normalization techniques, such as batch, layer and weight normalization, and specialized architectures, such as highway or residual networks, on several classification tasks, including on the UCI Machine Learning Repository datasets. SELUs help in reducing the computation time for normalizing the network relative to RELU+BN and hence is promising. The adoption of SELUs by other researchers also lends credence to the potential for SELUs to be implemented in more neural network architectures.<br />
<br />
==References==<br />
<br />
# Ba, Kiros and Hinton. "Layer Normalization". arXiv:1607.06450. (2016).<br />
# Blinn. "Consider the Lowly 2X2 Matrix." IEEE Computer Graphics and Applications. (1996).<br />
# Clevert, Unterthiner, Hochreiter. "Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)." arXiv: 1511.07289. (2015).<br />
# He, Zhang, Ren and Sun. "Deep Residual Learning for Image Recognition." arXiv:1512.03385. (2015).<br />
# He, Zhang, Ren and Sun. "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification." arXiv:1502.01852. (2015). <br />
# Ioffe and Szegedy. "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariance Shift." arXiv:1502.03167. (2015).<br />
# Klambauer, Unterthiner, Mayr and Hochreiter. "Self-Normalizing Neural Networks." arXiv: 1706.02515. (2017).<br />
# Salimans and Kingma. "Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks." arXiv:1602.07868. (2016).<br />
# Srivastava, Greff and Schmidhuber. "Highway Networks." arXiv:1505.00387 (2015).<br />
# Unterthiner, Mayr, Klambauer and Hochreiter. "Toxicity Prediction Using Deep Learning." arXiv:1503.01445. (2015). <br />
# https://en.wikipedia.org/wiki/Central_limit_theorem <br />
# http://mathworld.wolfram.com/SpectralNorm.html <br />
# https://www.math.umd.edu/~petersd/466/fixedpoint.pdf<br />
<br />
==Online Resources==<br />
https://github.com/bioinf-jku/SNNs (GitHub repository maintained by some of the paper's authors)<br />
<br />
==Footnotes==<br />
<br />
1. Error propagation analysis: The authors performed an error analysis to quantify the potential numerical imprecisions propagated through the numerous operations performed. The potential imprecision <math display="inline">\epsilon</math> was quantified by applying the mean value theorem<br />
<br />
\[ |f(x + \Delta x - f(x)| \leqslant ||\triangledown f(x + t\Delta x|| ||\Delta x|| \textrm{ for } t \in [0, 1]\textrm{.} \] <br />
<br />
The error propagation rules, or <math display="inline">|f(x + \Delta x - f(x)|</math>, was first obtained for simple operations such as addition, subtraction, multiplication, division, square root, exponential function, error function and complementary error function. Them, the error bounds on the compound terms making up <math display="inline">\Delta (S(\mu, \omega, \nu, \tau, \lambda, \alpha)</math> were found by decomposing them into the simpler expressions. If each of the variables have a precision of <math display="inline">\epsilon</math>, then it turns out <math display="inline">S</math> has a precision better than <math display="inline">292\epsilon</math>. For a machine with a precision of <math display="inline">2^{-56}</math>, the rounding error is <math display="inline">\epsilon \approx 10^{-16}</math>, and <math display="inline">292\epsilon < 10^{-13}</math>. In addition, all computations are correct up to 3 ulps (“unit in last place”) for the hardware architectures and GNU C library used, with 1 ulp being the highest precision that can be achieved.<br />
<br />
2. Independence Assumption: The classic definition of central limit theorem requires <math display="inline">x_i</math>’s to be independent and identically distributed, which is not guaranteed to hold true in a neural network layer. However, according to the Lyapunov CLT, the <math display="inline">x_i</math>’s do not need to be identically distributed as long as the <math display="inline">(2 + \delta)</math>th moment exists for the variables and meet the Lyapunov condition for the rate of growth of the sum of the moments [11]. In addition, CLT has also shown to be valid under weak dependence under mixing conditions [11]. Therefore, the authors argue that the central limit theorem can be applied with network inputs.<br />
<br />
3. <math display="inline">\mathcal{H}</math> versus <math display="inline">\mathcal{J}</math> Jacobians: In solving for the largest singular value of the Jacobian <math display="inline">\mathcal{H}</math> for the mapping <math display="inline">g: (\mu, \nu)</math>, the authors first worked with the terms in the Jacobian <math display="inline">\mathcal{J}</math> for the mapping <math display="inline">h: (\mu, \nu) \rightarrow (\widetilde{\mu}, \widetilde{\xi})</math> instead, because the influence of <math display="inline">\widetilde{\mu}</math> on <math display="inline">\widetilde{\nu}</math> is small when <math display="inline">\widetilde{\mu}</math> is small in <math display="inline">\Omega</math> and <math display="inline">\mathcal{H}</math> can be easily expressed as terms in <math display="inline">\mathcal{J}</math>. <math display="inline">\mathcal{J}</math> was referenced in the paper, but I used <math display="inline">\mathcal{H}</math> in the summary here to avoid confusion.</div>F7xiahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Self_Normalizing_Neural_Networks&diff=36441stat946w18/Self Normalizing Neural Networks2018-04-21T03:23:22Z<p>F7xia: /* Initialization */</p>
<hr />
<div>==Introduction and Motivation==<br />
<br />
While neural networks have been making a lot of headway in improving benchmark results and narrowing the gap with human-level performance, success has been fairly limited to visual and sequential processing tasks through advancements in convolutional network and recurrent network structures. Most data science competitions outside of these domains are still outperformed by algorithms such as gradient boosting and random forests. The traditional (densely connected) feed-forward neural networks (FNNs) are rarely used competitively, and when they do win on rare occasions, they have very shallow network architectures with just up to four layers [10].<br />
<br />
The authors, Klambauer et al., believe that what prevents FNNs from becoming more competitively useful is the inability to train a deeper FNN structure, which would allow the network to learn more levels of abstract representations. Primarily, the difficulty arises due to the instability of gradients in very deep FNNs leading to problems like gradient vanishing/explosion. To have a deeper network, oscillations in the distribution of activations need to be kept under control so that stable gradients can be obtained during training. Several techniques are available to normalize activations, including batch normalization [6], layer normalization [1] and weight normalization [8]. These methods work well with CNNs and RNNs, but not so much with FNNs because backpropagating through normalization parameters introduces additional variance to the gradients, and regularization techniques like dropout further perturb the normalization effect. CNNs and RNNs are less sensitive to such perturbations, presumably due to their weight sharing architecture, but FNNs do not have such a property and thus suffer from high variance in training errors, which hinders learning. Furthermore, the aforementioned normalization techniques involve adding external layers to the model and can slow down computation, which may already be slow when working with very deep FNNs. <br />
<br />
Therefore, the authors were motivated to develop a new FNN implementation that can achieve the intended effect of normalization techniques that works well with stochastic gradient descent and dropout. Self-normalizing neural networks (SNNs) are based on the idea of scaled exponential linear units (SELU), a new activation function introduced in this paper, whose output distribution is proved to converge to a fixed point, thus making it possible to train deeper networks.<br />
<br />
==Notations==<br />
<br />
As the paper (primarily in the supplementary materials) comes with lengthy proofs, important notations are listed first.<br />
<br />
Consider two fully-connected layers, let <math display="inline">x</math> denote the inputs to the second layer, then <math display="inline">z = Wx</math> represents the network inputs of the second layer, and <math display="inline">y = f(z)</math> represents the activations in the second layer.<br />
<br />
Assume that all activations from a lower layer <math display="inline">x_i</math>'s, <math display="inline">1 \leqslant i \leqslant n</math>, have the same mean <math display="inline">\mu := \mathrm{E}(x_i)</math> and the same variance <math display="inline">\nu := \mathrm{Var}(x_i)</math> and that each <math display="inline">y</math> has mean <math display="inline">\widetilde{\mu} := \mathrm{E}(y)</math> and variance <math display="inline">\widetilde{\nu} := \mathrm{Var}(y)</math>, then let <math display="inline">g</math> be the set of functions that maps <math display="inline">(\mu, \nu)</math> to <math display="inline">(\widetilde{\mu}, \widetilde{\nu})</math>. <br />
<br />
For the weight vector <math display="inline">w</math>, <math display="inline">n</math> times the mean of the weight vector is <math display="inline">\omega := \sum_{i = 1}^n \omega_i</math> and <math display="inline">n</math> times the second moment is <math display="inline">\tau := \sum_{i = 1}^{n} w_i^2</math>.<br />
<br />
==Key Concepts==<br />
<br />
===Self-Normalizing Neural-Net (SNN)===<br />
<br />
''A neural network is self-normalizing if it possesses a mapping <math display="inline">g: \Omega \rightarrow \Omega</math> for each activation <math display="inline">y</math> that maps mean and variance from one layer to the next and has a stable and attracting fixed point depending on <math display="inline">(\omega, \tau)</math> in <math display="inline">\Omega</math>. Furthermore, the mean and variance remain in the domain <math display="inline">\Omega</math>, that is <math display="inline">g(\Omega) \subseteq \Omega</math>, where <math display="inline">\Omega = \{ (\mu, \nu) | \mu \in [\mu_{min}, \mu_{max}], \nu \in [\nu_{min}, \nu_{max}] \}</math>. When iteratively applying the mapping <math display="inline">g</math>, each point within <math display="inline">\Omega</math> converges to this fixed point.''<br />
<br />
In other words, in SNNs, if the inputs from an earlier layer (<math display="inline">x</math>) already have their mean and variance within a predefined interval <math display="inline">\Omega</math>, then the activations to the next layer (<math display="inline">y = f(z = Wx)</math>) should remain within those intervals. This is true across all pairs of connecting layers as the normalizing effect gets propagated through the network, hence why the term self-normalizing. When the mapping is applied iteratively, it should draw the mean and variance values closer to a fixed point within <math display="inline">\Omega</math>, the value of which depends on <math display="inline">\omega</math> and <math display="inline">\tau</math> (recall that they are from the weight vector).<br />
<br />
We will design a FNN then construct a g that takes the mean and variance of each layer to those of the next and is a contraction mapping i.e. <math>g(\mu_i, \nu_i) = (\mu_{i+1}, \nu_{i+1}) \forall i </math>. It should be noted that although the g required in the SNN definition depends on <math display="inline">(\omega, \tau)</math> of an individual layer, the FNN that we construct will have the same values of <math display="inline">(\omega, \tau)</math> for each layer. Intuitively this definition can be interpreted as saying that the mean and variance of the final layer of a sufficiently deep SNN will not change when the mean and variance of the input data change. This is because the mean and variance are passing through a contraction mapping at each layer, converging to the mapping's fixed point.<br />
<br />
The activation function that makes an SNN possible should meet the following four conditions:<br />
<br />
# It can take on both negative and positive values, so it can normalize the mean;<br />
# It has a saturation region, so it can dampen variances that are too large;<br />
# It has a slope larger than one, so it can increase variances that are too small; and<br />
# It is a continuous curve, which is necessary for the fixed point to exist (see the definition of Banach fixed point theorem to follow).<br />
<br />
Commonly used activation functions such as rectified linear units (ReLU), sigmoid, tanh, leaky ReLUs and exponential linear units (ELUs) do not meet all four criteria, therefore, a new activation function is needed.<br />
<br />
===Scaled Exponential Linear Units (SELUs)===<br />
<br />
One of the main ideas introduced in this paper is the SELU function. As the name suggests, it is closely related to ELU [3],<br />
<br />
\[ \mathrm{elu}(x) = \begin{cases} x & x > 0 \\<br />
\alpha e^x - \alpha & x \leqslant 0<br />
\end{cases} \]<br />
<br />
but further builds upon it by introducing a new scale parameter $\lambda$ and proving the exact values that $\alpha$ and $\lambda$ should take on to achieve self-normalization. SELU is defined as:<br />
<br />
\[ \mathrm{selu}(x) = \lambda \begin{cases} x & x > 0 \\<br />
\alpha e^x - \alpha & x \leqslant 0<br />
\end{cases} \]<br />
<br />
SELUs meet all four criteria listed above - it takes on positive values when <math display="inline">x > 0</math> and negative values when <math display="inline">x < 0</math>, it has a saturation region when <math display="inline">x</math> is a larger negative value, the value of <math display="inline">\lambda</math> can be set to greater than one to ensure a slope greater than one, and it is continuous at <math display="inline">x = 0</math>. <br />
<br />
Figure 1 below gives an intuition for how SELUs normalize activations across layers. As shown, a variance dampening effect occurs when inputs are negative and far away from zero, and a variance increasing effect occurs when inputs are close to zero.<br />
<br />
[[File:snnf1.png|500px]]<br />
<br />
Figure 2 below plots the progression of training error on the MNIST and CIFAR10 datasets when training with SNNs versus FNNs with batch normalization at varying model depths. As shown, FNNs that adopted the SELU activation function exhibited lower and less variable training loss compared to using batch normalization, even as the depth increased to 16 and 32 layers.<br />
<br />
[[File:snnf2.png|600px]]<br />
<br />
=== Banach Fixed Point Theorem and Contraction Mappings ===<br />
<br />
The underlying theory behind SNNs is the Banach fixed point theorem, which states the following: ''Let <math display="inline">(X, d)</math> be a non-empty complete metric space with a contraction mapping <math display="inline">f: X \rightarrow X</math>. Then <math display="inline">f</math> has a unique fixed point <math display="inline">x_f \subseteq X</math> with <math display="inline">f(x_f) = x_f</math>. Every sequence <math display="inline">x_n = f(x_{n-1})</math> with starting element <math display="inline">x_0 \subseteq X</math> converges to the fixed point: <math display="inline">x_n \underset{n \rightarrow \infty}\rightarrow x_f</math>.''<br />
<br />
A contraction mapping is a function <math display="inline">f: X \rightarrow X</math> on a metric space <math display="inline">X</math> with distance <math display="inline">d</math>, such that for all points <math display="inline">\mathbf{u}</math> and <math display="inline">\mathbf{v}</math> in <math display="inline">X</math>: <math display="inline">d(f(\mathbf{u}), f(\mathbf{v})) \leqslant \delta d(\mathbf{u}, \mathbf{v})</math>, for a <math display="inline">0 \leqslant \delta \leqslant 1</math>.<br />
<br />
The easiest way to prove a contraction mapping is usually to show that the spectral norm [12] of its Jacobian is less than 1 [13], as was done for this paper.<br />
<br />
==Proving the Self-Normalizing Property==<br />
<br />
===Mean and Variance Mapping Function===<br />
<br />
<math display="inline">g</math> is derived under the assumption that <math display="inline">x_i</math>'s are independent but not necessarily having the same mean and variance [[#Footnotes |(2)]]. Under this assumption (and recalling earlier notation of <math display="inline">\omega</math> and <math display="inline">\tau</math>),<br />
<br />
\begin{align}<br />
\mathrm{E}(z = \mathbf{w}^T \mathbf{x}) = \sum_{i = 1}^n w_i \mathrm{E}(x_i) = \mu \omega<br />
\end{align}<br />
<br />
\begin{align}<br />
\mathrm{Var}(z) = \mathrm{Var}(\sum_{i = 1}^n w_i x_i) = \sum_{i = 1}^n w_i^2 \mathrm{Var}(x_i) = \nu \sum_{i = 1}^n w_i^2 = \nu\tau \textrm{ .}<br />
\end{align}<br />
<br />
When the weight terms are normalized, <math display="inline">z</math> can be viewed as a weighted sum of <math display="inline">x_i</math>'s. Wide neural net layers with a large number of nodes is common, so <math display="inline">n</math> is usually large, and by the Central Limit Theorem, <math display="inline">z</math> approaches a normal distribution <math display="inline">\mathcal{N}(\mu\omega, \sqrt{\nu\tau})</math>. <br />
<br />
Using the above property, the exact form for <math display="inline">g</math> can be obtained using the definitions for mean and variance of continuous random variables: <br />
<br />
[[File:gmapping.png|600px|center]]<br />
<br />
Analytical solutions for the integrals can be obtained as follows: <br />
<br />
[[File:gintegral.png|600px|center]]<br />
<br />
The authors are interested in the fixed point <math display="inline">(\mu, \nu) = (0, 1)</math> as these are the parameters associated with the common standard normal distribution. The authors also proposed using normalized weights such that <math display="inline">\omega = \sum_{i = 1}^n = 0</math> and <math display="inline">\tau = \sum_{i = 1}^n w_i^2= 1</math> as it gives a simpler, cleaner expression for <math display="inline">\widetilde{\mu}</math> and <math display="inline">\widetilde{\nu}</math> in the calculations in the next steps. This weight scheme can be achieved in several ways, for example, by drawing from a normal distribution <math display="inline">\mathcal{N}(0, \frac{1}{n})</math> or from a uniform distribution <math display="inline">U(-\sqrt{3}, \sqrt{3})</math>.<br />
<br />
At <math display="inline">\widetilde{\mu} = \mu = 0</math>, <math display="inline">\widetilde{\nu} = \nu = 1</math>, <math display="inline">\omega = 0</math> and <math display="inline">\tau = 1</math>, the constants <math display="inline">\lambda</math> and <math display="inline">\alpha</math> from the SELU function can be solved for - <math display="inline">\lambda_{01} \approx 1.0507</math> and <math display="inline">\alpha_{01} \approx 1.6733</math>. These values are used throughout the rest of the paper whenever an expression calls for <math display="inline">\lambda</math> and <math display="inline">\alpha</math>.<br />
<br />
===Details of Moment-Mapping Integrals ===<br />
Consider the moment-mapping integrals:<br />
\begin{align}<br />
\widetilde{\mu} & = \int_{-\infty}^\infty \mathrm{selu} (z) p_N(z; \mu \omega, \sqrt{\nu \tau})dz\\<br />
\widetilde{\nu} & = \int_{-\infty}^\infty \mathrm{selu} (z)^2 p_N(z; \mu \omega, \sqrt{\nu \tau}) dz-\widetilde{\mu}^2.<br />
\end{align}<br />
<br />
The equation for <math display="inline">\widetilde{\mu}</math> can be expanded as <br />
\begin{align}<br />
\widetilde{\mu} & = \frac{\lambda}{2}\left( 2\alpha\int_{-\infty}^0 (\exp(z)-1) p_N(z; \mu \omega, \sqrt{\nu \tau})dz +2\int_{0}^\infty z p_N(z; \mu \omega, \sqrt{\nu \tau})dz \right)\\<br />
&= \frac{\lambda}{2}\left( 2 \alpha \frac{1}{\sqrt{2\pi\tau\nu}} \int_{-\infty}^0 (\exp(z)-1) \exp(\frac{-1}{2\tau \nu} (z-\mu \omega)^2 ) dz +2\frac{1}{\sqrt{2\pi\tau\nu}}\int_{0}^\infty z \exp(\frac{-1}{2\tau \nu} (z-\mu \omega)^2 dz \right)\\<br />
&= \frac{\lambda}{2}\left( 2 \alpha\frac{1}{\sqrt{2\pi\tau\nu}}\int_{-\infty}^0 \exp(z) \exp(\frac{-1}{2\tau \nu} (z-\mu \omega)^2 ) dz - 2 \alpha\frac{1}{\sqrt{2\pi\tau\nu}}\int_{-\infty}^0 \exp(\frac{-1}{2\tau \nu} (z-\mu \omega)^2 ) dz +2\frac{1}{\sqrt{2\pi\tau\nu}}\int_{0}^\infty z \exp(\frac{-1}{2\tau \nu} (z-\mu \omega)^2 dz \right)\\<br />
\end{align}<br />
<br />
The first integral can be simplified via the substituiton<br />
\begin{align}<br />
q:= \frac{1}{\sqrt{2\tau \nu}}(z-\mu \omega -\tau \nu).<br />
\end{align}<br />
While the second and third can be simplified via the substitution<br />
\begin{align}<br />
q:= \frac{1}{\sqrt{2\tau \nu}}(z-\mu \omega ).<br />
\end{align}<br />
Using the definitions of <math display="inline">\mathrm{erf}</math> and <math display="inline">\mathrm{erfc}</math> then yields the result of the previous section.<br />
<br />
===Self-Normalizing Property Under Normalized Weights===<br />
<br />
Assuming the the weights normalized with <math display="inline">\omega=0</math> and <math display="inline">\tau=1</math>, it is possible to calculate the exact value for the spectral norm [12] of <math display="inline">g</math>'s Jacobian around the fixed point <math display="inline">(\mu, \nu) = (0, 1)</math>, which turns out to be <math display="inline">0.7877</math>. Thus, at initialization, SNNs have a stable and attracting fixed point at <math display="inline">(0, 1)</math>, which means that when <math display="inline">g</math> is applied iteratively to a pair <math display="inline">(\mu_{new}, \nu_{new})</math>, it should draw the points closer to <math display="inline">(0, 1)</math>. The rate of convergence is determined by the spectral norm [12], whose value depends on <math display="inline">\mu</math>, <math display="inline">\nu</math>, <math display="inline">\omega</math> and <math display="inline">\tau</math>.<br />
<br />
[[File:paper10_fig2.png|600px|frame|none|alt=Alt text|The figure illustrates, in the scenario described above, the mapping <math display="inline">g</math> of mean and variance <math display="inline">(\mu, \nu)</math> to <math display="inline">(\mu_{new}, \nu_{new})</math>. The arrows show the direction <math display="inline">(\mu, \nu)</math> is mapped by <math display="inline">g: (\mu, \nu)\mapsto(\mu_{new}, \nu_{new})</math>. One can clearly see the fixed point mapping <math display="inline">g</math> is at <math display="inline">(0, 1)</math>.]]<br />
<br />
===Self-Normalizing Property Under Unnormalized Weights===<br />
<br />
As weights are updated during training, there is no guarantee that they would remain normalized. The authors addressed this issue through the first key theorem presented in the paper, which states that a fixed point close to (0, 1) can still be obtained if <math display="inline">\mu</math>, <math display="inline">\nu</math>, <math display="inline">\omega</math> and <math display="inline">\tau</math> are restricted to a specified range. <br />
<br />
Additionally, there is no guarantee that the mean and variance of the inputs would stay within the range given by the first theorem, which led to the development of theorems #2 and #3. These two theorems established an upper and lower bound on the variance of inputs if the variance of activations from the previous layer are above or below the range specified, respectively. This ensures that the variance would not explode or vanish after being propagated through the network.<br />
<br />
The theorems come with lengthy proofs in the supplementary materials for the paper. High-level proof sketches are presented here.<br />
<br />
====Theorem 1: Stable and Attracting Fixed Points Close to (0, 1)====<br />
<br />
'''Definition:''' We assume <math display="inline">\alpha = \alpha_{01}</math> and <math display="inline">\lambda = \lambda_{01}</math>. We restrict the range of the variables to the domain <math display="inline">\mu \in [-0.1, 0.1]</math>, <math display="inline">\omega \in [-0.1, 0.1]</math>, <math display="inline">\nu \in [0.8, 1.5]</math>, and <math display="inline">\tau \in [0.9, 1.1]</math>. For <math display="inline">\omega = 0</math> and <math display="inline">\tau = 1</math>, the mapping has the stable fixed point <math display="inline">(\mu, \nu) = (0, 1</math>. For other <math display="inline">\omega</math> and <math display="inline">\tau</math>, g has a stable and attracting fixed point depending on <math display="inline">(\omega, \tau)</math> in the <math display="inline">(\mu, \nu)</math>-domain: <math display="inline">\mu \in [-0.03106, 0.06773]</math> and <math display="inline">\nu \in [0.80009, 1.48617]</math>. All points within the <math display="inline">(\mu, \nu)</math>-domain converge when iteratively applying the mapping to this fixed point.<br />
<br />
'''Proof:''' In order to show the the mapping <math display="inline">g</math> has a stable and attracting fixed point close to <math display="inline">(0, 1)</math>, the authors again applied Banach's fixed point theorem, which states that a contraction mapping on a nonempty complete metric space that does not map outside its domain has a unique fixed point, and that all points in the <math display="inline">(\mu, \nu)</math>-domain converge to the fixed point when <math display="inline">g</math> is iteratively applied. <br />
<br />
The two requirements are proven as follows:<br />
<br />
'''1. g is a contraction mapping.'''<br />
<br />
For <math display="inline">g</math> to be a contraction mapping in <math display="inline">\Omega</math> with distance <math display="inline">||\cdot||_2</math>, there must exist a Lipschitz constant <math display="inline">M < 1</math> such that: <br />
<br />
\begin{align} <br />
\forall \mu, \nu \in \Omega: ||g(\mu) - g(\nu)||_2 \leqslant M||\mu - \nu||_2 <br />
\end{align}<br />
<br />
As stated earlier, <math display="inline">g</math> is a contraction mapping if the spectral norm [12] of the Jacobian <math display="inline">\mathcal{H}</math> [[#Footnotes | (3)]] is below one, or equivalently, if the the largest singular value of <math display="inline">\mathcal{H}</math> is less than 1.<br />
<br />
To find the singular values of <math display="inline">\mathcal{H}</math>, the authors used an explicit formula derived by Blinn [2] for <math display="inline">2\times2</math> matrices, which states that the largest singular value of the matrix is <math display="inline">\frac{1}{2}(\sqrt{(a_{11} + a_{22}) ^ 2 + (a_{21} - a{12})^2} + \sqrt{(a_{11} - a_{22}) ^ 2 + (a_{21} + a{12})^2})</math>.<br />
<br />
For <math display="inline">\mathcal{H}</math>, an expression for the largest singular value of <math display="inline">\mathcal{H}</math>, made up of the first-order partial derivatives of the mapping <math display="inline">g</math> with respect to <math display="inline">\mu</math> and <math display="inline">\nu</math>, can be derived given the analytical solutions for <math display="inline">\widetilde{\mu}</math> and <math display="inline">\widetilde{\nu}</math> (and denoted <math display="inline">S(\mu, \omega, \nu, \tau, \lambda, \alpha)</math>).<br />
<br />
From the mean value theorem, we know that for a <math display="inline">t \in [0, 1]</math>, <br />
<br />
[[File:seq.png|600px|center]]<br />
<br />
Therefore, the distance of the singular value at <math display="inline">S(\mu, \omega, \nu, \tau, \lambda_{\mathrm{01}}, \alpha_{\mathrm{01}})</math> and at <math display="inline">S(\mu + \Delta\mu, \omega + \Delta\omega, \nu + \Delta\nu, \tau \Delta\tau, \lambda_{\mathrm{01}}, \alpha_{\mathrm{01}})</math> can be bounded above by <br />
<br />
[[File:seq2.png|600px|center]]<br />
<br />
An upper bound was obtained for each partial derivative term above, mainly through algebraic reformulations and by making use of the fact that many of the functions are monotonically increasing or decreasing on the variables they depend on in <math display="inline">\Omega</math> (see pages 17 - 25 in the supplementary materials).<br />
<br />
The <math display="inline">\Delta</math> terms were then set (rather arbitrarily) to be: <math display="inline">\Delta \mu=0.0068097371</math>,<br />
<math display="inline">\Delta \omega=0.0008292885</math>, <math display="inline">\Delta \nu=0.0009580840</math>, and <math display="inline">\Delta \tau=0.0007323095</math>. Plugging in the upper bounds on the absolute values of the derivative terms for <math display="inline">S</math> and the <math display="inline">\Delta</math> terms yields<br />
<br />
\[ S(\mu + \Delta \mu,\omega + \Delta \omega,\nu + \Delta \nu,\tau + \Delta \tau,\lambda_{\rm 01},\alpha_{\rm 01}) - S(\mu,\omega,\nu,\tau,\lambda_{\rm 01},\alpha_{\rm 01}) < 0.008747 \]<br />
<br />
Next, the largest singular value is found from a computer-assisted fine grid-search [[#Footnotes | (1)]] over the domain <math display="inline">\Omega</math>, with grid lengths <math display="inline">\Delta \mu=0.0068097371</math>, <math display="inline">\Delta \omega=0.0008292885</math>, <math display="inline">\Delta \nu=0.0009580840</math>, and <math display="inline">\Delta \tau=0.0007323095</math>, which turned out to be <math display="inline">0.9912524171058772</math>. Therefore, <br />
<br />
\[ S(\mu + \Delta \mu,\omega + \Delta \omega,\nu + \Delta \nu,\tau + \Delta \tau,\lambda_{\rm 01},\alpha_{\rm 01}) \leq 0.9912524171058772 + 0.008747 < 1 \]<br />
<br />
Since the largest singular value is smaller than 1, <math display="inline>g</math> is a contraction mapping.<br />
<br />
'''2. g does not map outside its domain.'''<br />
<br />
To prove that <math display="inline">g</math> does not map outside of the domain <math display="inline">\mu \in [-0.1, 0.1]</math> and <math display="inline">\nu \in [0.8, 1.5]</math>, lower and upper bounds on <math display="inline">\widetilde{\mu}</math> and <math display="inline">\widetilde{\nu}</math> were obtained to show that they stay within <math display="inline">\Omega</math>. <br />
<br />
First, it was shown that the derivatives of <math display="inline">\widetilde{\mu}</math> and <math display="inline">\widetilde{\xi}</math> with respect to <math display="inline">\mu</math> and <math display="inline">\nu</math> are either positive or have the sign of <math display="inline">\omega</math> in <math display="inline">\Omega</math>, so the minimum and maximum points are found at the borders. In <math display="inline">\Omega</math>, it then follows that<br />
<br />
\begin{align}<br />
-0.03106 <\widetilde{\mu}(-0.1,0.1, 0.8, 0.95, \lambda_{\rm 01}, \alpha_{\rm 01}) \leq & \widetilde{\mu} \leq \widetilde{\mu}(0.1,0.1,1.5, 1.1, \lambda_{\rm 01}, \alpha_{\rm 01}) < 0.06773<br />
\end{align}<br />
<br />
and <br />
<br />
\begin{align}<br />
0.80467 <\widetilde{\xi}(-0.1,0.1, 0.8, 0.95, \lambda_{\rm 01}, \alpha_{\rm 01}) \leq & \widetilde{\xi} \leq \widetilde{\xi}(0.1,0.1,1.5, 1.1, \lambda_{\rm 01}, \alpha_{\rm 01}) < 1.48617.<br />
\end{align}<br />
<br />
Since <math display="inline">\widetilde{\nu} = \widetilde{\xi} - \widetilde{\mu}^2</math>, <br />
<br />
\begin{align}<br />
0.80009 & \leqslant \widetilde{\nu} \leqslant 1.48617<br />
\end{align}<br />
<br />
The bounds on <math display="inline">\widetilde{\mu}</math> and <math display="inline">\widetilde{\nu}</math> are narrower than those for <math display="inline">\mu</math> and <math display="inline">\nu</math> set out in <math display="inline">\Omega</math>, therefore <math display="inline">g(\Omega) \subseteq \Omega</math>.<br />
<br />
==== Theorem 2: Decreasing Variance from Above ====<br />
<br />
'''Definition:''' For <math display="inline">\lambda = \lambda_{01}</math>, <math display="inline">\alpha = \alpha_{01}</math>, and the domain <math display="inline">\Omega^+: -1 \leqslant \mu \leqslant 1, -0.1 \leqslant \omega \leqslant 0.1, 3 \leqslant \nu \leqslant 16</math>, and <math display="inline">0.8 \leqslant \tau \leqslant 1.25</math>, we have for the mapping of the variance <math display="inline">\widetilde{\nu}(\mu, \omega, \nu, \tau, \lambda, \alpha)</math> under <math display="inline">g</math>: <math display="inline">\widetilde{\nu}(\mu, \omega, \nu, \tau, \lambda, \alpha) < \nu</math>.<br />
<br />
Theorem 2 states that when <math display="inline">\nu \in [3, 16]</math>, the mapping <math display="inline">g</math> draws it to below 3 when applied across layers, thereby establishing an upper bound of <math display="inline">\nu < 3</math> on variance.<br />
<br />
'''Proof:''' The authors proved the inequality by showing that <math display="inline">g(\mu, \omega, \xi, \tau, \lambda_{01}, \alpha_{01}) = \widetilde{\xi}(\mu, \omega, \xi, \tau, \lambda_{01}, \alpha_{01}) - \nu < 0</math>, since the second moment should be greater than or equal to variance <math display="inline">\widetilde{\nu}</math>. The behavior of <math display="inline">\frac{\partial }{\partial \mu } \widetilde{\xi}(\mu, \omega, \nu, \tau, \lambda, \alpha)</math>, <math display="inline">\frac{\partial }{\partial \omega } \widetilde{\xi}(\mu, \omega, \nu, \tau, \lambda, \alpha)</math>, <math display="inline">\frac{\partial }{\partial \nu } \widetilde{\xi}(\mu, \omega, \nu, \tau, \lambda, \alpha)</math>, and <math display="inline">\frac{\partial }{\partial \tau } \widetilde{\xi}(\mu, \omega, \nu, \tau, \lambda, \alpha)</math> are used to find the bounds on <math display="inline">g(\mu, \omega, \xi, \tau, \lambda_{01}, \alpha_{01})</math> (see pages 9 - 13 in the supplementary materials). Again, the partial derivative terms were monotonic, which made it possible to find the upper bound at the board values. It was shown that the maximum value of <math display="inline">g</math> does not exceed <math display="inline">-0.0180173</math>.<br />
<br />
==== Theorem 3: Increasing Variance from Below ====<br />
<br />
'''Definition''': We consider <math display="inline">\lambda = \lambda_{01}</math>, <math display="inline">\alpha = \alpha_{01}</math>, and the domain <math display="inline">\Omega^-: -0.1 \leqslant \mu \leqslant 0.1</math> and <math display="inline">-0.1 \leqslant \omega \leqslant 0.1</math>. For the domain <math display="inline">0.02 \leqslant \nu \leqslant 0.16</math> and <math display="inline">0.8 \leqslant \tau \leqslant 1.25</math> as well as for the domain <math display="inline">0.02 \leqslant \nu \leqslant 0.24</math> and <math display="inline">0.9 \leqslant \tau \leqslant 1.25</math>, the mapping of the variance <math display="inline">\widetilde{\nu}(\mu, \omega, \nu, \tau, \lambda, \alpha)</math> increases: <math display="inline">\widetilde{\nu}(\mu, \omega, \nu, \tau, \lambda, \alpha) > \nu</math>.<br />
<br />
Theorem 3 states that the variance <math display="inline">\widetilde{\nu}</math> increases when variance is smaller than in <math display="inline">\Omega</math>. The lower bound on variance is <math display="inline">\widetilde{\nu} > 0.16</math> when <math display="inline">0.8 \leqslant \tau</math> and <math display="inline">\widetilde{\nu} > 0.24</math> when <math display="inline">0.9 \leqslant \tau</math> under the proposed mapping.<br />
<br />
'''Proof:''' According to the mean value theorem, for a <math display="inline">t \in [0, 1]</math>,<br />
<br />
[[File:th3.png|700px|center]]<br />
<br />
Similar to the proof for Theorem 2 (except we are interested in the smallest <math display="inline">\widetilde{\nu}</math> instead of the biggest), the lower bound for <math display="inline">\frac{\partial }{\partial \nu} \widetilde{\xi}(\mu,\omega,\nu+t(\nu_{\mathrm{min}}-\nu),\tau,\lambda_{\rm 01},\alpha_{\rm 01})</math> can be derived, and substituted into the relationship <math display="inline">\widetilde{\nu} = \widetilde{\xi}(\mu,\omega,\nu,\tau,\lambda_{\rm 01},\alpha_{\rm 01}) - (\widetilde{\mu}(\mu,\omega,\nu,\tau,\lambda_{\rm 01},\alpha_{\rm 01}))^2</math>. The lower bound depends on <math display="inline">\tau</math> and <math display="inline">\nu</math>, and in the <math display="inline">\Omega^{-1}</math> listed, it is slightly above <math display="inline">\nu</math>.<br />
<br />
== Implementation Details ==<br />
<br />
=== Initialization ===<br />
<br />
Since SNNs have a fixed point at zero mean and unit variance for normalized weights <math>ω = \sum_{i = 1}^n w_i = 0 and τ = ni=1 wi2 = 1 (see above), we initialize SNNs such that these constraints are fulfilled in expectation. We draw the weights from a Gaussian distribution with E(wi) = 0 and variance Var(wi) = 1/n. Uniform and truncated Gaussian distributions with these moments led to networks with similar behavior. The “MSRA initialization” is similar since it uses zero mean and variance 2/n to initialize the weights [14]. The additional factor 2 counters the effect of rectified linear units.<br />
<br />
<br />
As previously explained, SNNs work best when inputs to the network are standardized, and the weights are initialized with mean of 0 and variance of <math display="inline">\frac{1}{n}</math> to help converge to the fixed point <math display="inline">(\mu, \nu) = (0, 1)</math>.<br />
<br />
=== Dropout Technique ===<br />
<br />
The authors reason that regular dropout, randomly setting activations to 0 with probability <math display="inline">1 - q</math>, is not compatible with SELUs. This is because the low variance region in SELUs is at <math display="inline">\lim_{x \rightarrow -\infty} = -\lambda \alpha</math>, not 0. Contrast this with ReLUs, which work well with dropout since they have <math display="inline">\lim_{x \rightarrow -\infty} = 0</math> as the saturation region. Therefore, a new dropout technique for SELUs was needed, termed ''alpha dropout''.<br />
<br />
With alpha dropout, activations are randomly set to <math display="inline">-\lambda\alpha = \alpha'</math>, which for this paper corresponds to the constant <math display="inline">1.7581</math>, with probability <math display="inline">1 - q</math>.<br />
<br />
The updated mean and variance of the activations are now:<br />
\[ \mathrm{E}(xd + \alpha'(1 - d)) = \mu q + \alpha'(1 - q) \] <br />
<br />
and<br />
<br />
\[ \mathrm{Var}(xd + \alpha'(1 - d)) = q((1-q)(\alpha' - \mu)^2 + \nu) \]<br />
<br />
Activations need to be transformed (e.g. scaled) after dropout to maintain the same mean and variance. In regular dropout, conserving the mean and variance correlates to scaling activations by a factor of 1/q while training. To ensure that mean and variance are unchanged after alpha dropout, the authors used an affine transformation <math display="inline">a(xd + \alpha'(1 - d)) + b</math>, and solved for the values of <math display="inline">a</math> and <math display="inline">b</math> to give <math display="inline">a = (\frac{\nu}{q((1-q)(\alpha' - \mu)^2 + \nu)})^{\frac{1}{2}}</math> and <math display="inline">b = \mu - a(q\mu + (1-q)\alpha'))</math>. As the values for <math display="inline">\mu</math> and <math display="inline">\nu</math> are set to <math display="inline">0</math> and <math display="inline">1</math> throughout the paper, these expressions can be simplified into <math display="inline">a = (q + \alpha'^2 q(1-q))^{-\frac{1}{2}}</math> and <math display="inline">b = -(q + \alpha^2 q (1-q))^{-\frac{1}{2}}((1 - q)\alpha')</math>, where <math display="inline">\alpha' \approx 1.7581</math>.<br />
<br />
Empirically, the authors found that dropout rates (1-q) of <math display="inline">0.05</math> or <math display="inline">0.10</math> worked well with SNNs.<br />
<br />
=== Optimizers ===<br />
<br />
Through experiments, the authors found that stochastic gradient descent, momentum, Adadelta and Adamax work well on SNNs. For Adam, configuration parameters <math display="inline">\beta_2 = 0.99</math> and <math display="inline">\epsilon = 0.01</math> were found to be more effective.<br />
<br />
==Experimental Results==<br />
<br />
Three sets of experiments were conducted to compare the performance of SNNs to six other FNN structures and to other machine learning algorithms, such as support vector machines and random forests. The experiments were carried out on (1) 121 UCI Machine Learning Repository datasets, (2) the Tox21 chemical compounds toxicity effects dataset (with 12,000 compounds and 270,000 features), and (3) the HTRU2 dataset of statistics on radio wave signals from pulsar candidates (with 18,000 observations and eight features). In each set of experiment, hyperparameter search was conducted on a validation set to select parameters such as the number of hidden units, number of hidden layers, learning rate, regularization parameter, and dropout rate (see pages 95 - 107 of the supplementary material for exact hyperparameters considered). Whenever models of different setups gave identical results on the validation data, preference was given to the structure with more layers, lower learning rate and higher dropout rate.<br />
<br />
The six FNN structures considered were: (1) FNNs with ReLU activations, no normalization and “Microsoft weight initialization” (MSRA) [5] to control the variance of input signals [5]; (2) FNNs with batch normalization [6], in which normalization is applied to activations of the same mini-batch; (3) FNNs with layer normalization [1], in which normalization is applied on a per layer basis for each training example; (4) FNNs with weight normalization [8], whereby each layer’s weights are normalized by learning the weight’s magnitude and direction instead of the weight vector itself; (5) highway networks, in which layers are not restricted to being sequentially connected [9]; and (6) an FNN-version of residual networks [4], with residual blocks made up of two or three densely connected layers.<br />
<br />
On the Tox21 dataset, the authors demonstrated the self-normalizing effect by comparing the distribution of neural inputs <math display="inline">z</math> at initialization and after 40 epochs of training to that of the standard normal. As Figure 3 show, the distribution of <math display="inline">z</math> remained similar to a normal distribution.<br />
<br />
[[File:snnf3.png|600px]]<br />
<br />
On all three sets of classification tasks, the authors demonstrated that SNN outperformed the other FNN counterparts on accuracy and AUC measures. <br />
<br />
SNN achieves close to the state-of-the-art results on the Tox21 dataset with an 8-layer network. The challenge requires prediction of toxic effects of 12000 chemicals based on their chemical structures. SNN with 8 layers had the best performance.<br />
<br />
[[File:tox21.png|600px]]<br />
<br />
<br />
On UCI datasets with fewer than 1,000 observations, SNNs did not outperform SVMs or random forests in terms of average rank in accuracy, but on datasets with at least 1,000 observations, SNNs showed the best overall performance (average rank of 5.8, compared to 6.1 for support vector machines and 6.6 for random forests). Through hyperparameter tuning, it was also discovered that the average depth of FNNs is 10.8 layers, more than the other FNN architectures tried.<br />
<br />
<br />
<br />
In pulsar predictions with the HTRU dataset SNNs produced a new state-of-the-art AUC by a small margin (achieving an AUC 0.98, averaged over 10 cross-validation folds, versus the previous record of 0.976).<br />
[[File:htru2.png|600px]]<br />
<br />
==Future Work==<br />
<br />
Although not the focus of this paper, the authors also briefly noted that their initial experiments with applying SELUs on relatively simple CNN structures showed promising results, which is not surprising given that ELUs, which do not have the self-normalizing property, has already been shown to work well with CNNs, demonstrating faster convergence than ReLU networks and even pushing the state-of-the-art error rates on CIFAR-100 at the time of publishing in 2015 [3].<br />
<br />
Since the paper was published, SELUs have been adopted by several researchers, not just with FNNs [https://github.com/bioinf-jku/SNNs see link], but also with CNNs, GANs, autoencoders, reinforcement learning and RNNs. In a few cases, researchers for those papers concluded that networks trained with SELUs converged faster than those trained with ReLUs, and that SELUs have the same convergence quality as batch normalization. There is potential for SELUs to be incorporated into more architectures in the future.<br />
<br />
==Critique==<br />
<br />
Overall, the authors presented a convincing case for using SELUs (along with proper initialization and alpha dropout) on FNNs. FNNs trained with SELU have more layers than those with other normalization techniques, so the work here provides a promising direction for making traditional FNNs more powerful. There are not as many well-established benchmark datasets to evaluate FNNs, but the experiments carried out, particularly on the larger Tox21 dataset, showed that SNNs can be very effective at classification tasks.<br />
<br />
The proofs provide a satisfactory explanation for why SELUs have a self-normalising property within the specified domain, but during their introduction the authors give 4 criteria that an activation function must satisfy in order to be self-normalising. Those criteria make intuitive sense, but there is a lack of firm justification which creates some confusion. For example, they state that SNNs cannot be derived from tanh units, even though <math> tanh(\lambda x) </math> satisfies all 4 criteria if <math> \lambda </math> is larger than 1. Assuming the authors did not overlook such a simple modification, there must be some additional criteria for an activation function to have a self-normalising property.<br />
<br />
The only question I have with the proofs is the lack of explanation for how the domains, <math display="inline">\Omega</math>, <math display="inline">\Omega^-</math> and <math display="inline">\Omega^+</math> are determined, which is an important consideration because they are used for deriving the upper and lower bounds on expressions needed for proving the three theorems. The ranges appear somewhat set through trial-and-error and heuristics to ensure the numbers work out (e.g. make the spectral norm [12] of <math display="inline">\mathcal{J}</math> as large as can be below 1 so as to ensure <math display="inline">g</math> is a contraction mapping), so it is not clear whether they are unique conditions, or that the parameters will remain within those prespecified ranges throughout training; and if the parameters can stray away from the ranges provided, then the issue of what will happen to the self-normalizing property was not addressed. Perhaps that is why the authors gave preference to models with a deeper structure and smaller learning rate during experiments to help the parameters stay within their domains. Further, in addition to the hyperparameters considered, it would be helpful to know the final values that went into the best-performing models, for a better understanding of what range of values work better for SNNs empirically.<br />
<br />
==Conclusion==<br />
<br />
The SNN structure proposed in this paper is built on the traditional FNN structure with a few modifications, including the use of SELUs as the activation function (with <math display="inline">\lambda \approx 1.0507</math> and <math display="inline">\alpha \approx 1.6733</math>), alpha dropout, network weights initialized with mean of zero and variance <math display="inline">\frac{1}{n}</math>, and inputs normalized to mean of zero and variance of one. It is simple to implement while being backed up by detailed theory. <br />
<br />
When properly initialized, SELUs will draw neural inputs towards a fixed point of zero mean and unit variance as the activations are propagated through the layers. The self-normalizing property is maintained even when weights deviate from their initial values during training (under mild conditions). When the variance of inputs goes beyond the prespecified range imposed, they are still bounded above and below so SNNs do not suffer from exploding and vanishing gradients. This self-normalizing property allows SNNs to be more robust to perturbations in stochastic gradient descent, so deeper structures with better prediction performance can be built. <br />
<br />
In the experiments conducted, the authors demonstrated that SNNs outperformed FNNs trained with other normalization techniques, such as batch, layer and weight normalization, and specialized architectures, such as highway or residual networks, on several classification tasks, including on the UCI Machine Learning Repository datasets. SELUs help in reducing the computation time for normalizing the network relative to RELU+BN and hence is promising. The adoption of SELUs by other researchers also lends credence to the potential for SELUs to be implemented in more neural network architectures.<br />
<br />
==References==<br />
<br />
# Ba, Kiros and Hinton. "Layer Normalization". arXiv:1607.06450. (2016).<br />
# Blinn. "Consider the Lowly 2X2 Matrix." IEEE Computer Graphics and Applications. (1996).<br />
# Clevert, Unterthiner, Hochreiter. "Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)." arXiv: 1511.07289. (2015).<br />
# He, Zhang, Ren and Sun. "Deep Residual Learning for Image Recognition." arXiv:1512.03385. (2015).<br />
# He, Zhang, Ren and Sun. "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification." arXiv:1502.01852. (2015). <br />
# Ioffe and Szegedy. "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariance Shift." arXiv:1502.03167. (2015).<br />
# Klambauer, Unterthiner, Mayr and Hochreiter. "Self-Normalizing Neural Networks." arXiv: 1706.02515. (2017).<br />
# Salimans and Kingma. "Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks." arXiv:1602.07868. (2016).<br />
# Srivastava, Greff and Schmidhuber. "Highway Networks." arXiv:1505.00387 (2015).<br />
# Unterthiner, Mayr, Klambauer and Hochreiter. "Toxicity Prediction Using Deep Learning." arXiv:1503.01445. (2015). <br />
# https://en.wikipedia.org/wiki/Central_limit_theorem <br />
# http://mathworld.wolfram.com/SpectralNorm.html <br />
# https://www.math.umd.edu/~petersd/466/fixedpoint.pdf<br />
<br />
==Online Resources==<br />
https://github.com/bioinf-jku/SNNs (GitHub repository maintained by some of the paper's authors)<br />
<br />
==Footnotes==<br />
<br />
1. Error propagation analysis: The authors performed an error analysis to quantify the potential numerical imprecisions propagated through the numerous operations performed. The potential imprecision <math display="inline">\epsilon</math> was quantified by applying the mean value theorem<br />
<br />
\[ |f(x + \Delta x - f(x)| \leqslant ||\triangledown f(x + t\Delta x|| ||\Delta x|| \textrm{ for } t \in [0, 1]\textrm{.} \] <br />
<br />
The error propagation rules, or <math display="inline">|f(x + \Delta x - f(x)|</math>, was first obtained for simple operations such as addition, subtraction, multiplication, division, square root, exponential function, error function and complementary error function. Them, the error bounds on the compound terms making up <math display="inline">\Delta (S(\mu, \omega, \nu, \tau, \lambda, \alpha)</math> were found by decomposing them into the simpler expressions. If each of the variables have a precision of <math display="inline">\epsilon</math>, then it turns out <math display="inline">S</math> has a precision better than <math display="inline">292\epsilon</math>. For a machine with a precision of <math display="inline">2^{-56}</math>, the rounding error is <math display="inline">\epsilon \approx 10^{-16}</math>, and <math display="inline">292\epsilon < 10^{-13}</math>. In addition, all computations are correct up to 3 ulps (“unit in last place”) for the hardware architectures and GNU C library used, with 1 ulp being the highest precision that can be achieved.<br />
<br />
2. Independence Assumption: The classic definition of central limit theorem requires <math display="inline">x_i</math>’s to be independent and identically distributed, which is not guaranteed to hold true in a neural network layer. However, according to the Lyapunov CLT, the <math display="inline">x_i</math>’s do not need to be identically distributed as long as the <math display="inline">(2 + \delta)</math>th moment exists for the variables and meet the Lyapunov condition for the rate of growth of the sum of the moments [11]. In addition, CLT has also shown to be valid under weak dependence under mixing conditions [11]. Therefore, the authors argue that the central limit theorem can be applied with network inputs.<br />
<br />
3. <math display="inline">\mathcal{H}</math> versus <math display="inline">\mathcal{J}</math> Jacobians: In solving for the largest singular value of the Jacobian <math display="inline">\mathcal{H}</math> for the mapping <math display="inline">g: (\mu, \nu)</math>, the authors first worked with the terms in the Jacobian <math display="inline">\mathcal{J}</math> for the mapping <math display="inline">h: (\mu, \nu) \rightarrow (\widetilde{\mu}, \widetilde{\xi})</math> instead, because the influence of <math display="inline">\widetilde{\mu}</math> on <math display="inline">\widetilde{\nu}</math> is small when <math display="inline">\widetilde{\mu}</math> is small in <math display="inline">\Omega</math> and <math display="inline">\mathcal{H}</math> can be easily expressed as terms in <math display="inline">\mathcal{J}</math>. <math display="inline">\mathcal{J}</math> was referenced in the paper, but I used <math display="inline">\mathcal{H}</math> in the summary here to avoid confusion.</div>F7xiahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Self_Normalizing_Neural_Networks&diff=36440stat946w18/Self Normalizing Neural Networks2018-04-21T03:21:46Z<p>F7xia: /* Initialization */</p>
<hr />
<div>==Introduction and Motivation==<br />
<br />
While neural networks have been making a lot of headway in improving benchmark results and narrowing the gap with human-level performance, success has been fairly limited to visual and sequential processing tasks through advancements in convolutional network and recurrent network structures. Most data science competitions outside of these domains are still outperformed by algorithms such as gradient boosting and random forests. The traditional (densely connected) feed-forward neural networks (FNNs) are rarely used competitively, and when they do win on rare occasions, they have very shallow network architectures with just up to four layers [10].<br />
<br />
The authors, Klambauer et al., believe that what prevents FNNs from becoming more competitively useful is the inability to train a deeper FNN structure, which would allow the network to learn more levels of abstract representations. Primarily, the difficulty arises due to the instability of gradients in very deep FNNs leading to problems like gradient vanishing/explosion. To have a deeper network, oscillations in the distribution of activations need to be kept under control so that stable gradients can be obtained during training. Several techniques are available to normalize activations, including batch normalization [6], layer normalization [1] and weight normalization [8]. These methods work well with CNNs and RNNs, but not so much with FNNs because backpropagating through normalization parameters introduces additional variance to the gradients, and regularization techniques like dropout further perturb the normalization effect. CNNs and RNNs are less sensitive to such perturbations, presumably due to their weight sharing architecture, but FNNs do not have such a property and thus suffer from high variance in training errors, which hinders learning. Furthermore, the aforementioned normalization techniques involve adding external layers to the model and can slow down computation, which may already be slow when working with very deep FNNs. <br />
<br />
Therefore, the authors were motivated to develop a new FNN implementation that can achieve the intended effect of normalization techniques that works well with stochastic gradient descent and dropout. Self-normalizing neural networks (SNNs) are based on the idea of scaled exponential linear units (SELU), a new activation function introduced in this paper, whose output distribution is proved to converge to a fixed point, thus making it possible to train deeper networks.<br />
<br />
==Notations==<br />
<br />
As the paper (primarily in the supplementary materials) comes with lengthy proofs, important notations are listed first.<br />
<br />
Consider two fully-connected layers, let <math display="inline">x</math> denote the inputs to the second layer, then <math display="inline">z = Wx</math> represents the network inputs of the second layer, and <math display="inline">y = f(z)</math> represents the activations in the second layer.<br />
<br />
Assume that all activations from a lower layer <math display="inline">x_i</math>'s, <math display="inline">1 \leqslant i \leqslant n</math>, have the same mean <math display="inline">\mu := \mathrm{E}(x_i)</math> and the same variance <math display="inline">\nu := \mathrm{Var}(x_i)</math> and that each <math display="inline">y</math> has mean <math display="inline">\widetilde{\mu} := \mathrm{E}(y)</math> and variance <math display="inline">\widetilde{\nu} := \mathrm{Var}(y)</math>, then let <math display="inline">g</math> be the set of functions that maps <math display="inline">(\mu, \nu)</math> to <math display="inline">(\widetilde{\mu}, \widetilde{\nu})</math>. <br />
<br />
For the weight vector <math display="inline">w</math>, <math display="inline">n</math> times the mean of the weight vector is <math display="inline">\omega := \sum_{i = 1}^n \omega_i</math> and <math display="inline">n</math> times the second moment is <math display="inline">\tau := \sum_{i = 1}^{n} w_i^2</math>.<br />
<br />
==Key Concepts==<br />
<br />
===Self-Normalizing Neural-Net (SNN)===<br />
<br />
''A neural network is self-normalizing if it possesses a mapping <math display="inline">g: \Omega \rightarrow \Omega</math> for each activation <math display="inline">y</math> that maps mean and variance from one layer to the next and has a stable and attracting fixed point depending on <math display="inline">(\omega, \tau)</math> in <math display="inline">\Omega</math>. Furthermore, the mean and variance remain in the domain <math display="inline">\Omega</math>, that is <math display="inline">g(\Omega) \subseteq \Omega</math>, where <math display="inline">\Omega = \{ (\mu, \nu) | \mu \in [\mu_{min}, \mu_{max}], \nu \in [\nu_{min}, \nu_{max}] \}</math>. When iteratively applying the mapping <math display="inline">g</math>, each point within <math display="inline">\Omega</math> converges to this fixed point.''<br />
<br />
In other words, in SNNs, if the inputs from an earlier layer (<math display="inline">x</math>) already have their mean and variance within a predefined interval <math display="inline">\Omega</math>, then the activations to the next layer (<math display="inline">y = f(z = Wx)</math>) should remain within those intervals. This is true across all pairs of connecting layers as the normalizing effect gets propagated through the network, hence why the term self-normalizing. When the mapping is applied iteratively, it should draw the mean and variance values closer to a fixed point within <math display="inline">\Omega</math>, the value of which depends on <math display="inline">\omega</math> and <math display="inline">\tau</math> (recall that they are from the weight vector).<br />
<br />
We will design a FNN then construct a g that takes the mean and variance of each layer to those of the next and is a contraction mapping i.e. <math>g(\mu_i, \nu_i) = (\mu_{i+1}, \nu_{i+1}) \forall i </math>. It should be noted that although the g required in the SNN definition depends on <math display="inline">(\omega, \tau)</math> of an individual layer, the FNN that we construct will have the same values of <math display="inline">(\omega, \tau)</math> for each layer. Intuitively this definition can be interpreted as saying that the mean and variance of the final layer of a sufficiently deep SNN will not change when the mean and variance of the input data change. This is because the mean and variance are passing through a contraction mapping at each layer, converging to the mapping's fixed point.<br />
<br />
The activation function that makes an SNN possible should meet the following four conditions:<br />
<br />
# It can take on both negative and positive values, so it can normalize the mean;<br />
# It has a saturation region, so it can dampen variances that are too large;<br />
# It has a slope larger than one, so it can increase variances that are too small; and<br />
# It is a continuous curve, which is necessary for the fixed point to exist (see the definition of Banach fixed point theorem to follow).<br />
<br />
Commonly used activation functions such as rectified linear units (ReLU), sigmoid, tanh, leaky ReLUs and exponential linear units (ELUs) do not meet all four criteria, therefore, a new activation function is needed.<br />
<br />
===Scaled Exponential Linear Units (SELUs)===<br />
<br />
One of the main ideas introduced in this paper is the SELU function. As the name suggests, it is closely related to ELU [3],<br />
<br />
\[ \mathrm{elu}(x) = \begin{cases} x & x > 0 \\<br />
\alpha e^x - \alpha & x \leqslant 0<br />
\end{cases} \]<br />
<br />
but further builds upon it by introducing a new scale parameter $\lambda$ and proving the exact values that $\alpha$ and $\lambda$ should take on to achieve self-normalization. SELU is defined as:<br />
<br />
\[ \mathrm{selu}(x) = \lambda \begin{cases} x & x > 0 \\<br />
\alpha e^x - \alpha & x \leqslant 0<br />
\end{cases} \]<br />
<br />
SELUs meet all four criteria listed above - it takes on positive values when <math display="inline">x > 0</math> and negative values when <math display="inline">x < 0</math>, it has a saturation region when <math display="inline">x</math> is a larger negative value, the value of <math display="inline">\lambda</math> can be set to greater than one to ensure a slope greater than one, and it is continuous at <math display="inline">x = 0</math>. <br />
<br />
Figure 1 below gives an intuition for how SELUs normalize activations across layers. As shown, a variance dampening effect occurs when inputs are negative and far away from zero, and a variance increasing effect occurs when inputs are close to zero.<br />
<br />
[[File:snnf1.png|500px]]<br />
<br />
Figure 2 below plots the progression of training error on the MNIST and CIFAR10 datasets when training with SNNs versus FNNs with batch normalization at varying model depths. As shown, FNNs that adopted the SELU activation function exhibited lower and less variable training loss compared to using batch normalization, even as the depth increased to 16 and 32 layers.<br />
<br />
[[File:snnf2.png|600px]]<br />
<br />
=== Banach Fixed Point Theorem and Contraction Mappings ===<br />
<br />
The underlying theory behind SNNs is the Banach fixed point theorem, which states the following: ''Let <math display="inline">(X, d)</math> be a non-empty complete metric space with a contraction mapping <math display="inline">f: X \rightarrow X</math>. Then <math display="inline">f</math> has a unique fixed point <math display="inline">x_f \subseteq X</math> with <math display="inline">f(x_f) = x_f</math>. Every sequence <math display="inline">x_n = f(x_{n-1})</math> with starting element <math display="inline">x_0 \subseteq X</math> converges to the fixed point: <math display="inline">x_n \underset{n \rightarrow \infty}\rightarrow x_f</math>.''<br />
<br />
A contraction mapping is a function <math display="inline">f: X \rightarrow X</math> on a metric space <math display="inline">X</math> with distance <math display="inline">d</math>, such that for all points <math display="inline">\mathbf{u}</math> and <math display="inline">\mathbf{v}</math> in <math display="inline">X</math>: <math display="inline">d(f(\mathbf{u}), f(\mathbf{v})) \leqslant \delta d(\mathbf{u}, \mathbf{v})</math>, for a <math display="inline">0 \leqslant \delta \leqslant 1</math>.<br />
<br />
The easiest way to prove a contraction mapping is usually to show that the spectral norm [12] of its Jacobian is less than 1 [13], as was done for this paper.<br />
<br />
==Proving the Self-Normalizing Property==<br />
<br />
===Mean and Variance Mapping Function===<br />
<br />
<math display="inline">g</math> is derived under the assumption that <math display="inline">x_i</math>'s are independent but not necessarily having the same mean and variance [[#Footnotes |(2)]]. Under this assumption (and recalling earlier notation of <math display="inline">\omega</math> and <math display="inline">\tau</math>),<br />
<br />
\begin{align}<br />
\mathrm{E}(z = \mathbf{w}^T \mathbf{x}) = \sum_{i = 1}^n w_i \mathrm{E}(x_i) = \mu \omega<br />
\end{align}<br />
<br />
\begin{align}<br />
\mathrm{Var}(z) = \mathrm{Var}(\sum_{i = 1}^n w_i x_i) = \sum_{i = 1}^n w_i^2 \mathrm{Var}(x_i) = \nu \sum_{i = 1}^n w_i^2 = \nu\tau \textrm{ .}<br />
\end{align}<br />
<br />
When the weight terms are normalized, <math display="inline">z</math> can be viewed as a weighted sum of <math display="inline">x_i</math>'s. Wide neural net layers with a large number of nodes is common, so <math display="inline">n</math> is usually large, and by the Central Limit Theorem, <math display="inline">z</math> approaches a normal distribution <math display="inline">\mathcal{N}(\mu\omega, \sqrt{\nu\tau})</math>. <br />
<br />
Using the above property, the exact form for <math display="inline">g</math> can be obtained using the definitions for mean and variance of continuous random variables: <br />
<br />
[[File:gmapping.png|600px|center]]<br />
<br />
Analytical solutions for the integrals can be obtained as follows: <br />
<br />
[[File:gintegral.png|600px|center]]<br />
<br />
The authors are interested in the fixed point <math display="inline">(\mu, \nu) = (0, 1)</math> as these are the parameters associated with the common standard normal distribution. The authors also proposed using normalized weights such that <math display="inline">\omega = \sum_{i = 1}^n = 0</math> and <math display="inline">\tau = \sum_{i = 1}^n w_i^2= 1</math> as it gives a simpler, cleaner expression for <math display="inline">\widetilde{\mu}</math> and <math display="inline">\widetilde{\nu}</math> in the calculations in the next steps. This weight scheme can be achieved in several ways, for example, by drawing from a normal distribution <math display="inline">\mathcal{N}(0, \frac{1}{n})</math> or from a uniform distribution <math display="inline">U(-\sqrt{3}, \sqrt{3})</math>.<br />
<br />
At <math display="inline">\widetilde{\mu} = \mu = 0</math>, <math display="inline">\widetilde{\nu} = \nu = 1</math>, <math display="inline">\omega = 0</math> and <math display="inline">\tau = 1</math>, the constants <math display="inline">\lambda</math> and <math display="inline">\alpha</math> from the SELU function can be solved for - <math display="inline">\lambda_{01} \approx 1.0507</math> and <math display="inline">\alpha_{01} \approx 1.6733</math>. These values are used throughout the rest of the paper whenever an expression calls for <math display="inline">\lambda</math> and <math display="inline">\alpha</math>.<br />
<br />
===Details of Moment-Mapping Integrals ===<br />
Consider the moment-mapping integrals:<br />
\begin{align}<br />
\widetilde{\mu} & = \int_{-\infty}^\infty \mathrm{selu} (z) p_N(z; \mu \omega, \sqrt{\nu \tau})dz\\<br />
\widetilde{\nu} & = \int_{-\infty}^\infty \mathrm{selu} (z)^2 p_N(z; \mu \omega, \sqrt{\nu \tau}) dz-\widetilde{\mu}^2.<br />
\end{align}<br />
<br />
The equation for <math display="inline">\widetilde{\mu}</math> can be expanded as <br />
\begin{align}<br />
\widetilde{\mu} & = \frac{\lambda}{2}\left( 2\alpha\int_{-\infty}^0 (\exp(z)-1) p_N(z; \mu \omega, \sqrt{\nu \tau})dz +2\int_{0}^\infty z p_N(z; \mu \omega, \sqrt{\nu \tau})dz \right)\\<br />
&= \frac{\lambda}{2}\left( 2 \alpha \frac{1}{\sqrt{2\pi\tau\nu}} \int_{-\infty}^0 (\exp(z)-1) \exp(\frac{-1}{2\tau \nu} (z-\mu \omega)^2 ) dz +2\frac{1}{\sqrt{2\pi\tau\nu}}\int_{0}^\infty z \exp(\frac{-1}{2\tau \nu} (z-\mu \omega)^2 dz \right)\\<br />
&= \frac{\lambda}{2}\left( 2 \alpha\frac{1}{\sqrt{2\pi\tau\nu}}\int_{-\infty}^0 \exp(z) \exp(\frac{-1}{2\tau \nu} (z-\mu \omega)^2 ) dz - 2 \alpha\frac{1}{\sqrt{2\pi\tau\nu}}\int_{-\infty}^0 \exp(\frac{-1}{2\tau \nu} (z-\mu \omega)^2 ) dz +2\frac{1}{\sqrt{2\pi\tau\nu}}\int_{0}^\infty z \exp(\frac{-1}{2\tau \nu} (z-\mu \omega)^2 dz \right)\\<br />
\end{align}<br />
<br />
The first integral can be simplified via the substituiton<br />
\begin{align}<br />
q:= \frac{1}{\sqrt{2\tau \nu}}(z-\mu \omega -\tau \nu).<br />
\end{align}<br />
While the second and third can be simplified via the substitution<br />
\begin{align}<br />
q:= \frac{1}{\sqrt{2\tau \nu}}(z-\mu \omega ).<br />
\end{align}<br />
Using the definitions of <math display="inline">\mathrm{erf}</math> and <math display="inline">\mathrm{erfc}</math> then yields the result of the previous section.<br />
<br />
===Self-Normalizing Property Under Normalized Weights===<br />
<br />
Assuming the the weights normalized with <math display="inline">\omega=0</math> and <math display="inline">\tau=1</math>, it is possible to calculate the exact value for the spectral norm [12] of <math display="inline">g</math>'s Jacobian around the fixed point <math display="inline">(\mu, \nu) = (0, 1)</math>, which turns out to be <math display="inline">0.7877</math>. Thus, at initialization, SNNs have a stable and attracting fixed point at <math display="inline">(0, 1)</math>, which means that when <math display="inline">g</math> is applied iteratively to a pair <math display="inline">(\mu_{new}, \nu_{new})</math>, it should draw the points closer to <math display="inline">(0, 1)</math>. The rate of convergence is determined by the spectral norm [12], whose value depends on <math display="inline">\mu</math>, <math display="inline">\nu</math>, <math display="inline">\omega</math> and <math display="inline">\tau</math>.<br />
<br />
[[File:paper10_fig2.png|600px|frame|none|alt=Alt text|The figure illustrates, in the scenario described above, the mapping <math display="inline">g</math> of mean and variance <math display="inline">(\mu, \nu)</math> to <math display="inline">(\mu_{new}, \nu_{new})</math>. The arrows show the direction <math display="inline">(\mu, \nu)</math> is mapped by <math display="inline">g: (\mu, \nu)\mapsto(\mu_{new}, \nu_{new})</math>. One can clearly see the fixed point mapping <math display="inline">g</math> is at <math display="inline">(0, 1)</math>.]]<br />
<br />
===Self-Normalizing Property Under Unnormalized Weights===<br />
<br />
As weights are updated during training, there is no guarantee that they would remain normalized. The authors addressed this issue through the first key theorem presented in the paper, which states that a fixed point close to (0, 1) can still be obtained if <math display="inline">\mu</math>, <math display="inline">\nu</math>, <math display="inline">\omega</math> and <math display="inline">\tau</math> are restricted to a specified range. <br />
<br />
Additionally, there is no guarantee that the mean and variance of the inputs would stay within the range given by the first theorem, which led to the development of theorems #2 and #3. These two theorems established an upper and lower bound on the variance of inputs if the variance of activations from the previous layer are above or below the range specified, respectively. This ensures that the variance would not explode or vanish after being propagated through the network.<br />
<br />
The theorems come with lengthy proofs in the supplementary materials for the paper. High-level proof sketches are presented here.<br />
<br />
====Theorem 1: Stable and Attracting Fixed Points Close to (0, 1)====<br />
<br />
'''Definition:''' We assume <math display="inline">\alpha = \alpha_{01}</math> and <math display="inline">\lambda = \lambda_{01}</math>. We restrict the range of the variables to the domain <math display="inline">\mu \in [-0.1, 0.1]</math>, <math display="inline">\omega \in [-0.1, 0.1]</math>, <math display="inline">\nu \in [0.8, 1.5]</math>, and <math display="inline">\tau \in [0.9, 1.1]</math>. For <math display="inline">\omega = 0</math> and <math display="inline">\tau = 1</math>, the mapping has the stable fixed point <math display="inline">(\mu, \nu) = (0, 1</math>. For other <math display="inline">\omega</math> and <math display="inline">\tau</math>, g has a stable and attracting fixed point depending on <math display="inline">(\omega, \tau)</math> in the <math display="inline">(\mu, \nu)</math>-domain: <math display="inline">\mu \in [-0.03106, 0.06773]</math> and <math display="inline">\nu \in [0.80009, 1.48617]</math>. All points within the <math display="inline">(\mu, \nu)</math>-domain converge when iteratively applying the mapping to this fixed point.<br />
<br />
'''Proof:''' In order to show the the mapping <math display="inline">g</math> has a stable and attracting fixed point close to <math display="inline">(0, 1)</math>, the authors again applied Banach's fixed point theorem, which states that a contraction mapping on a nonempty complete metric space that does not map outside its domain has a unique fixed point, and that all points in the <math display="inline">(\mu, \nu)</math>-domain converge to the fixed point when <math display="inline">g</math> is iteratively applied. <br />
<br />
The two requirements are proven as follows:<br />
<br />
'''1. g is a contraction mapping.'''<br />
<br />
For <math display="inline">g</math> to be a contraction mapping in <math display="inline">\Omega</math> with distance <math display="inline">||\cdot||_2</math>, there must exist a Lipschitz constant <math display="inline">M < 1</math> such that: <br />
<br />
\begin{align} <br />
\forall \mu, \nu \in \Omega: ||g(\mu) - g(\nu)||_2 \leqslant M||\mu - \nu||_2 <br />
\end{align}<br />
<br />
As stated earlier, <math display="inline">g</math> is a contraction mapping if the spectral norm [12] of the Jacobian <math display="inline">\mathcal{H}</math> [[#Footnotes | (3)]] is below one, or equivalently, if the the largest singular value of <math display="inline">\mathcal{H}</math> is less than 1.<br />
<br />
To find the singular values of <math display="inline">\mathcal{H}</math>, the authors used an explicit formula derived by Blinn [2] for <math display="inline">2\times2</math> matrices, which states that the largest singular value of the matrix is <math display="inline">\frac{1}{2}(\sqrt{(a_{11} + a_{22}) ^ 2 + (a_{21} - a{12})^2} + \sqrt{(a_{11} - a_{22}) ^ 2 + (a_{21} + a{12})^2})</math>.<br />
<br />
For <math display="inline">\mathcal{H}</math>, an expression for the largest singular value of <math display="inline">\mathcal{H}</math>, made up of the first-order partial derivatives of the mapping <math display="inline">g</math> with respect to <math display="inline">\mu</math> and <math display="inline">\nu</math>, can be derived given the analytical solutions for <math display="inline">\widetilde{\mu}</math> and <math display="inline">\widetilde{\nu}</math> (and denoted <math display="inline">S(\mu, \omega, \nu, \tau, \lambda, \alpha)</math>).<br />
<br />
From the mean value theorem, we know that for a <math display="inline">t \in [0, 1]</math>, <br />
<br />
[[File:seq.png|600px|center]]<br />
<br />
Therefore, the distance of the singular value at <math display="inline">S(\mu, \omega, \nu, \tau, \lambda_{\mathrm{01}}, \alpha_{\mathrm{01}})</math> and at <math display="inline">S(\mu + \Delta\mu, \omega + \Delta\omega, \nu + \Delta\nu, \tau \Delta\tau, \lambda_{\mathrm{01}}, \alpha_{\mathrm{01}})</math> can be bounded above by <br />
<br />
[[File:seq2.png|600px|center]]<br />
<br />
An upper bound was obtained for each partial derivative term above, mainly through algebraic reformulations and by making use of the fact that many of the functions are monotonically increasing or decreasing on the variables they depend on in <math display="inline">\Omega</math> (see pages 17 - 25 in the supplementary materials).<br />
<br />
The <math display="inline">\Delta</math> terms were then set (rather arbitrarily) to be: <math display="inline">\Delta \mu=0.0068097371</math>,<br />
<math display="inline">\Delta \omega=0.0008292885</math>, <math display="inline">\Delta \nu=0.0009580840</math>, and <math display="inline">\Delta \tau=0.0007323095</math>. Plugging in the upper bounds on the absolute values of the derivative terms for <math display="inline">S</math> and the <math display="inline">\Delta</math> terms yields<br />
<br />
\[ S(\mu + \Delta \mu,\omega + \Delta \omega,\nu + \Delta \nu,\tau + \Delta \tau,\lambda_{\rm 01},\alpha_{\rm 01}) - S(\mu,\omega,\nu,\tau,\lambda_{\rm 01},\alpha_{\rm 01}) < 0.008747 \]<br />
<br />
Next, the largest singular value is found from a computer-assisted fine grid-search [[#Footnotes | (1)]] over the domain <math display="inline">\Omega</math>, with grid lengths <math display="inline">\Delta \mu=0.0068097371</math>, <math display="inline">\Delta \omega=0.0008292885</math>, <math display="inline">\Delta \nu=0.0009580840</math>, and <math display="inline">\Delta \tau=0.0007323095</math>, which turned out to be <math display="inline">0.9912524171058772</math>. Therefore, <br />
<br />
\[ S(\mu + \Delta \mu,\omega + \Delta \omega,\nu + \Delta \nu,\tau + \Delta \tau,\lambda_{\rm 01},\alpha_{\rm 01}) \leq 0.9912524171058772 + 0.008747 < 1 \]<br />
<br />
Since the largest singular value is smaller than 1, <math display="inline>g</math> is a contraction mapping.<br />
<br />
'''2. g does not map outside its domain.'''<br />
<br />
To prove that <math display="inline">g</math> does not map outside of the domain <math display="inline">\mu \in [-0.1, 0.1]</math> and <math display="inline">\nu \in [0.8, 1.5]</math>, lower and upper bounds on <math display="inline">\widetilde{\mu}</math> and <math display="inline">\widetilde{\nu}</math> were obtained to show that they stay within <math display="inline">\Omega</math>. <br />
<br />
First, it was shown that the derivatives of <math display="inline">\widetilde{\mu}</math> and <math display="inline">\widetilde{\xi}</math> with respect to <math display="inline">\mu</math> and <math display="inline">\nu</math> are either positive or have the sign of <math display="inline">\omega</math> in <math display="inline">\Omega</math>, so the minimum and maximum points are found at the borders. In <math display="inline">\Omega</math>, it then follows that<br />
<br />
\begin{align}<br />
-0.03106 <\widetilde{\mu}(-0.1,0.1, 0.8, 0.95, \lambda_{\rm 01}, \alpha_{\rm 01}) \leq & \widetilde{\mu} \leq \widetilde{\mu}(0.1,0.1,1.5, 1.1, \lambda_{\rm 01}, \alpha_{\rm 01}) < 0.06773<br />
\end{align}<br />
<br />
and <br />
<br />
\begin{align}<br />
0.80467 <\widetilde{\xi}(-0.1,0.1, 0.8, 0.95, \lambda_{\rm 01}, \alpha_{\rm 01}) \leq & \widetilde{\xi} \leq \widetilde{\xi}(0.1,0.1,1.5, 1.1, \lambda_{\rm 01}, \alpha_{\rm 01}) < 1.48617.<br />
\end{align}<br />
<br />
Since <math display="inline">\widetilde{\nu} = \widetilde{\xi} - \widetilde{\mu}^2</math>, <br />
<br />
\begin{align}<br />
0.80009 & \leqslant \widetilde{\nu} \leqslant 1.48617<br />
\end{align}<br />
<br />
The bounds on <math display="inline">\widetilde{\mu}</math> and <math display="inline">\widetilde{\nu}</math> are narrower than those for <math display="inline">\mu</math> and <math display="inline">\nu</math> set out in <math display="inline">\Omega</math>, therefore <math display="inline">g(\Omega) \subseteq \Omega</math>.<br />
<br />
==== Theorem 2: Decreasing Variance from Above ====<br />
<br />
'''Definition:''' For <math display="inline">\lambda = \lambda_{01}</math>, <math display="inline">\alpha = \alpha_{01}</math>, and the domain <math display="inline">\Omega^+: -1 \leqslant \mu \leqslant 1, -0.1 \leqslant \omega \leqslant 0.1, 3 \leqslant \nu \leqslant 16</math>, and <math display="inline">0.8 \leqslant \tau \leqslant 1.25</math>, we have for the mapping of the variance <math display="inline">\widetilde{\nu}(\mu, \omega, \nu, \tau, \lambda, \alpha)</math> under <math display="inline">g</math>: <math display="inline">\widetilde{\nu}(\mu, \omega, \nu, \tau, \lambda, \alpha) < \nu</math>.<br />
<br />
Theorem 2 states that when <math display="inline">\nu \in [3, 16]</math>, the mapping <math display="inline">g</math> draws it to below 3 when applied across layers, thereby establishing an upper bound of <math display="inline">\nu < 3</math> on variance.<br />
<br />
'''Proof:''' The authors proved the inequality by showing that <math display="inline">g(\mu, \omega, \xi, \tau, \lambda_{01}, \alpha_{01}) = \widetilde{\xi}(\mu, \omega, \xi, \tau, \lambda_{01}, \alpha_{01}) - \nu < 0</math>, since the second moment should be greater than or equal to variance <math display="inline">\widetilde{\nu}</math>. The behavior of <math display="inline">\frac{\partial }{\partial \mu } \widetilde{\xi}(\mu, \omega, \nu, \tau, \lambda, \alpha)</math>, <math display="inline">\frac{\partial }{\partial \omega } \widetilde{\xi}(\mu, \omega, \nu, \tau, \lambda, \alpha)</math>, <math display="inline">\frac{\partial }{\partial \nu } \widetilde{\xi}(\mu, \omega, \nu, \tau, \lambda, \alpha)</math>, and <math display="inline">\frac{\partial }{\partial \tau } \widetilde{\xi}(\mu, \omega, \nu, \tau, \lambda, \alpha)</math> are used to find the bounds on <math display="inline">g(\mu, \omega, \xi, \tau, \lambda_{01}, \alpha_{01})</math> (see pages 9 - 13 in the supplementary materials). Again, the partial derivative terms were monotonic, which made it possible to find the upper bound at the board values. It was shown that the maximum value of <math display="inline">g</math> does not exceed <math display="inline">-0.0180173</math>.<br />
<br />
==== Theorem 3: Increasing Variance from Below ====<br />
<br />
'''Definition''': We consider <math display="inline">\lambda = \lambda_{01}</math>, <math display="inline">\alpha = \alpha_{01}</math>, and the domain <math display="inline">\Omega^-: -0.1 \leqslant \mu \leqslant 0.1</math> and <math display="inline">-0.1 \leqslant \omega \leqslant 0.1</math>. For the domain <math display="inline">0.02 \leqslant \nu \leqslant 0.16</math> and <math display="inline">0.8 \leqslant \tau \leqslant 1.25</math> as well as for the domain <math display="inline">0.02 \leqslant \nu \leqslant 0.24</math> and <math display="inline">0.9 \leqslant \tau \leqslant 1.25</math>, the mapping of the variance <math display="inline">\widetilde{\nu}(\mu, \omega, \nu, \tau, \lambda, \alpha)</math> increases: <math display="inline">\widetilde{\nu}(\mu, \omega, \nu, \tau, \lambda, \alpha) > \nu</math>.<br />
<br />
Theorem 3 states that the variance <math display="inline">\widetilde{\nu}</math> increases when variance is smaller than in <math display="inline">\Omega</math>. The lower bound on variance is <math display="inline">\widetilde{\nu} > 0.16</math> when <math display="inline">0.8 \leqslant \tau</math> and <math display="inline">\widetilde{\nu} > 0.24</math> when <math display="inline">0.9 \leqslant \tau</math> under the proposed mapping.<br />
<br />
'''Proof:''' According to the mean value theorem, for a <math display="inline">t \in [0, 1]</math>,<br />
<br />
[[File:th3.png|700px|center]]<br />
<br />
Similar to the proof for Theorem 2 (except we are interested in the smallest <math display="inline">\widetilde{\nu}</math> instead of the biggest), the lower bound for <math display="inline">\frac{\partial }{\partial \nu} \widetilde{\xi}(\mu,\omega,\nu+t(\nu_{\mathrm{min}}-\nu),\tau,\lambda_{\rm 01},\alpha_{\rm 01})</math> can be derived, and substituted into the relationship <math display="inline">\widetilde{\nu} = \widetilde{\xi}(\mu,\omega,\nu,\tau,\lambda_{\rm 01},\alpha_{\rm 01}) - (\widetilde{\mu}(\mu,\omega,\nu,\tau,\lambda_{\rm 01},\alpha_{\rm 01}))^2</math>. The lower bound depends on <math display="inline">\tau</math> and <math display="inline">\nu</math>, and in the <math display="inline">\Omega^{-1}</math> listed, it is slightly above <math display="inline">\nu</math>.<br />
<br />
== Implementation Details ==<br />
<br />
=== Initialization ===<br />
<br />
Since SNNs have a fixed point at zero mean and unit variance for normalized weights ω = ni=1 wi = 0 and τ = ni=1 wi2 = 1 (see above), we initialize SNNs such that these constraints are fulfilled in expectation. We draw the weights from a Gaussian distribution with E(wi) = 0 and variance Var(wi) = 1/n. Uniform and truncated Gaussian distributions with these moments led to networks with similar behavior. The “MSRA initialization” is similar since it uses zero mean and variance 2/n to initialize the weights [14]. The additional factor 2 counters the effect of rectified linear units.<br />
<br />
<br />
As previously explained, SNNs work best when inputs to the network are standardized, and the weights are initialized with mean of 0 and variance of <math display="inline">\frac{1}{n}</math> to help converge to the fixed point <math display="inline">(\mu, \nu) = (0, 1)</math>.<br />
<br />
=== Dropout Technique ===<br />
<br />
The authors reason that regular dropout, randomly setting activations to 0 with probability <math display="inline">1 - q</math>, is not compatible with SELUs. This is because the low variance region in SELUs is at <math display="inline">\lim_{x \rightarrow -\infty} = -\lambda \alpha</math>, not 0. Contrast this with ReLUs, which work well with dropout since they have <math display="inline">\lim_{x \rightarrow -\infty} = 0</math> as the saturation region. Therefore, a new dropout technique for SELUs was needed, termed ''alpha dropout''.<br />
<br />
With alpha dropout, activations are randomly set to <math display="inline">-\lambda\alpha = \alpha'</math>, which for this paper corresponds to the constant <math display="inline">1.7581</math>, with probability <math display="inline">1 - q</math>.<br />
<br />
The updated mean and variance of the activations are now:<br />
\[ \mathrm{E}(xd + \alpha'(1 - d)) = \mu q + \alpha'(1 - q) \] <br />
<br />
and<br />
<br />
\[ \mathrm{Var}(xd + \alpha'(1 - d)) = q((1-q)(\alpha' - \mu)^2 + \nu) \]<br />
<br />
Activations need to be transformed (e.g. scaled) after dropout to maintain the same mean and variance. In regular dropout, conserving the mean and variance correlates to scaling activations by a factor of 1/q while training. To ensure that mean and variance are unchanged after alpha dropout, the authors used an affine transformation <math display="inline">a(xd + \alpha'(1 - d)) + b</math>, and solved for the values of <math display="inline">a</math> and <math display="inline">b</math> to give <math display="inline">a = (\frac{\nu}{q((1-q)(\alpha' - \mu)^2 + \nu)})^{\frac{1}{2}}</math> and <math display="inline">b = \mu - a(q\mu + (1-q)\alpha'))</math>. As the values for <math display="inline">\mu</math> and <math display="inline">\nu</math> are set to <math display="inline">0</math> and <math display="inline">1</math> throughout the paper, these expressions can be simplified into <math display="inline">a = (q + \alpha'^2 q(1-q))^{-\frac{1}{2}}</math> and <math display="inline">b = -(q + \alpha^2 q (1-q))^{-\frac{1}{2}}((1 - q)\alpha')</math>, where <math display="inline">\alpha' \approx 1.7581</math>.<br />
<br />
Empirically, the authors found that dropout rates (1-q) of <math display="inline">0.05</math> or <math display="inline">0.10</math> worked well with SNNs.<br />
<br />
=== Optimizers ===<br />
<br />
Through experiments, the authors found that stochastic gradient descent, momentum, Adadelta and Adamax work well on SNNs. For Adam, configuration parameters <math display="inline">\beta_2 = 0.99</math> and <math display="inline">\epsilon = 0.01</math> were found to be more effective.<br />
<br />
==Experimental Results==<br />
<br />
Three sets of experiments were conducted to compare the performance of SNNs to six other FNN structures and to other machine learning algorithms, such as support vector machines and random forests. The experiments were carried out on (1) 121 UCI Machine Learning Repository datasets, (2) the Tox21 chemical compounds toxicity effects dataset (with 12,000 compounds and 270,000 features), and (3) the HTRU2 dataset of statistics on radio wave signals from pulsar candidates (with 18,000 observations and eight features). In each set of experiment, hyperparameter search was conducted on a validation set to select parameters such as the number of hidden units, number of hidden layers, learning rate, regularization parameter, and dropout rate (see pages 95 - 107 of the supplementary material for exact hyperparameters considered). Whenever models of different setups gave identical results on the validation data, preference was given to the structure with more layers, lower learning rate and higher dropout rate.<br />
<br />
The six FNN structures considered were: (1) FNNs with ReLU activations, no normalization and “Microsoft weight initialization” (MSRA) [5] to control the variance of input signals [5]; (2) FNNs with batch normalization [6], in which normalization is applied to activations of the same mini-batch; (3) FNNs with layer normalization [1], in which normalization is applied on a per layer basis for each training example; (4) FNNs with weight normalization [8], whereby each layer’s weights are normalized by learning the weight’s magnitude and direction instead of the weight vector itself; (5) highway networks, in which layers are not restricted to being sequentially connected [9]; and (6) an FNN-version of residual networks [4], with residual blocks made up of two or three densely connected layers.<br />
<br />
On the Tox21 dataset, the authors demonstrated the self-normalizing effect by comparing the distribution of neural inputs <math display="inline">z</math> at initialization and after 40 epochs of training to that of the standard normal. As Figure 3 show, the distribution of <math display="inline">z</math> remained similar to a normal distribution.<br />
<br />
[[File:snnf3.png|600px]]<br />
<br />
On all three sets of classification tasks, the authors demonstrated that SNN outperformed the other FNN counterparts on accuracy and AUC measures. <br />
<br />
SNN achieves close to the state-of-the-art results on the Tox21 dataset with an 8-layer network. The challenge requires prediction of toxic effects of 12000 chemicals based on their chemical structures. SNN with 8 layers had the best performance.<br />
<br />
[[File:tox21.png|600px]]<br />
<br />
<br />
On UCI datasets with fewer than 1,000 observations, SNNs did not outperform SVMs or random forests in terms of average rank in accuracy, but on datasets with at least 1,000 observations, SNNs showed the best overall performance (average rank of 5.8, compared to 6.1 for support vector machines and 6.6 for random forests). Through hyperparameter tuning, it was also discovered that the average depth of FNNs is 10.8 layers, more than the other FNN architectures tried.<br />
<br />
<br />
<br />
In pulsar predictions with the HTRU dataset SNNs produced a new state-of-the-art AUC by a small margin (achieving an AUC 0.98, averaged over 10 cross-validation folds, versus the previous record of 0.976).<br />
[[File:htru2.png|600px]]<br />
<br />
==Future Work==<br />
<br />
Although not the focus of this paper, the authors also briefly noted that their initial experiments with applying SELUs on relatively simple CNN structures showed promising results, which is not surprising given that ELUs, which do not have the self-normalizing property, has already been shown to work well with CNNs, demonstrating faster convergence than ReLU networks and even pushing the state-of-the-art error rates on CIFAR-100 at the time of publishing in 2015 [3].<br />
<br />
Since the paper was published, SELUs have been adopted by several researchers, not just with FNNs [https://github.com/bioinf-jku/SNNs see link], but also with CNNs, GANs, autoencoders, reinforcement learning and RNNs. In a few cases, researchers for those papers concluded that networks trained with SELUs converged faster than those trained with ReLUs, and that SELUs have the same convergence quality as batch normalization. There is potential for SELUs to be incorporated into more architectures in the future.<br />
<br />
==Critique==<br />
<br />
Overall, the authors presented a convincing case for using SELUs (along with proper initialization and alpha dropout) on FNNs. FNNs trained with SELU have more layers than those with other normalization techniques, so the work here provides a promising direction for making traditional FNNs more powerful. There are not as many well-established benchmark datasets to evaluate FNNs, but the experiments carried out, particularly on the larger Tox21 dataset, showed that SNNs can be very effective at classification tasks.<br />
<br />
The proofs provide a satisfactory explanation for why SELUs have a self-normalising property within the specified domain, but during their introduction the authors give 4 criteria that an activation function must satisfy in order to be self-normalising. Those criteria make intuitive sense, but there is a lack of firm justification which creates some confusion. For example, they state that SNNs cannot be derived from tanh units, even though <math> tanh(\lambda x) </math> satisfies all 4 criteria if <math> \lambda </math> is larger than 1. Assuming the authors did not overlook such a simple modification, there must be some additional criteria for an activation function to have a self-normalising property.<br />
<br />
The only question I have with the proofs is the lack of explanation for how the domains, <math display="inline">\Omega</math>, <math display="inline">\Omega^-</math> and <math display="inline">\Omega^+</math> are determined, which is an important consideration because they are used for deriving the upper and lower bounds on expressions needed for proving the three theorems. The ranges appear somewhat set through trial-and-error and heuristics to ensure the numbers work out (e.g. make the spectral norm [12] of <math display="inline">\mathcal{J}</math> as large as can be below 1 so as to ensure <math display="inline">g</math> is a contraction mapping), so it is not clear whether they are unique conditions, or that the parameters will remain within those prespecified ranges throughout training; and if the parameters can stray away from the ranges provided, then the issue of what will happen to the self-normalizing property was not addressed. Perhaps that is why the authors gave preference to models with a deeper structure and smaller learning rate during experiments to help the parameters stay within their domains. Further, in addition to the hyperparameters considered, it would be helpful to know the final values that went into the best-performing models, for a better understanding of what range of values work better for SNNs empirically.<br />
<br />
==Conclusion==<br />
<br />
The SNN structure proposed in this paper is built on the traditional FNN structure with a few modifications, including the use of SELUs as the activation function (with <math display="inline">\lambda \approx 1.0507</math> and <math display="inline">\alpha \approx 1.6733</math>), alpha dropout, network weights initialized with mean of zero and variance <math display="inline">\frac{1}{n}</math>, and inputs normalized to mean of zero and variance of one. It is simple to implement while being backed up by detailed theory. <br />
<br />
When properly initialized, SELUs will draw neural inputs towards a fixed point of zero mean and unit variance as the activations are propagated through the layers. The self-normalizing property is maintained even when weights deviate from their initial values during training (under mild conditions). When the variance of inputs goes beyond the prespecified range imposed, they are still bounded above and below so SNNs do not suffer from exploding and vanishing gradients. This self-normalizing property allows SNNs to be more robust to perturbations in stochastic gradient descent, so deeper structures with better prediction performance can be built. <br />
<br />
In the experiments conducted, the authors demonstrated that SNNs outperformed FNNs trained with other normalization techniques, such as batch, layer and weight normalization, and specialized architectures, such as highway or residual networks, on several classification tasks, including on the UCI Machine Learning Repository datasets. SELUs help in reducing the computation time for normalizing the network relative to RELU+BN and hence is promising. The adoption of SELUs by other researchers also lends credence to the potential for SELUs to be implemented in more neural network architectures.<br />
<br />
==References==<br />
<br />
# Ba, Kiros and Hinton. "Layer Normalization". arXiv:1607.06450. (2016).<br />
# Blinn. "Consider the Lowly 2X2 Matrix." IEEE Computer Graphics and Applications. (1996).<br />
# Clevert, Unterthiner, Hochreiter. "Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)." arXiv: 1511.07289. (2015).<br />
# He, Zhang, Ren and Sun. "Deep Residual Learning for Image Recognition." arXiv:1512.03385. (2015).<br />
# He, Zhang, Ren and Sun. "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification." arXiv:1502.01852. (2015). <br />
# Ioffe and Szegedy. "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariance Shift." arXiv:1502.03167. (2015).<br />
# Klambauer, Unterthiner, Mayr and Hochreiter. "Self-Normalizing Neural Networks." arXiv: 1706.02515. (2017).<br />
# Salimans and Kingma. "Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks." arXiv:1602.07868. (2016).<br />
# Srivastava, Greff and Schmidhuber. "Highway Networks." arXiv:1505.00387 (2015).<br />
# Unterthiner, Mayr, Klambauer and Hochreiter. "Toxicity Prediction Using Deep Learning." arXiv:1503.01445. (2015). <br />
# https://en.wikipedia.org/wiki/Central_limit_theorem <br />
# http://mathworld.wolfram.com/SpectralNorm.html <br />
# https://www.math.umd.edu/~petersd/466/fixedpoint.pdf<br />
<br />
==Online Resources==<br />
https://github.com/bioinf-jku/SNNs (GitHub repository maintained by some of the paper's authors)<br />
<br />
==Footnotes==<br />
<br />
1. Error propagation analysis: The authors performed an error analysis to quantify the potential numerical imprecisions propagated through the numerous operations performed. The potential imprecision <math display="inline">\epsilon</math> was quantified by applying the mean value theorem<br />
<br />
\[ |f(x + \Delta x - f(x)| \leqslant ||\triangledown f(x + t\Delta x|| ||\Delta x|| \textrm{ for } t \in [0, 1]\textrm{.} \] <br />
<br />
The error propagation rules, or <math display="inline">|f(x + \Delta x - f(x)|</math>, was first obtained for simple operations such as addition, subtraction, multiplication, division, square root, exponential function, error function and complementary error function. Them, the error bounds on the compound terms making up <math display="inline">\Delta (S(\mu, \omega, \nu, \tau, \lambda, \alpha)</math> were found by decomposing them into the simpler expressions. If each of the variables have a precision of <math display="inline">\epsilon</math>, then it turns out <math display="inline">S</math> has a precision better than <math display="inline">292\epsilon</math>. For a machine with a precision of <math display="inline">2^{-56}</math>, the rounding error is <math display="inline">\epsilon \approx 10^{-16}</math>, and <math display="inline">292\epsilon < 10^{-13}</math>. In addition, all computations are correct up to 3 ulps (“unit in last place”) for the hardware architectures and GNU C library used, with 1 ulp being the highest precision that can be achieved.<br />
<br />
2. Independence Assumption: The classic definition of central limit theorem requires <math display="inline">x_i</math>’s to be independent and identically distributed, which is not guaranteed to hold true in a neural network layer. However, according to the Lyapunov CLT, the <math display="inline">x_i</math>’s do not need to be identically distributed as long as the <math display="inline">(2 + \delta)</math>th moment exists for the variables and meet the Lyapunov condition for the rate of growth of the sum of the moments [11]. In addition, CLT has also shown to be valid under weak dependence under mixing conditions [11]. Therefore, the authors argue that the central limit theorem can be applied with network inputs.<br />
<br />
3. <math display="inline">\mathcal{H}</math> versus <math display="inline">\mathcal{J}</math> Jacobians: In solving for the largest singular value of the Jacobian <math display="inline">\mathcal{H}</math> for the mapping <math display="inline">g: (\mu, \nu)</math>, the authors first worked with the terms in the Jacobian <math display="inline">\mathcal{J}</math> for the mapping <math display="inline">h: (\mu, \nu) \rightarrow (\widetilde{\mu}, \widetilde{\xi})</math> instead, because the influence of <math display="inline">\widetilde{\mu}</math> on <math display="inline">\widetilde{\nu}</math> is small when <math display="inline">\widetilde{\mu}</math> is small in <math display="inline">\Omega</math> and <math display="inline">\mathcal{H}</math> can be easily expressed as terms in <math display="inline">\mathcal{J}</math>. <math display="inline">\mathcal{J}</math> was referenced in the paper, but I used <math display="inline">\mathcal{H}</math> in the summary here to avoid confusion.</div>F7xiahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Neural_Audio_Synthesis_of_Musical_Notes_with_WaveNet_autoencoders&diff=36438Neural Audio Synthesis of Musical Notes with WaveNet autoencoders2018-04-21T03:05:39Z<p>F7xia: /* Conclusion & Future Directions */</p>
<hr />
<div>= Introduction =<br />
The authors of this paper have pointed out that the method in which most notes are created are hand-designed instruments modifying pitch, velocity and filter parameters to produce the required tone, timbre and dynamics of a sound. The authors suggest that this may be a problem and thus suggest a data-driven approach to audio synthesis. They demonstrate how to generate new types of expressive and realistic instrument sounds using a neural network model instead of using specific arrangements of oscillators or algorithms for sample playback. The model is capable of learning semantically meaningful hidden representations which can be used as control signals for manipulating tone, timbre, and dynamics during playback. To train such a data expensive model the authors highlight the need for a large dataset much like ImageNet for music. The motivation for this work stems from recent advances in autoregressive models like WaveNet <sup>[[#References|[5]]]</sup> and SampleRNN<sup>[[#References|[6]]]</sup>. These models are effective at modeling short and medium scale (~500ms) signals, but rely on external conditioning for large-term dependencies; the proposed model removes the need for external conditioning.<br />
<br />
= Contributions =<br />
This paper has two main contributions, one theoretical and one empirical: <br />
<br />
=== Theoretical contribution ===<br />
Proposed Wavenet-style autoencoder that learn to encode temural data over a long term audio structures without requiring external conditioning.<br />
<br />
=== Empirical contribution === <br />
Provided NSynth data set. The authors constructed this data set from scratch, which is a a large data set of musical notes inspired by the emerging of large image data sets. This data set servers as a great training/test resource for future works.<br />
<br />
= Models =<br />
<br />
[[File:paper26-figure1-models.png|center]]<br />
<br />
== WaveNet Autoencoder ==<br />
<br />
While the proposed autoencoder structure is very similar to that of WaveNet the authors argue that the algorithm is novel in two ways:<br />
* It is able to attain consistent long-term structure without any external conditioning <br />
* Creating meaningful embedding which can be interpolated between<br />
<br />
In the original WaveNet architecture the authors use a stack of dilated convolutions to predict the next sample of audio given a prior sample. This approach was prone to "babbling" since it did not take into account long-term structure of the audio. In this model the joint probability of generating audio <math>x</math> is:<br />
<br />
\begin{align}<br />
p(x) = \prod_{i=1}^N\{x_i | x_1, … , x_N-1\}<br />
\end{align}<br />
<br />
They authors try to capture long-term structure by passing the raw audio through the encoder to produce an embedding <math>Z = f(x) </math>, and then shifting the input and feeding it into the decoder which reproduces the input. The resulting probability distribution: <br />
<br />
\begin{align}<br />
p(x) = \prod_{i=1}^N\{x_i | x_1, … , x_N-1, f(x) \}<br />
\end{align}<br />
<br />
<br />
A detailed block diagram of the modified WaveNet structure can be seen in figure 1b. This diagram demonstrates the encoder as a 30 layer network in each each node is a ReLU nonlinearity followed by a non-causal dilated convolution. Dilated convolution (aka convolutions with holes) is a type of convolution in which the filter skips input values with a certain step (step size of 1 is equivalent to the standard convolution), effectively allowing the network to operate at a coarser scale compared to traditional convolutional layers and have very large receptive fields. The resulting convolution is 128 channels all feed into another ReLU nonlinearity which is feed into another 1x1 convolution before getting down sampled with average pooling to produce a 16 dimension <math>Z </math> distribution. Each <math>Z </math> encoding is for a specific temporal resolution which the authors of the paper tuned to 32ms. This means that there are 125, 16 dimension <math>Z </math> encodings for each 4 second note present in the NSynth database (1984 embeddings). <br />
Before the <math>Z </math> embedding enters the decoder it is first upsampled to the original audio rate using nearest neighbor interpolation. The embedding then passes through the decoder to recreate the original audio note. The input audio data is first quantized using 8-bit mu-law encoding into 256 possible values, and the output prediction is the softmax over the possible values.<br />
<br />
== Baseline: Spectral Autoencoder ==<br />
Being unable to find an alternative fully deep model which the authors could use to compare to there proposed WaveNet autoencoder to, the authors just made a strong baseline. The baseline algorithm that the authors developed is a spectral autoencoder. The block diagram of its architecture can be seen in figure 1a. The baseline network is 10 layer deep. Each layer has a 4x4 kernels with 2x2 strides followed by a leaky-ReLU (0.1) and batch normalization. The final hidden vector(Z) was set to 1984 to exactly match the hidden vector of the WaveNet autoencoder. <br />
<br />
Given the simple architecture, the authors first attempted to train the baseline on raw waveforms as input, with a mean-squared error cost. This did not work well and showed the problem of the independent Gaussian assumption. Spectral representations from FFT worked better, but had low perceptual quality despite having low MSE cost after training. Training on the log magnitude of the power spectra, normalized between 0 and 1, was found to be best correlated with perceptual distortion. The authors also explored several representations of phase, finding that estimating magnitude and using established iterative techniques to reconstruct phase to be most effective. (The technique to reconstruct the phase from the magnitude comes from (Griffin and Lim 1984). It can be summarized as follows. In each iteration, generate a Fourier signal z by taking the Short Time Fourier transform of the current estimate of the complete time-domain signal, and replacing its magnitude component with the known true magnitude. Then find the time-domain signal whose Short Time Fourier transform is closest to z in the least-squares sense. This is the estimate of the complete signal for the next iteration. ) A final heuristic that was used by the authors to increase the accuracy of the baseline was weighting the mean square error (MSE) loss starting at 10 for 0 HZ and decreasing linearly to 1 at 4000 Hz and above. This is valid as the fundamental frequency of most instrument are found at lower frequencies. <br />
<br />
== Training ==<br />
Both the modified WaveNet and the baseline autoencoder used stochastic gradient descent with an Adam optimizer. The authors trained the baseline autoencoder model asynchronously for 1800000 epocs with a batch size of 8 with a learning rate of 1e-4. Where as the WaveNet modules were trained synchronously for 250000 epocs with a batch size of 32 with a decaying learning rate ranging from 2e-4 to 6e-6. Here synchronous training refers to the process of training both the encoder and decoder at the same time.<br />
<br />
= The NSynth Dataset =<br />
To evaluate the WaveNet autoencoder model, the authors' wanted an audio dataset that let them explore the learned embeddings. Musical notes are an ideal setting for this study. Prior to this paper, the existing music datasets included the RWC music database (Goto et al., 2003)<sup>[[#References|[8]]]</sup> and the dataset from Romani Picas et al.<sup>[[#References|[9]]]</sup> However, the authors wanted to develop a larger dataset.<br />
<br />
The NSynth dataset has 306 043 unique musical notes (each have a unique pitch, timbre, envelope) all 4 seconds in length sampled at 16,000 Hz. The data set consists of 1006 different instruments playing on average of 65.4 different pitches across on average 4.75 different velocities. Average pitches and velocities are used as not all instruments, can reach all 88 MIDI frequencies, or the 5 velocities desired by the authors. The dataset has the following split: training set with 289,205 notes, validation set with 12,678 notes, and test set with 4,096 notes.<br />
<br />
Along with each note the authors also included the following annotations:<br />
* Source - The way each sound was produced. There were 3 classes ‘acoustic’, ‘electronic’ and ‘synthetic’.<br />
* Family - The family class of instruments that produced each note. There are 11 classes which include: {‘bass’, ‘brass’, ‘vocal’ ext.}<br />
* Qualities - Sonic qualities about each note<br />
<br />
The full dataset is publicly available here: https://magenta.tensorflow.org/datasets/nsynth as TFRecord files with training and holdout splits.<br />
<br />
[[File:nsynth_table.png | 400px|thumb|center|Full details of the NSynth dataset.]]<br />
<br />
= Evaluation =<br />
<br />
To fully analyze all aspects of WaveNet the authors proposed three evaluations:<br />
* Reconstruction - Both Quantitative and Qualitative analysis were considered<br />
* Interpolation in Timbre and Dynamics<br />
* Entanglement of Pitch and Timbre <br />
<br />
Sound is historically very difficult to quantify from a picture representation as it requires training and expertise to analyze. Even with expertise it can be difficult to complete a full analysis as two very different sounds can look quite similar in their respective pictorial representations. This is why the authors recommend all readers to listen to the created notes which can be found here: https://magenta.tensorflow.org/nsynth.<br />
<br />
However, even when taking this under consideration the authors do pictorially demonstrate differences in the two proposed algorithms along with the original note, as it is hard to publish a paper with sound included. To demonstrate the pictorial difference the authors demonstrate each note using constant-q transform (CQT) which is able to capture the dynamics of timbre along with representing the frequencies of the sound.<br />
<br />
== Reconstruction ==<br />
<br />
[[File:paper27-figure2-reconstruction.png|center]]<br />
<br />
The authors attempted to show magnitude and phase on the same plot above. Instantaneous frequency is the derivative of the phase and the intensity of solid lines is proportional to the log magnitude of the power spectrum. If fharm and an FFT bin are not the same, then there will be a constant phase shift: <br />
<math><br />
\triangle \phi = (f_{bin} − f_{harm}) \dfrac{hopsize}{samplerate}<br />
</math>.<br />
<br />
=== Qualitative Comparison ===<br />
In Figure 2, CQT spectrograms are displayed from 3 different instruments, including the original note spectrograms and the model reconstruction spectrograms. For the model reconstruction spectrograms, a baseline is adopted to compare with WaveNet. Each note contains some noise, a fundamental frequency with a series of harmonics, and a decay. In the Glockenspiel the WaveNet autoencoder is able to reproduce the magnitude, phase of the fundamental frequency (A and C in Figure 2), and the attack (B in Figure 2) of the instrument; Whereas the Baseline autoencoder introduces non existing harmonics (D in Figure 2). The flugelhorn on the other hand, presents the starkest difference between the WaveNet and baseline autoencoders. The WaveNet while not perfect is able to reproduce the verbarto (I and J in Figure 2) across multiple frequencies, which results in a natural sounding note. The baseline not only fails to do this but also adds extra noise (K in Figure 2). The authors do add that the WaveNet produces some strikes (L in Figure 2) however they argue that they are inaudible.<br />
<br />
[[File:paper27-table1.png|center]]<br />
<br />
Mu-law encoding was used in the original WaveNet [https://arxiv.org/pdf/1609.03499.pdf paper] to make the problem "more tractable" compared to raw 16-bit integer values. In that paper, they note that "especially for speech, this non-linear quantization produces a significantly better reconstruction" compared to a linear scheme. This might be expected considering that the mu-law companding transformation was designed to [https://www.cisco.com/c/en/us/support/docs/voice/h323/8123-waveform-coding.html#t4 encode speech]. In this application though, using this encoding creates perceptible distortion that sounds similar to clipping.<br />
<br />
=== Quantitative Comparison ===<br />
For a quantitative comparison the authors trained a separate multi-task classifier to classify a note using given pitch or quality of a note. The results of both the Baseline and the WaveNet where then inputted and attempted to be classified. As seen in table 1 WaveNet significantly outperformed the Baseline in both metrics posting a ~70% increase when only considering pitch.<br />
<br />
== Interpolation in Timbre and Dynamics ==<br />
<br />
[[File:paper27-figure3-interpolation.png|center]]<br />
<br />
For this evaluation the authors reconstructed from linear interpolations in Z space among different instruments and compared these to superimposed position of the original two instruments. Not surprisingly the model fuse aspects of both instruments during the recreation. The authors claim however, that WaveNet produces much more realistic sounding results. <br />
To support their claim the authors the authors point to WaveNet ability to create dynamic mixing of overtone in time, even jumping to higher harmonics (A in Figure 3), capturing the timbre and dynamics of both the bass and flute. This can be once again seen in (B in Figure 3) where Wavenet adds additional harmonics as well as a sub-harmonics to the original flute note.<br />
<br />
== Entanglement of Pitch and Timbre ==<br />
<br />
[[File:paper27-table2.png|center]]<br />
<br />
[[File:paper27-figure4-entanglement.png|center]]<br />
<br />
To study the entanglement between pitch and Z space the authors constructed a classifier which was expected to drop in accuracy if the representation of pitch and timbre is disentangled as it relies heavily on the pitch information. This is clearly demonstrated by the first two rows of table 2 where WaveNet relies more strongly on pitch then the baseline algorithm. The authors provide a more qualitative demonstrating in figure 4. They demonstrate a situation in which a classifier may be confused; a note with pitch of +12 is almost exactly the same as the original apart from an emergence of sub-harmonics.<br />
<br />
Further insight can be gained on the relationship between pitch and timbre by studying the trend amongst the network embeddings among the pitches for specific instruments. This is depicted in figure 5 for several instruments across their entire 88 note range at 127 velocity. It can be noted from the figure that the instruments have unique separation of two or more registers over which the embeddings of notes with different pitches are similar. This is expected since instrumental dynamics and timbre varies dramatically over the range of the instrument.<br />
<br />
= Conclusion & Future Directions =<br />
<br />
This paper presents a Wavelet autoencoder model which is built on top of the WaveNet model and evaluate the model on NSynth dataset. The paper also introduces a new large scale dataset of musical notes: NSynth.<br />
<br />
One significant area which the authors claim great improvement is needed is the large memory constraints required by there algorithm. Due to the large memory requirement the current WaveNet must rely on down sampling thus being unable to fully capture the global context. This is an area where model compression techniques could be beneficial. That is, quantization and pruning could be effective: with 4-bit quantization during the entire process (weights, activations, gradients, error as in the work of Wu et al., 2016<sup>[[#References|[7]]]</sup>), memory requirement could be reduced by at least 8 times. The authors also claim that research using different input representations (instead of mu-law) to minimize distortion is ongoing.<br />
<br />
= Acknowlegments = <br />
A huge thanks to Hans Bernhard with the dataset, Colin Raffel for crucial conversations, and Sageev Oore for thoughtful analysis.<br />
<br />
= Critique = <br />
* Authors have never conducted a human study determining sound similarity between the original, baseline, and WaveNet.<br />
* Architecture is not very novel.<br />
* In order to have a comparison, they set out to create a straight-forward baseline for the neural audio synthesis experiments.<br />
<br />
= Open Source Code =<br />
<br />
Google has released all code related to this paper at the following open source repository: https://github.com/tensorflow/magenta/tree/master/magenta/models/nsynth<br />
<br />
= References =<br />
<br />
# Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D. & Simonyan, K.. (2017). Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders. Proceedings of the 34th International Conference on Machine Learning, in PMLR 70:1068-1077<br />
# Griffin, Daniel, and Jae Lim. "Signal estimation from modified short-time Fourier transform." IEEE Transactions on Acoustics, Speech, and Signal Processing 32.2 (1984): 236-243.<br />
# NSynth: Neural Audio Synthesis. (2017, April 06). Retrieved March 19, 2018, from https://magenta.tensorflow.org/nsynth <br />
# The NSynth Dataset. (2017, April 05). Retrieved March 19, 2018, from https://magenta.tensorflow.org/datasets/nsynth<br />
# Oord, Aaron van den, Nal Kalchbrenner, and Koray Kavukcuoglu. "Pixel recurrent neural networks." arXiv preprint arXiv:1601.06759 (2016).<br />
# Mehri, Soroush, et al. "SampleRNN: An unconditional end-to-end neural audio generation model." arXiv preprint arXiv:1612.07837 (2016).<br />
# Wu, S., Li, G., Chen, F., & Shi, L. (2018). Training and Inference with Integers in Deep Neural Networks. arXiv preprint arXiv:1802.04680.<br />
# Goto, Masataka, Hashiguchi, Hiroki, Nishimura, Takuichi, and Oka, Ryuichi. Rwc music database: Music genre database and musical instrument sound database. 2003.<br />
# Romani Picas, Oriol, Parra Rodriguez, Hector, Dabiri, Dara, Tokuda, Hiroshi, Hariya, Wataru, Oishi, Koji, and Serra, Xavier. A real-time system for measuring sound goodness in instrumental sounds. In Audio Engineering Society Convention 138. Audio Engineering Society, 2015</div>F7xiahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=PointNet%2B%2B:_Deep_Hierarchical_Feature_Learning_on_Point_Sets_in_a_Metric_Space&diff=36432PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space2018-04-21T02:59:59Z<p>F7xia: /* PointNet++ */</p>
<hr />
<div>= Introduction =<br />
This paper builds off of ideas from PointNet (Qi et al., 2017). The name PointNet is derived from the network's input - a point cloud. A point cloud is a set of three dimensional points that each have coordinates <math> (x,y,z) </math>. These coordinates usually represent the surface of an object. For example, a point cloud describing the shape of a torus is shown below.<br />
<br />
[[File:Point cloud torus.gif|thumb|center|Point cloud torus]]<br />
<br />
<br />
Processing point clouds is important in applications such as autonomous driving where point clouds are collected from an onboard LiDAR sensor. These point clouds can then be used for object detection. However, point clouds are challenging to process because:<br />
<br />
# They are unordered. If <math> N </math> is the number of points in a point cloud, then there are <math> N! </math> permutations that the point cloud can be represented.<br />
# The spatial arrangement of the points contains useful information, thus it needs to be encoded.<br />
# The function processing the point cloud needs to be invariant to transformations such as rotation and translations of all points. <br />
<br />
Previously, typical point cloud processing methods handled the challenges of point clouds by transforming the data with a 3D voxel grid or by representing the point cloud with multiple 2D images. When PointNet was introduced, it was novel because it directly took points as its input. PointNet++ improves on PointNet by using a hierarchical method to better capture local structures of the point cloud. <br />
<br />
[[File:point_cloud.png | 400px|thumb|center|Examples of point clouds and their associated task. Classification (left), part segmentation (centre), scene segmentation (right) ]]<br />
<br />
= Related Work =<br />
<br />
This paper provides a new network to understand point clouds. CNNs are the most prominent deep networks which find features in images/videos. However, the convolution process is not applicable to point clouds as point clouds contain unordered set of points with distance metric. Some techniques apply deep learning to unordered sets however they don't account for the distance metric in their model and are sensitive to translation and global normalization. Techniques like volumetric grids and geometric graphs work on the 3D metric space but the problem of non-uniform sampling density hasn't been considered.<br />
<br />
= Review of PointNet =<br />
<br />
The PointNet architecture is shown below. The input of the network is <math> n </math> points, which each have <math> (x,y,z) </math> coordinates. Each point is processed individually through a multi-layer perceptron (MLP). This network creates an encoding for each point; in the diagram, each point is represented by a 1024 dimension vector. Then, using a max pool layer a vector is created that represents the "global signature" of a point cloud. If classification is the task, this global signature is processed by another MLP to compute the classification scores. If segmentation is the task, this global signature is appended to to each point from the "nx64" layer, and these points are processed by a MLP to compute a semantic category score for each point.<br />
<br />
The core idea of the network is to learn a symmetric function on transformed points. Through the T-Nets and the MLP network, a transformation is learned with the hopes of making points invariant to point cloud transformations. Learning a symmetric function solves the challenge imposed by having unordered points; a symmetric function will produce the same value no matter the order of the input. This symmetric function is represented by the max pool layer.<br />
<br />
[[File:pointnet_arch.png | 700px|thumb|center|PointNet architecture. The blue highlighted region is when it is used for classification, and the beige highlighted region is when it is used for segmentation.]]<br />
<br />
= PointNet++ =<br />
<br />
The motivation for PointNet++ is that PointNet does not capture local, fine-grained details. Since PointNet performs a max pool layer over all of its points, information such as the local interaction between points is lost. Max-pooling reduces the dimensionality of the network in a very cheap way (with no parameters), but is only good some tasks since it only provides translatioinal variance. And by translational variance we mean that the same object with slightly change in orientation or posiution might not fire up the neuron that is supposed to recognize that object.<br />
<br />
== Problem Statement ==<br />
<br />
There is a metric space <math> X = (M,d) </math> where <math>d</math> is the metric from a Euclidean space <math>\pmb{\mathbb{R}}^n</math> and <math> M \subseteq \pmb{\mathbb{R}}^n </math> is the set of points. The goal is to learn functions <math>f</math> that take <math>X</math> as the input and produce information of semantic interest about it. In practice, <math>f</math> can often be a classification function that outputs a class label or a segmentation function that outputs a per point label for each member of <math>M</math>.<br />
<br />
== Method ==<br />
<br />
=== High Level Overview ===<br />
[[File:point_net++.png | 700px|thumb|right|PointNet++ architecture]]<br />
<br />
The PointNet++ architecture is shown on the right. The core idea is that a hierarchical architecture is used and at each level of the hierarchy a set of points is processed and abstracted to a new set with less points, i.e.,<br />
<br />
\begin{aligned}<br />
\text{Input at each level: } N \times (d + c) \text{ matrix}<br />
\end{aligned}<br />
<br />
where <math>N</math> is the number of points, <math>d</math> is the coordinate points <math>(x,y,z)</math> and <math>c</math> is the feature representation of each point, and<br />
<br />
\begin{aligned}<br />
\text{Output at each level: } N' \times (d + c') \text{ matrix}<br />
\end{aligned}<br />
<br />
where <math>N'</math> is the new number (smaller) of points and <math>c'</math> is the new feature vector.<br />
<br />
<br />
Each level has three layers: Sampling, Grouping, and PointNet. The Sampling layer selects points that will act as centroids of local regions within the point cloud. The Grouping layer then finds points near these centroids. Lastly, the PointNet layer performs PointNet on each group to encode local information.<br />
<br />
=== Sampling Layer ===<br />
<br />
The input of this layer is a set of points <math>{\{x_1,x_2,...,x_n}\}</math>. The goal of this layer is to select a subset of these points <math>{\{\hat{x}_1, \hat{x}_2,...,\hat{x}_m\}} </math> that will define the centroid of local regions.<br />
<br />
To select these points farthest point sampling is used. This is where <math>\hat{x}_j</math> is the most distant point with regards to <math>{\{\hat{x}_1, \hat{x}_2,...,\hat{x}_{j-1}\}}</math>. This ensures coverage of the entire point cloud opposed to random sampling.<br />
<br />
=== Grouping Layer ===<br />
<br />
The objective of the grouping layer is to form local regions around each centroid by grouping points near the selected centroids. The input is a point set of size <math>N \times (d + c)</math> and the coordinates of the centroids <math>N' \times d</math>. The output is the groups of points within each region <math>N' \times k \times (d+c)</math> where <math>k</math> is the number of points in each region.<br />
<br />
Note that <math>k</math> can vary per group. Later, the PointNet layer creates a feature vector that is the same size for all regions at a hierarchical level.<br />
<br />
To determine which points belong to a group a ball query is used; all points within a radius of the centroid are grouped. This is advantageous over nearest neighbour because it guarantees a fixed region space, which is important when learning local structure.<br />
<br />
=== PointNet Layer ===<br />
<br />
After grouping, PointNet is applied to the points. However, first the coordinates of points in a local region are converted to a local coordinate frame by <math> x_i = x_i - \bar{x}</math> where <math>\bar{x}</math> is the coordinates of the centroid.<br />
<br />
=== Robust Feature Learning under Non-Uniform Sampling Density ===<br />
<br />
The previous description of grouping uses a single scale. This is not optimal because the density varies per section of the point cloud. At each level, it would be better if the PointNet layer was applied to adaptively sized groups depending on the point cloud density.<br />
<br />
The two grouping methods the authors propose are shown in the diagram below. Multi-scale grouping (MSG) applies PointNet at various scales per group. The features from the various scales are concatenated to form a multi-scale feature. To train the network to learn an optimal strategy for combining the multi-scale features, the authors proposed random input dropout, which involves randomly dropping input points with a random probability for each training point set. Each input point has a dropout probability <math>\theta</math>. The authors used a <math>\theta</math> value of 0.95. As shown in the experiments section below, dropout provides robustness to input point density variations. During testing stage all points are used. MSG, however, is computationally expensive because for each region it always applies PointNet at large scale neighborhoods to all points. <br />
<br />
On the other hand, multi-resolution grouping (MRG) is less computationally expensive but still adaptively collects features. As shown in the diagram, features of a region from a certain level is a concatenation of two vectors. The left vector is obtained by applying PointNet to three points, and these three points obtained information from three groups. This vector is then concatenated by a vector that is created by using PointNet on all the points in the level below. The second vector can be weighed more heavily if the first vector contains a sparse amount of points, since the first vector is based on subregions that would be even more sparse and suffer from sampling deficiency. On the other hand, when the density of a local region is high, the first vector can be weighted more heavily as it allows for inspecting at higher resolutions in the lower levels to obtain finer details. <br />
<br />
[[File:grouping.png | 300px|thumb|center|Example of the two ways to perform grouping]]<br />
<br />
== Network Architectures ==<br />
<br />
=== all classification experiments with K categories ===<br />
SA(512, 0.2, [64, 64, 128]) → SA(128, 0.4, [128, 128, 256]) → SA([256, 512, 1024]) → F C(512, 0.5) → F C(256, 0.5) → F C(K)<br />
=== The multi-scale grouping (MSG) network ===<br />
SA(512, [0.1, 0.2, 0.4], [[32, 32, 64], [64, 64, 128], [64, 96, 128]]) → SA(128, [0.2, 0.4, 0.8], [[64, 64, 128], [128, 128, 256], [128, 128, 256]]) → SA([256, 512, 1024]) → F C(512, 0.5) → F C(256, 0.5) → F C(K)<br />
=== The cross level multi-resolution grouping (MRG) network ===<br />
Branch 1: SA(512, 0.2, [64, 64, 128]) → SA(64, 0.4, [128, 128, 256])<br />
Branch 2: SA(512, 0.4, [64, 128, 256]) using r = 0.4 regions of original points <br />
Branch 3: SA(64, 128, 256, 512) using all original points.<br />
Branch 4: SA(256, 512, 1024).<br />
<br />
== Point Cloud Segmentation ==<br />
<br />
If the task is segmentation, the architecture is slightly modified since we want a semantic score for each point. To achieve this, distance-based interpolation and skip-connections are used.<br />
<br />
=== Distance-based Interpolation ===<br />
<br />
Here, point features from <math>N_l \times (d + C)</math> points are propagated to <math>N_{l-1} \times (d + C)</math> points where <math>N_{l-1}</math> is greater than <math>N_l</math>.<br />
<br />
To propagate features an inverse distance weighted average based on <math>k</math> nearest neighbors is used. The <math>p=2</math> and <math>k=3</math>.<br />
<br />
[[File:prop_feature.png | 500px|thumb|center|Feature interpolation during segmentation]]<br />
<br />
=== Skip-connections ===<br />
<br />
In addition, skip connections are used (see the PointNet++ architecture diagram). The features from the the skip layers are concatenated with the interpolated features. Next, a "unit-wise" PointNet is applied, which the authors describe as similar to a one-by-one convolution.<br />
<br />
== Experiments ==<br />
To validate the effectiveness of PointNet++, experiments in three areas were performed - classification in Euclidean metric space, semantic scene labeling, and classification in non-Euclidean space.<br />
<br />
=== Point Set Classification in Euclidean Metric Space ===<br />
<br />
The digit dataset, MNIST, was converted to a 2D point cloud. Pixel intensities were normalized in the range of <math>[0, 1]</math>, and only pixels with intensities larger than 0.5 were considered. The coordinate system was set at the centre of the image. PointNet++ achieved a classification error of 0.51%. The original PointNet had 0.78% classification error. The table below compares these results to the state-of-the-art.<br />
<br />
[[File:mnist_results.png | 300px|thumb|center|MNIST classification results.]]<br />
<br />
In addition, the ModelNet40 dataset was used. This dataset consists of CAD models. Three dimensional point clouds were sampled from mesh surfaces of the ModelNet40 shapes. The classification results from this dataset are shown below. The last row in the table below, "Ours (with normal)" used face normals (normal is the same for the entire face, regardless of the point picked on that face) as additional point features as well as additional points <math>(N = 5000)</math> to boost performance. All these points are normalized to have zero mean and be within one unit ball. The network contains three hierarchical levels with three fully connected layers.<br />
<br />
[[File:modelnet40.png | 300px|thumb|center|ModelNet40 classification results.]]<br />
<br />
An experiment was performed to show how the accuracy was affected by the number of points used. With PointNet++ using multi-scale grouping and dropout, the performance decreased by less than 1% when 1024 test points were reduced to 256. On the other hand, PointNet's performance was impacted by the decrease in points. This is not surprising because the dropout feature of PointNet++ ensures that the model is trained specifically to be robust to loss of points. <br />
<br />
[[File:paper28_fig4_chair.png | 300px|thumb|center|An example showing the reduction of points visually. At 256 points, the points making up the object is very spare, however the accuracy is only reduced by 1%]][[File:num_points_acc.png | 300px|thumb|center|Relationship between accuracy and the number of points used for classification.]]<br />
<br />
=== Semantic Scene Labelling ===<br />
<br />
The ScanNet dataset was used for experiments in semantic scene labelling. This dataset consists of laser scans of indoor scenes where the goal is to predict a semantic label for each point. Example results are shown below.<br />
<br />
[[File:scannet.png | 300px|thumb|center|Example ScanNet semantic segmentation results.]]<br />
<br />
To compare to other methods, the authors convert their point labels to a voxel format, and accuracy is determined on a per voxel basis. The accuracy compared to other methods is shown below.<br />
<br />
[[File:scannet_acc.png | 500px|thumb|center|ScanNet semantic segmentation classification comparison to other methods.]]<br />
<br />
To test how the trained model performed on scans with non-uniform sampling density, virtual scans of ScanNet scenes were synthesized and the network was evaluated on this data. It can be seen from the above figures that SSG performance greatly falls due to the sampling density shift. MRG network, on the other hand, is more robust to the sampling density shift since it is able to automatically switch to features depicting coarser granularity when the sampling is sparse. This proves the effectiveness of the proposed density adaptive layer design.<br />
<br />
=== Classification in Non-Euclidean Metric Space ===<br />
<br />
[[File:shrec.png | 300px|thumb|right|Example of shapes from the SHREC15 dataset.]]<br />
<br />
Lastly, experiments were performed on the SHREC15 dataset. This dataset contains shapes that have different poses. This experiment shows that PointNet++ is able to generalize to non-Euclidean spaces. Results from this dataset are provided below.<br />
<br />
[[File:shrec15_results.png | 500px|thumb|center|Results from the SHREC15 dataset.]]<br />
<br />
=== Feature Visualization ===<br />
The figure below visualizes what is learned by just the first layer kernels of the network. The model is trained on a dataset the mostly consisted of furniture which explains the lines, corners, and planes visible in the visualization. Visualization is performed by creating a voxel grid in space and only aggregating point sets that activate specific neurons the most.<br />
<br />
[[File:26_8.PNG | 800px|thumb|center|Point clouds learned from first layer kernels (red is near, blue is far)]]<br />
<br />
=== Time and Space Complexity ===<br />
To evaluate the time and space complexity between PointNet++ and PointNet the authors recorded the model size and inference time for several point set deep learning approaches. Inference time is measured with a GTX 1080 and batch size 8. The PointNet inference times are significantly better however the model sizes for PointNet++ are comparable. The table below outlines the specifics for each method.<br />
<br />
Worth noting is that ours MSG, while it has good performance in non-uniformly sampled data, it’s twice as expensive as then SSG version due the multi-scale region feature extraction. Compared with MSG,<br />
MRG is more efficient since it uses regions across layers.<br />
<br />
[[File:pointnet_complexity.PNG | 700px|thumb|center|Comparison of model size and inference time between PointNet and PointNet++]]<br />
<br />
== Critique ==<br />
<br />
It seems clear that PointNet is lacking capturing local context between points. PointNet++ seems to be an important extension, but the improvements in the experimental results seem small. Some computational efficiency experiments would have been nice. For example, the processing speed of the network, and the computational efficiency of MRG over MRG.<br />
<br />
It may be useful to note that not all raw point clouds coming from sensors are completely unordered. For example, the points from a LiDAR scanner are ordered by the specific angles of the laser scans, which can be seen as rings in the point cloud (shown in the figure below). The discontinuities along each ring could be used to provide trackable feature information for SLAM algorithms, and would be harder to recover if the point cloud is represented in an unordered manner.<br />
<br />
[[File:point_net_pp_lidar_scan.png | 500px|thumb|center|Example LiDAR point cloud]]<br />
<br />
== Code ==<br />
<br />
Code for PointNet++ can be found at: https://github.com/charlesq34/pointnet2<br />
<br />
= Conclusion =<br />
In this work, we propose PointNet++, a powerful neural network architecture for processing point<br />
sets sampled in a metric space. PointNet++ recursively functions on a nested partitioning of the<br />
input point set, and is effective in learning hierarchical features with respect to the distance metric.<br />
To handle the non uniform point sampling issue, we propose two novel set abstraction layers that<br />
intelligently aggregate multi-scale information according to local point densities. These contributions<br />
enable us to achieve state-of-the-art performance on challenging benchmarks of 3D point clouds.<br />
In the future, it’s worthwhile thinking how to accelerate inference speed of our proposed network<br />
especially for MSG and MRG layers by sharing more computation in each local regions. It’s also<br />
interesting to find applications in higher dimensional metric spaces where CNN based method would<br />
be computationally unfeasible while our method can scale well.<br />
<br />
=Sources=<br />
# Charles R. Qi, Li Yi, Hao Su, Leonidas J. Guibas. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space, 2017<br />
# Charles R. Qi, Hao Su, Kaichun Mo, Leonidas J. Guibas. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation, 2017<br />
# Charles R. Qi, “charlesq34/pointnet2.” GitHub, 25 Feb. 2018, github.com/charlesq34/pointnet2.</div>F7xiahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Tensorized_LSTMs&diff=36426stat946w18/Tensorized LSTMs2018-04-21T02:49:14Z<p>F7xia: /* Related work */</p>
<hr />
<div>= Presented by =<br />
<br />
Chen, Weishi(Edward)<br />
<br />
= Introduction =<br />
<br />
Long Short-Term Memory (LSTM) is a popular approach to boosting the ability of Recurrent Neural Networks to store longer term temporal information. The capacity of an LSTM network can be increased by widening and adding layers. Increasing the width (increasing the number of units in a hidden layer) causes the number of parameters to increase quadratically which in turn increases the time required for model training and evaluation. While increasing the depth by stacking multiple LSTMs increases runtime by a proportional amount. As an alternative He et. al (2017) in ''Wider and Deeper, Cheaper and Faster: Tensorized LSTMs for Sequence Learning'' have proposed a model based on LSTM called the '''Tensorized LSTM''' in which the hidden states are represented by '''tensors''' and updated via a '''cross-layer convolution'''. <br />
<br />
* By increasing the tensor size, the network can be widened efficiently without additional parameters since the parameters are shared across different locations in the tensor<br />
* By delaying the output, the network can be deepened implicitly with little additional run-time since deep computations for each time step are merged into temporal computations of the sequence. <br />
<br />
<br />
Also, the paper presents experiments that were conducted on five challenging sequence learning tasks to show the potential of the proposed model.<br />
<br />
= A Quick Introduction to RNN and LSTM =<br />
<br />
We consider the time-series prediction task of producing a desired output <math>y_t</math> at each time-step t∈ {1, ..., T} given an observed input sequence <math>x_{1:t} = {x_1,x_2, ···, x_t}</math>, where <math>x_t∈R^R</math> and <math>y_t∈R^S</math> are vectors. RNNs learn how to use a hidden state vector <math>h_t ∈ R^M</math> to encapsulate the relevant features of the entire input history x1:t (indicates all inputs from the initial time-step to the final step before predication - illustration given below) up to time-step t.<br />
<br />
\begin{align}<br />
h_{t-1}^{cat} = [x_t, h_{t-1}] \hspace{2cm} (1)<br />
\end{align}<br />
<br />
Where <math>h_{t-1}^{cat} ∈R^{R+M}</math> is the concatenation of the current input <math>x_t</math> and the previous hidden state <math>h_{t−1}</math>, which expands the dimensionality of intermediate information.<br />
<br />
The update of the hidden state h_t is defined as:<br />
<br />
\begin{align}<br />
a_{t} =h_{t-1}^{cat} W^h + b^h \hspace{2cm} (2)<br />
\end{align}<br />
<br />
and<br />
<br />
\begin{align}<br />
h_t = \Phi(a_t) \hspace{2cm} (3)<br />
\end{align}<br />
<br />
<math>W^h∈R^{(R+M)\times M} </math> guarantees each hidden state provided by the previous step is of dimension M. <math> a_t ∈R^M </math> is the hidden activation, and φ(·) is the element-wise hyperbolic tangent. Finally, the output <math> y_t </math> at time-step t is generated by:<br />
<br />
\begin{align}<br />
y_t = \varphi(h_{t}^{cat} W^y + b^y) \hspace{2cm} (4)<br />
\end{align}<br />
<br />
where <math>W^y∈R^{M×S}</math> and <math>b^y∈R^S</math>, and <math>\varphi(·)</math> can be any differentiable function. Note that the <math>\phi</math> is a non-linear, element-wise function which generates hidden output, while <math>\varphi</math> generates the final network output.<br />
<br />
[[File:StdRNN.png|650px|center||Figure 1: Recurrent Neural Network]]<br />
<br />
One shortfall of RNN is the problem of vanishing/exploding gradients. This shortfall is significant, especially when modeling long-range dependencies. One alternative is to instead use LSTM (Long Short-Term Memory), which alleviates these problems by employing several gates to selectively modulate the information flow across each neuron. Since LSTMs have been successfully used in sequence models, it is natural to consider them for accommodating more complex analytical needs.<br />
<br />
[[File:LSTM_Gated.png|650px|center||Figure 2: LSTM]]<br />
<br />
= Structural Measurement of Sequential Model =<br />
<br />
We can consider the capacity of a network consists of two components: the '''width''' (the amount of information handled in parallel) and the depth (the number of computation steps). <br />
<br />
A way to '''widen''' the LSTM is to increase the number of units in a hidden layer; however, the parameter number scales quadratically with the number of units. To deepen the LSTM, the popular Stacked LSTM (sLSTM) stacks multiple LSTM layers. The drawback of sLSTM, however, is that runtime is proportional to the number of layers and information from the input is potentially lost (due to gradient vanishing/explosion) as it propagates vertically through the layers. This paper introduced a way to both widen and deepen the LSTM whilst keeping the parameter number and runtime largely unchanged. In summary, we make the following contributions:<br />
<br />
'''(a)''' Tensorize RNN hidden state vectors into higher-dimensional tensors, to enable more flexible parameter sharing and can be widened more efficiently without additional parameters.<br />
<br />
'''(b)''' Based on (a), merge RNN deep computations into its temporal computations so that the network can be deepened with little additional runtime, resulting in a Tensorized RNN (tRNN).<br />
<br />
'''(c)''' We extend the tRNN to an LSTM, namely the Tensorized LSTM (tLSTM), which integrates a novel memory cell convolution to help to prevent the vanishing/exploding gradients.<br />
<br />
= Method =<br />
<br />
== Part 1: Tensorize RNN hidden State vectors ==<br />
<br />
'''Definition:''' Tensorization is defined as the transformation or mapping of lower-order data to higher-order data. For example, the low-order data can be a vector, and the tensorized result is a matrix, a third-order tensor or a higher-order tensor. The ‘low-order’ data can also be a matrix or a third-order tensor, for example. In the latter case, tensorization can take place along one or multiple modes.<br />
<br />
[[File:VecTsor.png|320px|center||Figure 3: Vector Third-order tensorization of a vector]]<br />
<br />
'''Optimization Methodology Part 1:''' It can be seen that in an RNN, the parameter number scales quadratically with the size of the hidden state. A popular way to limit the parameter number when widening the network is to organize parameters as higher-dimensional tensors which can be factorized into lower-rank sub-tensors that contain significantly fewer elements, which is is known as tensor factorization. <br />
<br />
'''Optimization Methodology Part 2:''' Another common way to reduce the parameter number is to share a small set of parameters across different locations in the hidden state, similar to Convolutional Neural Networks (CNNs).<br />
<br />
'''Effects:''' This '''widens''' the network since the hidden state vectors are in fact broadcast to interact with the tensorized parameters. <br />
<br />
<br />
<br />
We adopt parameter sharing to cutdown the parameter number for RNNs, since compared with factorization, it has the following advantages: <br />
<br />
(i) '''Scalability,''' the number of shared parameters can be set independent of the hidden state size<br />
<br />
(ii) '''Separability,''' the information flow can be carefully managed by controlling the receptive field, allowing one to shift RNN deep computations to the temporal domain<br />
<br />
<br />
<br />
We also explicitly tensorize the RNN hidden state vectors, since compared with vectors, tensors have a better: <br />
<br />
(i) '''Flexibility,''' one can specify which dimensions to share parameters and then can just increase the size of those dimensions without introducing additional parameters<br />
<br />
(ii) '''Efficiency,''' with higher-dimensional tensors, the network can be widened faster w.r.t. its depth when fixing the parameter number (explained later). <br />
<br />
<br />
'''Illustration:''' For ease of exposition, we first consider 2D tensors (matrices): we tensorize the hidden state <math>h_t∈R^{M}</math> to become <math>Ht∈R^{P×M}</math>, '''where P is the tensor size,''' and '''M the channel size'''. We locally-connect the first dimension of <math>H_t</math> (which is P - the tensor size) in order to share parameters, and fully-connect the second dimension of <math>H_t</math> (which is M - the channel size) to allow global interactions. This is analogous to the CNN which fully-connects one dimension (e.g., the RGB channel for input images) to globally fuse different feature planes. Also, if one compares <math>H_t</math> to the hidden state of a Stacked RNN (sRNN) (see Figure Blow). <br />
<br />
[[File:Screen_Shot_2018-03-26_at_11.28.37_AM.png|160px|center||Figure 4: Stacked RNN]]<br />
<br />
[[File:ind.png|60px|center||Figure 4: Stacked RNN]]<br />
<br />
Then P is akin to the number of stacked hidden layers (vertical length in the graph), and M the size of each hidden layer (each white node in the graph). We start to describe our model based on 2D tensors, and finally show how to strengthen the model with higher-dimensional tensors.<br />
<br />
== Part 2: Merging Deep Computations ==<br />
<br />
Since an RNN is already deep in its temporal direction, we can deepen an input-to-output computation by associating the input <math>x_t</math> with a (delayed) future output. In doing this, we need to ensure that the output <math>y_t</math> is separable, i.e., not influenced by any future input <math>x_{t^{'}}</math> <math>(t^{'}>t)</math>. Thus, we concatenate the projection of <math>x_t</math> to the top of the previous hidden state <math>H_{t−1}</math>, then gradually shift the input information down when the temporal computation proceeds, and finally generate <math>y_t</math> from the bottom of <math>H_{t+L−1}</math>, where L−1 is the number of delayed time-steps for computations of depth L. <br />
<br />
An example with L= 3 is shown in Figure.<br />
<br />
[[File:tRNN.png|160px|center||Figure 5: skewed sRNN]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: skewed sRNN]]<br />
<br />
<br />
This is in fact a skewed sRNN (or tRNN without feedback). However, the method does not need to change the network structure and also allows different kinds of interactions as long as the output is separable; for example, one can increase the local connections and '''use feedback''' (shown in figure below), which can be beneficial for sRNNs (or tRNN). <br />
<br />
[[File:tRNN_wF.png|160px|center||Figure 5: skewed sRNN with F]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: skewed sRNN with F]]<br />
<br />
'''In order to share parameters, we update <math>H_t</math> using a convolution with a learnable kernel.''' In this manner we increase the complexity of the input-to-output mapping (by delaying outputs) and limit parameter growth (by sharing transition parameters using convolutions).<br />
<br />
To examine the resulting model mathematically, let <math>H^{cat}_{t−1}∈R^{(P+1)×M}</math> be the concatenated hidden state, and <math>p∈Z_+</math> the location at a tensor. The channel vector <math>h^{cat}_{t−1, p }∈R^M</math> at location p of <math>H^{cat}_{t−1}</math> (the p-th channel of H) is defined as:<br />
<br />
\begin{align}<br />
h^{cat}_{t-1, p} = x_t W^x + b^x \hspace{1cm} if p = 1 \hspace{1cm} (5)<br />
\end{align}<br />
<br />
\begin{align}<br />
h^{cat}_{t-1, p} = h_{t-1, p-1} \hspace{1cm} if p > 1 \hspace{1cm} (6)<br />
\end{align}<br />
<br />
where <math>W^x ∈ R^{R×M}</math> and <math>b^x ∈ R^M</math> (recall the dimension of input x is R). Then, the update of tensor <math>H_t</math> is implemented via a convolution:<br />
<br />
\begin{align}<br />
A_t = H^{cat}_{t-1} \circledast \{W^h, b^h \} \hspace{2cm} (7)<br />
\end{align}<br />
<br />
\begin{align}<br />
H_t = \Phi{A_t} \hspace{2cm} (8)<br />
\end{align}<br />
<br />
where <math>W^h∈R^{K×M^i×M^o}</math> is the kernel weight of size K, with <math>M^i =M</math> input channels and <math>M^o =M</math> output channels, <math>b^h ∈ R^{M^o}</math> is the kernel bias, <math>A_t ∈ R^{P×M^o}</math> is the hidden activation, and <math>\circledast</math> is the convolution operator. Since the kernel convolves across different hidden layers, we call it the cross-layer convolution. The kernel enables interaction, both bottom-up and top-down across layers. Finally, we generate <math>y_t</math> from the channel vector <math>h_{t+L−1,P}∈R^M</math> which is located at the bottom of <math>H_{t+L−1}</math>:<br />
<br />
\begin{align}<br />
y_t = \varphi(h_{t+L−1}, _PW^y + b^y) \hspace{2cm} (9)<br />
\end{align}<br />
<br />
Where <math>W^y ∈R^{M×S}</math> and <math>b^y ∈R^S</math>. To guarantee that the receptive field of <math>y_t</math> only covers the current and previous inputs x1:t. (Check the Skewed sRNN again below):<br />
<br />
[[File:tRNN_wF.png|160px|center||Figure 5: skewed sRNN with F]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: skewed sRNN with F]]<br />
<br />
=== Quick Summary of Set of Parameters ===<br />
<br />
'''1. <math> W^x</math> and <math>b_x</math>''' connect input to the first hidden node<br />
<br />
'''2. <math> W^h</math> and <math>b_h</math>''' convolute between layers<br />
<br />
'''3. <math> W^y</math> and <math>b_y</math>''' produce output of each stages<br />
<br />
<br />
== Part 3: Extending to LSTMs==<br />
<br />
Similar to standard RNN, to allow the tRNN (skewed sRNN) to capture long-range temporal dependencies, one can straightforwardly extend it<br />
to a tLSTM by replacing the tRNN tensors:<br />
<br />
\begin{align}<br />
[A^g_t, A^i_t, A^f_t, A^o_t] = H^{cat}_{t-1} \circledast \{W^h, b^h \} \hspace{2cm} (10)<br />
\end{align}<br />
<br />
\begin{align}<br />
[G_t, I_t, F_t, O_t]= [\Phi{(A^g_t)}, σ(A^i_t), σ(A^f_t), σ(A^o_t)] \hspace{2cm} (11)<br />
\end{align}<br />
<br />
Which are pretty similar to tRNN case, the main differences can be observes for memory cells of tLSTM (Ct):<br />
<br />
\begin{align}<br />
C_t= G_t \odot I_t + C_{t-1} \odot F_t \hspace{2cm} (12)<br />
\end{align}<br />
<br />
\begin{align}<br />
H_t= \Phi{(C_t )} \odot O_t \hspace{2cm} (13)<br />
\end{align}<br />
<br />
Note that since the previous memory cell <math>C_{t-1}</math> is only gated along the temporal direction, increasing the tensor size ''P'' might result in the loss of long-range dependencies from the input to the output.<br />
<br />
Summary of the terms: <br />
<br />
1. '''<math>\{W^h, b^h \}</math>:''' Kernel of size K <br />
<br />
2. '''<math>A^g_t, A^i_t, A^f_t, A^o_t \in \mathbb{R}^{P\times M}</math>:''' Activations for the new content <math>G_t</math><br />
<br />
3. '''<math>I_t</math>:''' Input gate<br />
<br />
4. '''<math>F_t</math>:''' Forget gate<br />
<br />
5. '''<math>O_t</math>:''' Output gate<br />
<br />
6. '''<math>C_t \in \mathbb{R}^{P\times M}</math>:''' Memory cell<br />
<br />
Then, see graph below for illustration:<br />
<br />
[[File:tLSTM_wo_MC.png |160px|center||Figure 5: tLSTM wo MC]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: tLSTM wo MC]]<br />
<br />
To further evolve tLSTM, we invoke the '''Memory Cell Convolution''' to capture long-range dependencies from multiple directions, we additionally introduce a novel memory cell convolution, by which the memory cells can have a larger receptive field (figure provided below). <br />
<br />
[[File:tLSTM_w_MC.png |160px|center||Figure 5: tLSTM w MC]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: tLSTM w MC]]<br />
<br />
One can also dynamically generate this convolution kernel so that it is both time - and location-dependent, allowing for flexible control over long-range dependencies from different directions. Mathematically, it can be represented in with the following formulas:<br />
<br />
\begin{align}<br />
[A^g_t, A^i_t, A^f_t, A^o_t, A^q_t] = H^{cat}_{t-1} \circledast \{W^h, b^h \} \hspace{2cm} (14)<br />
\end{align}<br />
<br />
\begin{align}<br />
[G_t, I_t, F_t, O_t, Q_t]= [\Phi{(A^g_t)}, σ(A^i_t), σ(A^f_t), σ(A^o_t), ς(A^q_t)] \hspace{2cm} (15)<br />
\end{align}<br />
<br />
\begin{align}<br />
W_t^c(p) = reshape(q_{t,p}, [K, 1, 1]) \hspace{2cm} (16)<br />
\end{align}<br />
<br />
\begin{align}<br />
C_{t-1}^{conv}= C_{t-1} \circledast W_t^c(p) \hspace{2cm} (17)<br />
\end{align}<br />
<br />
\begin{align}<br />
C_t= G_t \odot I_t + C_{t-1}^{conv} \odot F_t \hspace{2cm} (18)<br />
\end{align}<br />
<br />
\begin{align}<br />
H_t= \Phi{(C_t )} \odot O_t \hspace{2cm} (19)<br />
\end{align}<br />
<br />
where the kernel <math>{W^h, b^h}</math> has additional <K> output channels to generate the activation <math>A^q_t ∈ R^{P×<K>}</math> for the dynamic kernel bank <math>Q_t∈R^{P × <K>}</math>, <math>q_{t,p}∈R^{<K>}</math> is the vectorized adaptive kernel at the location p of <math>Q_t</math>, and <math>W^c_t(p) ∈ R^{K×1×1}</math> is the dynamic kernel of size K with a single input/output channel, which is reshaped from <math>q_{t,p}</math>. Each channel of the previous memory cell <math>C_{t-1}</math> is convolved with <math>W_t^c(p)</math> whose values vary with <math>p</math>, to form a memory cell convolution, which produces a convolved memory cell <math>C_{t-1}^{conv} \in \mathbb{R}^{P\times M}</math>. This convolution is defined by:<br />
<br />
\begin{align}<br />
C_{t-1,p,m}^{conv} = \sum\limits_{k=1}^K C_{t-1,p-\frac{K-1}{2}+k,m} · W_{t,k,1,1}^c(p) \hspace{2cm} (30)<br />
\end{align}<br />
<br />
where <math>C_{t-1}</math> is padded with the boundary values to retain the stored information.<br />
<br />
Note the paper also employed a softmax function ς(·) to normalize the channel dimension of <math>Q_t</math>. which can also stabilize the value of memory cells and help to prevent the vanishing/exploding gradients. An illustration is provided below to better illustrate the process:<br />
<br />
[[File:MCC.png |240px|center||Figure 5: MCC]]<br />
<br />
Theorem 17-18 of Leifert et al. [3] proves the prevention of vanishing/exploding gradients for the lambda gate, which is very similar to the proposed memory cell convolution kernel. The only major differences between the two are the use of softmax for normalization and the sharing the of the kernal for all channels. Since these changes to not affect the assertions made in Theorem 17-18, it can be established that the prevention of vanishing/exploding gradients is a feature of the memory cell convolution kernel as well.<br />
<br />
To improve training, the authors introduced a new normalization technique for ''t''LSTM termed channel normalization (adapted from layer normalization), in which the channel vector are normalized at different locations with their own statistics. Note that layer normalization does not work well with ''t''LSTM, because lower level information is near the input and higher level information is near the output. Channel normalization (CN) is defined as: <br />
<br />
\begin{align}<br />
\mathrm{CN}(\mathbf{Z}; \mathbf{\Gamma}, \mathbf{B}) = \mathbf{\hat{Z}} \odot \mathbf{\Gamma} + \mathbf{B} \hspace{2cm} (20)<br />
\end{align}<br />
<br />
where <math>\mathbf{Z}</math>, <math>\mathbf{\hat{Z}}</math>, <math>\mathbf{\Gamma}</math>, <math>\mathbf{B} \in \mathbb{R}^{P \times M^z}</math> are the original tensor, normalized tensor, gain parameter and bias parameter. The <math>m^z</math>-th channel of <math>\mathbf{Z}</math> is normalized element-wisely: <br />
<br />
\begin{align}<br />
\hat{z_{m^z}} = (z_{m^z} - z^\mu)/z^{\sigma} \hspace{2cm} (21)<br />
\end{align}<br />
<br />
where <math>z^{\mu}</math>, <math>z^{\sigma} \in \mathbb{R}^P</math> are the mean and standard deviation along the channel dimension of <math>\mathbf{Z}</math>, and <math>\hat{z_{m^z}} \in \mathbb{R}^P</math> is the <math>m^z</math>-th channel <math>\mathbf{\hat{Z}}</math>. Channel normalization introduces very few additional parameters compared to the number of other parameters in the model.<br />
<br />
= Results and Evaluation =<br />
<br />
Summary of list of models tLSTM family (may be useful later):<br />
<br />
(a) sLSTM (baseline): the implementation of sLSTM with parameters shared across all layers.<br />
<br />
(b) 2D tLSTM: the standard 2D tLSTM.<br />
<br />
(c) 2D tLSTM–M: removing memory (M) cell convolutions from (b).<br />
<br />
(d) 2D tLSTM–F: removing (–) feedback (F) connections from (b).<br />
<br />
(e) 3D tLSTM: tensorizing (b) into 3D tLSTM.<br />
<br />
(f) 3D tLSTM+LN: applying (+) Layer Normalization.<br />
<br />
(g) 3D tLSTM+CN: applying (+) Channel Normalization.<br />
<br />
=== Efficiency Analysis ===<br />
<br />
'''Fundaments:''' For each configuration, fix the parameter number and increase the tensor size to see if the performance of tLSTM can be boosted without increasing the parameter number. Can also investigate how the runtime is affected by the depth, where the runtime is measured by the average GPU milliseconds spent by a forward and backward pass over one timestep of a single example. <br />
<br />
'''Dataset:''' The Hutter Prize Wikipedia dataset consists of 100 million characters taken from 205 different characters including alphabets, XML markups and special symbols. We model the dataset at the character-level, and try to predict the next character of the input sequence.<br />
<br />
All configurations are evaluated with depths L = 1, 2, 3, 4. Bits-per-character(BPC) is used to measure the model performance and the results are shown in the figure below.<br />
[[File:wiki.png |280px|center||Figure 5: WifiPerf]]<br />
[[File:Wiki_Performance.png |480px|center||Figure 5: WifiPerf]]<br />
<br />
=== Accuracy Analysis ===<br />
<br />
The MNIST dataset [35] consists of 50000/10000/10000 handwritten digit images of size 28×28 for training/validation/test. Two tasks are used for evaluation on this dataset:<br />
<br />
(a) '''Sequential MNIST:''' The goal is to classify the digit after sequentially reading the pixels in a scan-line order. It is therefore a 784 time-step sequence learning task where a single output is produced at the last time-step; the task requires very long range dependencies in the sequence.<br />
<br />
(b) '''Sequential Permuted MNIST:''' We permute the original image pixels in a fixed random order, resulting in a permuted MNIST (pMNIST) problem that has even longer range dependencies across pixels and is harder.<br />
<br />
In both tasks, all configurations are evaluated with M = 100 and L= 1, 3, 5. The model performance is measured by the classification accuracy and results are shown in the figure below.<br />
<br />
[[File:MNISTperf.png |480px|center]]<br />
<br />
<br />
<br />
[[File:Acc_res.png |480px|center||Figure 5: MNIST]]<br />
<br />
[[File:33_mnist.PNG|center|thumb|800px| This figure displays a visualization of the means of the diagonal channels of the tLSTM memory cells per task. The columns indicate the time steps and the rows indicate the diagonal locations. The values are normalized between 0 and 1.]]<br />
<br />
It can be seen in the above figure that tLSTM behaves differently with different tasks:<br />
<br />
- Wikipedia: the input can be carried to the output location with less modification if it is sufficient to determine the next character, and vice versa<br />
<br />
- addition: the first integer is gradually encoded into memories and then interacts (performs addition) with the second integer, producing the sum <br />
<br />
- memorization: the network behaves like a shift register that continues to move the input symbol to the output location at the correct timestep<br />
<br />
- sequential MNIST: the network is more sensitive to the pixel value change (representing the contour, or topology of the digit) and can gradually accumulate evidence for the final prediction <br />
<br />
- sequential pMNIST: the network is sensitive to high value pixels (representing the foreground digit), and we conjecture that this is because the permutation destroys the topology of the digit, making each high value pixel potentially important.<br />
<br />
From the figure above it can can also be observe some common phenomena in all tasks: <br />
# it is clear that wider (larger) tensors can encode more information by observing that at each timestep, the values at different tensor locations are markedly different<br />
# from the input to the output, the values become increasingly distinct and are shifted by time, revealing that deep computations are indeed performed together with temporal computations, with long-range dependencies carried by memory cells.<br />
<br />
= Related work =<br />
=== Convolutional LSTMs ===<br />
<br />
Convolutional LSTMs are proposed to parallelize the computation of LSTMs when the input at each timestep is structured<br />
[[File:clstm.png|150px|center||Figure 1: Example of Convolutional LSTMs]]<br />
<br />
=== Deep LSTMs ===<br />
<br />
= Conclusions =<br />
<br />
The paper introduced the Tensorized LSTM, which employs tensors to share parameters and utilizes the temporal computation to perform the deep computation for sequential tasks. Then validated the model<br />
on a variety of tasks, showing its potential over other popular approaches. The paper shows a method to widen and deepen the LSTM network at the same time and the following 3 points list their main contributions:<br />
* The RNNs are now tensorized into higher dimensional tensors which are more flexible.<br />
* RNNs' deep computation is merged into the temporal computation, referred to as the tensorizedRNN.<br />
* tRNN is extended to a LSTM architecture and a new architecture is studied: tensorizedLSTM<br />
<br />
= Critique(to be edited) =<br />
<br />
* Using tensor as hidden layer indeed increasing the capability of the network, but authors never mentioned the trade-off in terms of extra computation cost and training time.<br />
<br />
= References =<br />
#Zhen He, Shaobing Gao, Liang Xiao, Daxue Liu, Hangen He, and David Barber. <Wider and Deeper, Cheaper and Faster: Tensorized LSTMs for Sequence Learning> (2017)<br />
#Ali Ghodsi, <Deep Learning: STAT 946 - Winter 2018><br />
#Gundram Leifert, Tobias Strauß, Tobias Grüning, Welf Wustlich, and Roger Labahn. Cells in multidimensional recurrent neural networks. JMLR, 17(1):3313–3349, 2016.</div>F7xiahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Tensorized_LSTMs&diff=36425stat946w18/Tensorized LSTMs2018-04-21T02:48:44Z<p>F7xia: /* Convolutional LSTMs */</p>
<hr />
<div>= Presented by =<br />
<br />
Chen, Weishi(Edward)<br />
<br />
= Introduction =<br />
<br />
Long Short-Term Memory (LSTM) is a popular approach to boosting the ability of Recurrent Neural Networks to store longer term temporal information. The capacity of an LSTM network can be increased by widening and adding layers. Increasing the width (increasing the number of units in a hidden layer) causes the number of parameters to increase quadratically which in turn increases the time required for model training and evaluation. While increasing the depth by stacking multiple LSTMs increases runtime by a proportional amount. As an alternative He et. al (2017) in ''Wider and Deeper, Cheaper and Faster: Tensorized LSTMs for Sequence Learning'' have proposed a model based on LSTM called the '''Tensorized LSTM''' in which the hidden states are represented by '''tensors''' and updated via a '''cross-layer convolution'''. <br />
<br />
* By increasing the tensor size, the network can be widened efficiently without additional parameters since the parameters are shared across different locations in the tensor<br />
* By delaying the output, the network can be deepened implicitly with little additional run-time since deep computations for each time step are merged into temporal computations of the sequence. <br />
<br />
<br />
Also, the paper presents experiments that were conducted on five challenging sequence learning tasks to show the potential of the proposed model.<br />
<br />
= A Quick Introduction to RNN and LSTM =<br />
<br />
We consider the time-series prediction task of producing a desired output <math>y_t</math> at each time-step t∈ {1, ..., T} given an observed input sequence <math>x_{1:t} = {x_1,x_2, ···, x_t}</math>, where <math>x_t∈R^R</math> and <math>y_t∈R^S</math> are vectors. RNNs learn how to use a hidden state vector <math>h_t ∈ R^M</math> to encapsulate the relevant features of the entire input history x1:t (indicates all inputs from the initial time-step to the final step before predication - illustration given below) up to time-step t.<br />
<br />
\begin{align}<br />
h_{t-1}^{cat} = [x_t, h_{t-1}] \hspace{2cm} (1)<br />
\end{align}<br />
<br />
Where <math>h_{t-1}^{cat} ∈R^{R+M}</math> is the concatenation of the current input <math>x_t</math> and the previous hidden state <math>h_{t−1}</math>, which expands the dimensionality of intermediate information.<br />
<br />
The update of the hidden state h_t is defined as:<br />
<br />
\begin{align}<br />
a_{t} =h_{t-1}^{cat} W^h + b^h \hspace{2cm} (2)<br />
\end{align}<br />
<br />
and<br />
<br />
\begin{align}<br />
h_t = \Phi(a_t) \hspace{2cm} (3)<br />
\end{align}<br />
<br />
<math>W^h∈R^{(R+M)\times M} </math> guarantees each hidden state provided by the previous step is of dimension M. <math> a_t ∈R^M </math> is the hidden activation, and φ(·) is the element-wise hyperbolic tangent. Finally, the output <math> y_t </math> at time-step t is generated by:<br />
<br />
\begin{align}<br />
y_t = \varphi(h_{t}^{cat} W^y + b^y) \hspace{2cm} (4)<br />
\end{align}<br />
<br />
where <math>W^y∈R^{M×S}</math> and <math>b^y∈R^S</math>, and <math>\varphi(·)</math> can be any differentiable function. Note that the <math>\phi</math> is a non-linear, element-wise function which generates hidden output, while <math>\varphi</math> generates the final network output.<br />
<br />
[[File:StdRNN.png|650px|center||Figure 1: Recurrent Neural Network]]<br />
<br />
One shortfall of RNN is the problem of vanishing/exploding gradients. This shortfall is significant, especially when modeling long-range dependencies. One alternative is to instead use LSTM (Long Short-Term Memory), which alleviates these problems by employing several gates to selectively modulate the information flow across each neuron. Since LSTMs have been successfully used in sequence models, it is natural to consider them for accommodating more complex analytical needs.<br />
<br />
[[File:LSTM_Gated.png|650px|center||Figure 2: LSTM]]<br />
<br />
= Structural Measurement of Sequential Model =<br />
<br />
We can consider the capacity of a network consists of two components: the '''width''' (the amount of information handled in parallel) and the depth (the number of computation steps). <br />
<br />
A way to '''widen''' the LSTM is to increase the number of units in a hidden layer; however, the parameter number scales quadratically with the number of units. To deepen the LSTM, the popular Stacked LSTM (sLSTM) stacks multiple LSTM layers. The drawback of sLSTM, however, is that runtime is proportional to the number of layers and information from the input is potentially lost (due to gradient vanishing/explosion) as it propagates vertically through the layers. This paper introduced a way to both widen and deepen the LSTM whilst keeping the parameter number and runtime largely unchanged. In summary, we make the following contributions:<br />
<br />
'''(a)''' Tensorize RNN hidden state vectors into higher-dimensional tensors, to enable more flexible parameter sharing and can be widened more efficiently without additional parameters.<br />
<br />
'''(b)''' Based on (a), merge RNN deep computations into its temporal computations so that the network can be deepened with little additional runtime, resulting in a Tensorized RNN (tRNN).<br />
<br />
'''(c)''' We extend the tRNN to an LSTM, namely the Tensorized LSTM (tLSTM), which integrates a novel memory cell convolution to help to prevent the vanishing/exploding gradients.<br />
<br />
= Method =<br />
<br />
== Part 1: Tensorize RNN hidden State vectors ==<br />
<br />
'''Definition:''' Tensorization is defined as the transformation or mapping of lower-order data to higher-order data. For example, the low-order data can be a vector, and the tensorized result is a matrix, a third-order tensor or a higher-order tensor. The ‘low-order’ data can also be a matrix or a third-order tensor, for example. In the latter case, tensorization can take place along one or multiple modes.<br />
<br />
[[File:VecTsor.png|320px|center||Figure 3: Vector Third-order tensorization of a vector]]<br />
<br />
'''Optimization Methodology Part 1:''' It can be seen that in an RNN, the parameter number scales quadratically with the size of the hidden state. A popular way to limit the parameter number when widening the network is to organize parameters as higher-dimensional tensors which can be factorized into lower-rank sub-tensors that contain significantly fewer elements, which is is known as tensor factorization. <br />
<br />
'''Optimization Methodology Part 2:''' Another common way to reduce the parameter number is to share a small set of parameters across different locations in the hidden state, similar to Convolutional Neural Networks (CNNs).<br />
<br />
'''Effects:''' This '''widens''' the network since the hidden state vectors are in fact broadcast to interact with the tensorized parameters. <br />
<br />
<br />
<br />
We adopt parameter sharing to cutdown the parameter number for RNNs, since compared with factorization, it has the following advantages: <br />
<br />
(i) '''Scalability,''' the number of shared parameters can be set independent of the hidden state size<br />
<br />
(ii) '''Separability,''' the information flow can be carefully managed by controlling the receptive field, allowing one to shift RNN deep computations to the temporal domain<br />
<br />
<br />
<br />
We also explicitly tensorize the RNN hidden state vectors, since compared with vectors, tensors have a better: <br />
<br />
(i) '''Flexibility,''' one can specify which dimensions to share parameters and then can just increase the size of those dimensions without introducing additional parameters<br />
<br />
(ii) '''Efficiency,''' with higher-dimensional tensors, the network can be widened faster w.r.t. its depth when fixing the parameter number (explained later). <br />
<br />
<br />
'''Illustration:''' For ease of exposition, we first consider 2D tensors (matrices): we tensorize the hidden state <math>h_t∈R^{M}</math> to become <math>Ht∈R^{P×M}</math>, '''where P is the tensor size,''' and '''M the channel size'''. We locally-connect the first dimension of <math>H_t</math> (which is P - the tensor size) in order to share parameters, and fully-connect the second dimension of <math>H_t</math> (which is M - the channel size) to allow global interactions. This is analogous to the CNN which fully-connects one dimension (e.g., the RGB channel for input images) to globally fuse different feature planes. Also, if one compares <math>H_t</math> to the hidden state of a Stacked RNN (sRNN) (see Figure Blow). <br />
<br />
[[File:Screen_Shot_2018-03-26_at_11.28.37_AM.png|160px|center||Figure 4: Stacked RNN]]<br />
<br />
[[File:ind.png|60px|center||Figure 4: Stacked RNN]]<br />
<br />
Then P is akin to the number of stacked hidden layers (vertical length in the graph), and M the size of each hidden layer (each white node in the graph). We start to describe our model based on 2D tensors, and finally show how to strengthen the model with higher-dimensional tensors.<br />
<br />
== Part 2: Merging Deep Computations ==<br />
<br />
Since an RNN is already deep in its temporal direction, we can deepen an input-to-output computation by associating the input <math>x_t</math> with a (delayed) future output. In doing this, we need to ensure that the output <math>y_t</math> is separable, i.e., not influenced by any future input <math>x_{t^{'}}</math> <math>(t^{'}>t)</math>. Thus, we concatenate the projection of <math>x_t</math> to the top of the previous hidden state <math>H_{t−1}</math>, then gradually shift the input information down when the temporal computation proceeds, and finally generate <math>y_t</math> from the bottom of <math>H_{t+L−1}</math>, where L−1 is the number of delayed time-steps for computations of depth L. <br />
<br />
An example with L= 3 is shown in Figure.<br />
<br />
[[File:tRNN.png|160px|center||Figure 5: skewed sRNN]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: skewed sRNN]]<br />
<br />
<br />
This is in fact a skewed sRNN (or tRNN without feedback). However, the method does not need to change the network structure and also allows different kinds of interactions as long as the output is separable; for example, one can increase the local connections and '''use feedback''' (shown in figure below), which can be beneficial for sRNNs (or tRNN). <br />
<br />
[[File:tRNN_wF.png|160px|center||Figure 5: skewed sRNN with F]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: skewed sRNN with F]]<br />
<br />
'''In order to share parameters, we update <math>H_t</math> using a convolution with a learnable kernel.''' In this manner we increase the complexity of the input-to-output mapping (by delaying outputs) and limit parameter growth (by sharing transition parameters using convolutions).<br />
<br />
To examine the resulting model mathematically, let <math>H^{cat}_{t−1}∈R^{(P+1)×M}</math> be the concatenated hidden state, and <math>p∈Z_+</math> the location at a tensor. The channel vector <math>h^{cat}_{t−1, p }∈R^M</math> at location p of <math>H^{cat}_{t−1}</math> (the p-th channel of H) is defined as:<br />
<br />
\begin{align}<br />
h^{cat}_{t-1, p} = x_t W^x + b^x \hspace{1cm} if p = 1 \hspace{1cm} (5)<br />
\end{align}<br />
<br />
\begin{align}<br />
h^{cat}_{t-1, p} = h_{t-1, p-1} \hspace{1cm} if p > 1 \hspace{1cm} (6)<br />
\end{align}<br />
<br />
where <math>W^x ∈ R^{R×M}</math> and <math>b^x ∈ R^M</math> (recall the dimension of input x is R). Then, the update of tensor <math>H_t</math> is implemented via a convolution:<br />
<br />
\begin{align}<br />
A_t = H^{cat}_{t-1} \circledast \{W^h, b^h \} \hspace{2cm} (7)<br />
\end{align}<br />
<br />
\begin{align}<br />
H_t = \Phi{A_t} \hspace{2cm} (8)<br />
\end{align}<br />
<br />
where <math>W^h∈R^{K×M^i×M^o}</math> is the kernel weight of size K, with <math>M^i =M</math> input channels and <math>M^o =M</math> output channels, <math>b^h ∈ R^{M^o}</math> is the kernel bias, <math>A_t ∈ R^{P×M^o}</math> is the hidden activation, and <math>\circledast</math> is the convolution operator. Since the kernel convolves across different hidden layers, we call it the cross-layer convolution. The kernel enables interaction, both bottom-up and top-down across layers. Finally, we generate <math>y_t</math> from the channel vector <math>h_{t+L−1,P}∈R^M</math> which is located at the bottom of <math>H_{t+L−1}</math>:<br />
<br />
\begin{align}<br />
y_t = \varphi(h_{t+L−1}, _PW^y + b^y) \hspace{2cm} (9)<br />
\end{align}<br />
<br />
Where <math>W^y ∈R^{M×S}</math> and <math>b^y ∈R^S</math>. To guarantee that the receptive field of <math>y_t</math> only covers the current and previous inputs x1:t. (Check the Skewed sRNN again below):<br />
<br />
[[File:tRNN_wF.png|160px|center||Figure 5: skewed sRNN with F]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: skewed sRNN with F]]<br />
<br />
=== Quick Summary of Set of Parameters ===<br />
<br />
'''1. <math> W^x</math> and <math>b_x</math>''' connect input to the first hidden node<br />
<br />
'''2. <math> W^h</math> and <math>b_h</math>''' convolute between layers<br />
<br />
'''3. <math> W^y</math> and <math>b_y</math>''' produce output of each stages<br />
<br />
<br />
== Part 3: Extending to LSTMs==<br />
<br />
Similar to standard RNN, to allow the tRNN (skewed sRNN) to capture long-range temporal dependencies, one can straightforwardly extend it<br />
to a tLSTM by replacing the tRNN tensors:<br />
<br />
\begin{align}<br />
[A^g_t, A^i_t, A^f_t, A^o_t] = H^{cat}_{t-1} \circledast \{W^h, b^h \} \hspace{2cm} (10)<br />
\end{align}<br />
<br />
\begin{align}<br />
[G_t, I_t, F_t, O_t]= [\Phi{(A^g_t)}, σ(A^i_t), σ(A^f_t), σ(A^o_t)] \hspace{2cm} (11)<br />
\end{align}<br />
<br />
Which are pretty similar to tRNN case, the main differences can be observes for memory cells of tLSTM (Ct):<br />
<br />
\begin{align}<br />
C_t= G_t \odot I_t + C_{t-1} \odot F_t \hspace{2cm} (12)<br />
\end{align}<br />
<br />
\begin{align}<br />
H_t= \Phi{(C_t )} \odot O_t \hspace{2cm} (13)<br />
\end{align}<br />
<br />
Note that since the previous memory cell <math>C_{t-1}</math> is only gated along the temporal direction, increasing the tensor size ''P'' might result in the loss of long-range dependencies from the input to the output.<br />
<br />
Summary of the terms: <br />
<br />
1. '''<math>\{W^h, b^h \}</math>:''' Kernel of size K <br />
<br />
2. '''<math>A^g_t, A^i_t, A^f_t, A^o_t \in \mathbb{R}^{P\times M}</math>:''' Activations for the new content <math>G_t</math><br />
<br />
3. '''<math>I_t</math>:''' Input gate<br />
<br />
4. '''<math>F_t</math>:''' Forget gate<br />
<br />
5. '''<math>O_t</math>:''' Output gate<br />
<br />
6. '''<math>C_t \in \mathbb{R}^{P\times M}</math>:''' Memory cell<br />
<br />
Then, see graph below for illustration:<br />
<br />
[[File:tLSTM_wo_MC.png |160px|center||Figure 5: tLSTM wo MC]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: tLSTM wo MC]]<br />
<br />
To further evolve tLSTM, we invoke the '''Memory Cell Convolution''' to capture long-range dependencies from multiple directions, we additionally introduce a novel memory cell convolution, by which the memory cells can have a larger receptive field (figure provided below). <br />
<br />
[[File:tLSTM_w_MC.png |160px|center||Figure 5: tLSTM w MC]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: tLSTM w MC]]<br />
<br />
One can also dynamically generate this convolution kernel so that it is both time - and location-dependent, allowing for flexible control over long-range dependencies from different directions. Mathematically, it can be represented in with the following formulas:<br />
<br />
\begin{align}<br />
[A^g_t, A^i_t, A^f_t, A^o_t, A^q_t] = H^{cat}_{t-1} \circledast \{W^h, b^h \} \hspace{2cm} (14)<br />
\end{align}<br />
<br />
\begin{align}<br />
[G_t, I_t, F_t, O_t, Q_t]= [\Phi{(A^g_t)}, σ(A^i_t), σ(A^f_t), σ(A^o_t), ς(A^q_t)] \hspace{2cm} (15)<br />
\end{align}<br />
<br />
\begin{align}<br />
W_t^c(p) = reshape(q_{t,p}, [K, 1, 1]) \hspace{2cm} (16)<br />
\end{align}<br />
<br />
\begin{align}<br />
C_{t-1}^{conv}= C_{t-1} \circledast W_t^c(p) \hspace{2cm} (17)<br />
\end{align}<br />
<br />
\begin{align}<br />
C_t= G_t \odot I_t + C_{t-1}^{conv} \odot F_t \hspace{2cm} (18)<br />
\end{align}<br />
<br />
\begin{align}<br />
H_t= \Phi{(C_t )} \odot O_t \hspace{2cm} (19)<br />
\end{align}<br />
<br />
where the kernel <math>{W^h, b^h}</math> has additional <K> output channels to generate the activation <math>A^q_t ∈ R^{P×<K>}</math> for the dynamic kernel bank <math>Q_t∈R^{P × <K>}</math>, <math>q_{t,p}∈R^{<K>}</math> is the vectorized adaptive kernel at the location p of <math>Q_t</math>, and <math>W^c_t(p) ∈ R^{K×1×1}</math> is the dynamic kernel of size K with a single input/output channel, which is reshaped from <math>q_{t,p}</math>. Each channel of the previous memory cell <math>C_{t-1}</math> is convolved with <math>W_t^c(p)</math> whose values vary with <math>p</math>, to form a memory cell convolution, which produces a convolved memory cell <math>C_{t-1}^{conv} \in \mathbb{R}^{P\times M}</math>. This convolution is defined by:<br />
<br />
\begin{align}<br />
C_{t-1,p,m}^{conv} = \sum\limits_{k=1}^K C_{t-1,p-\frac{K-1}{2}+k,m} · W_{t,k,1,1}^c(p) \hspace{2cm} (30)<br />
\end{align}<br />
<br />
where <math>C_{t-1}</math> is padded with the boundary values to retain the stored information.<br />
<br />
Note the paper also employed a softmax function ς(·) to normalize the channel dimension of <math>Q_t</math>. which can also stabilize the value of memory cells and help to prevent the vanishing/exploding gradients. An illustration is provided below to better illustrate the process:<br />
<br />
[[File:MCC.png |240px|center||Figure 5: MCC]]<br />
<br />
Theorem 17-18 of Leifert et al. [3] proves the prevention of vanishing/exploding gradients for the lambda gate, which is very similar to the proposed memory cell convolution kernel. The only major differences between the two are the use of softmax for normalization and the sharing the of the kernal for all channels. Since these changes to not affect the assertions made in Theorem 17-18, it can be established that the prevention of vanishing/exploding gradients is a feature of the memory cell convolution kernel as well.<br />
<br />
To improve training, the authors introduced a new normalization technique for ''t''LSTM termed channel normalization (adapted from layer normalization), in which the channel vector are normalized at different locations with their own statistics. Note that layer normalization does not work well with ''t''LSTM, because lower level information is near the input and higher level information is near the output. Channel normalization (CN) is defined as: <br />
<br />
\begin{align}<br />
\mathrm{CN}(\mathbf{Z}; \mathbf{\Gamma}, \mathbf{B}) = \mathbf{\hat{Z}} \odot \mathbf{\Gamma} + \mathbf{B} \hspace{2cm} (20)<br />
\end{align}<br />
<br />
where <math>\mathbf{Z}</math>, <math>\mathbf{\hat{Z}}</math>, <math>\mathbf{\Gamma}</math>, <math>\mathbf{B} \in \mathbb{R}^{P \times M^z}</math> are the original tensor, normalized tensor, gain parameter and bias parameter. The <math>m^z</math>-th channel of <math>\mathbf{Z}</math> is normalized element-wisely: <br />
<br />
\begin{align}<br />
\hat{z_{m^z}} = (z_{m^z} - z^\mu)/z^{\sigma} \hspace{2cm} (21)<br />
\end{align}<br />
<br />
where <math>z^{\mu}</math>, <math>z^{\sigma} \in \mathbb{R}^P</math> are the mean and standard deviation along the channel dimension of <math>\mathbf{Z}</math>, and <math>\hat{z_{m^z}} \in \mathbb{R}^P</math> is the <math>m^z</math>-th channel <math>\mathbf{\hat{Z}}</math>. Channel normalization introduces very few additional parameters compared to the number of other parameters in the model.<br />
<br />
= Results and Evaluation =<br />
<br />
Summary of list of models tLSTM family (may be useful later):<br />
<br />
(a) sLSTM (baseline): the implementation of sLSTM with parameters shared across all layers.<br />
<br />
(b) 2D tLSTM: the standard 2D tLSTM.<br />
<br />
(c) 2D tLSTM–M: removing memory (M) cell convolutions from (b).<br />
<br />
(d) 2D tLSTM–F: removing (–) feedback (F) connections from (b).<br />
<br />
(e) 3D tLSTM: tensorizing (b) into 3D tLSTM.<br />
<br />
(f) 3D tLSTM+LN: applying (+) Layer Normalization.<br />
<br />
(g) 3D tLSTM+CN: applying (+) Channel Normalization.<br />
<br />
=== Efficiency Analysis ===<br />
<br />
'''Fundaments:''' For each configuration, fix the parameter number and increase the tensor size to see if the performance of tLSTM can be boosted without increasing the parameter number. Can also investigate how the runtime is affected by the depth, where the runtime is measured by the average GPU milliseconds spent by a forward and backward pass over one timestep of a single example. <br />
<br />
'''Dataset:''' The Hutter Prize Wikipedia dataset consists of 100 million characters taken from 205 different characters including alphabets, XML markups and special symbols. We model the dataset at the character-level, and try to predict the next character of the input sequence.<br />
<br />
All configurations are evaluated with depths L = 1, 2, 3, 4. Bits-per-character(BPC) is used to measure the model performance and the results are shown in the figure below.<br />
[[File:wiki.png |280px|center||Figure 5: WifiPerf]]<br />
[[File:Wiki_Performance.png |480px|center||Figure 5: WifiPerf]]<br />
<br />
=== Accuracy Analysis ===<br />
<br />
The MNIST dataset [35] consists of 50000/10000/10000 handwritten digit images of size 28×28 for training/validation/test. Two tasks are used for evaluation on this dataset:<br />
<br />
(a) '''Sequential MNIST:''' The goal is to classify the digit after sequentially reading the pixels in a scan-line order. It is therefore a 784 time-step sequence learning task where a single output is produced at the last time-step; the task requires very long range dependencies in the sequence.<br />
<br />
(b) '''Sequential Permuted MNIST:''' We permute the original image pixels in a fixed random order, resulting in a permuted MNIST (pMNIST) problem that has even longer range dependencies across pixels and is harder.<br />
<br />
In both tasks, all configurations are evaluated with M = 100 and L= 1, 3, 5. The model performance is measured by the classification accuracy and results are shown in the figure below.<br />
<br />
[[File:MNISTperf.png |480px|center]]<br />
<br />
<br />
<br />
[[File:Acc_res.png |480px|center||Figure 5: MNIST]]<br />
<br />
[[File:33_mnist.PNG|center|thumb|800px| This figure displays a visualization of the means of the diagonal channels of the tLSTM memory cells per task. The columns indicate the time steps and the rows indicate the diagonal locations. The values are normalized between 0 and 1.]]<br />
<br />
It can be seen in the above figure that tLSTM behaves differently with different tasks:<br />
<br />
- Wikipedia: the input can be carried to the output location with less modification if it is sufficient to determine the next character, and vice versa<br />
<br />
- addition: the first integer is gradually encoded into memories and then interacts (performs addition) with the second integer, producing the sum <br />
<br />
- memorization: the network behaves like a shift register that continues to move the input symbol to the output location at the correct timestep<br />
<br />
- sequential MNIST: the network is more sensitive to the pixel value change (representing the contour, or topology of the digit) and can gradually accumulate evidence for the final prediction <br />
<br />
- sequential pMNIST: the network is sensitive to high value pixels (representing the foreground digit), and we conjecture that this is because the permutation destroys the topology of the digit, making each high value pixel potentially important.<br />
<br />
From the figure above it can can also be observe some common phenomena in all tasks: <br />
# it is clear that wider (larger) tensors can encode more information by observing that at each timestep, the values at different tensor locations are markedly different<br />
# from the input to the output, the values become increasingly distinct and are shifted by time, revealing that deep computations are indeed performed together with temporal computations, with long-range dependencies carried by memory cells.<br />
<br />
= Related work =<br />
=== Convolutional LSTMs ===<br />
<br />
Convolutional LSTMs are proposed to parallelize the computation of LSTMs when the input at each timestep is structured<br />
[[File:clstm.png|150px|center||Figure 1: Example of Convolutional LSTMs]]<br />
<br />
=== Deep LSTMs ===<br />
<br />
=== Other Parallelization Methods ===<br />
<br />
= Conclusions =<br />
<br />
The paper introduced the Tensorized LSTM, which employs tensors to share parameters and utilizes the temporal computation to perform the deep computation for sequential tasks. Then validated the model<br />
on a variety of tasks, showing its potential over other popular approaches. The paper shows a method to widen and deepen the LSTM network at the same time and the following 3 points list their main contributions:<br />
* The RNNs are now tensorized into higher dimensional tensors which are more flexible.<br />
* RNNs' deep computation is merged into the temporal computation, referred to as the tensorizedRNN.<br />
* tRNN is extended to a LSTM architecture and a new architecture is studied: tensorizedLSTM<br />
<br />
= Critique(to be edited) =<br />
<br />
* Using tensor as hidden layer indeed increasing the capability of the network, but authors never mentioned the trade-off in terms of extra computation cost and training time.<br />
<br />
= References =<br />
#Zhen He, Shaobing Gao, Liang Xiao, Daxue Liu, Hangen He, and David Barber. <Wider and Deeper, Cheaper and Faster: Tensorized LSTMs for Sequence Learning> (2017)<br />
#Ali Ghodsi, <Deep Learning: STAT 946 - Winter 2018><br />
#Gundram Leifert, Tobias Strauß, Tobias Grüning, Welf Wustlich, and Roger Labahn. Cells in multidimensional recurrent neural networks. JMLR, 17(1):3313–3349, 2016.</div>F7xiahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946w18/Tensorized_LSTMs&diff=36424stat946w18/Tensorized LSTMs2018-04-21T02:48:26Z<p>F7xia: /* Convolutional LSTMs */</p>
<hr />
<div>= Presented by =<br />
<br />
Chen, Weishi(Edward)<br />
<br />
= Introduction =<br />
<br />
Long Short-Term Memory (LSTM) is a popular approach to boosting the ability of Recurrent Neural Networks to store longer term temporal information. The capacity of an LSTM network can be increased by widening and adding layers. Increasing the width (increasing the number of units in a hidden layer) causes the number of parameters to increase quadratically which in turn increases the time required for model training and evaluation. While increasing the depth by stacking multiple LSTMs increases runtime by a proportional amount. As an alternative He et. al (2017) in ''Wider and Deeper, Cheaper and Faster: Tensorized LSTMs for Sequence Learning'' have proposed a model based on LSTM called the '''Tensorized LSTM''' in which the hidden states are represented by '''tensors''' and updated via a '''cross-layer convolution'''. <br />
<br />
* By increasing the tensor size, the network can be widened efficiently without additional parameters since the parameters are shared across different locations in the tensor<br />
* By delaying the output, the network can be deepened implicitly with little additional run-time since deep computations for each time step are merged into temporal computations of the sequence. <br />
<br />
<br />
Also, the paper presents experiments that were conducted on five challenging sequence learning tasks to show the potential of the proposed model.<br />
<br />
= A Quick Introduction to RNN and LSTM =<br />
<br />
We consider the time-series prediction task of producing a desired output <math>y_t</math> at each time-step t∈ {1, ..., T} given an observed input sequence <math>x_{1:t} = {x_1,x_2, ···, x_t}</math>, where <math>x_t∈R^R</math> and <math>y_t∈R^S</math> are vectors. RNNs learn how to use a hidden state vector <math>h_t ∈ R^M</math> to encapsulate the relevant features of the entire input history x1:t (indicates all inputs from the initial time-step to the final step before predication - illustration given below) up to time-step t.<br />
<br />
\begin{align}<br />
h_{t-1}^{cat} = [x_t, h_{t-1}] \hspace{2cm} (1)<br />
\end{align}<br />
<br />
Where <math>h_{t-1}^{cat} ∈R^{R+M}</math> is the concatenation of the current input <math>x_t</math> and the previous hidden state <math>h_{t−1}</math>, which expands the dimensionality of intermediate information.<br />
<br />
The update of the hidden state h_t is defined as:<br />
<br />
\begin{align}<br />
a_{t} =h_{t-1}^{cat} W^h + b^h \hspace{2cm} (2)<br />
\end{align}<br />
<br />
and<br />
<br />
\begin{align}<br />
h_t = \Phi(a_t) \hspace{2cm} (3)<br />
\end{align}<br />
<br />
<math>W^h∈R^{(R+M)\times M} </math> guarantees each hidden state provided by the previous step is of dimension M. <math> a_t ∈R^M </math> is the hidden activation, and φ(·) is the element-wise hyperbolic tangent. Finally, the output <math> y_t </math> at time-step t is generated by:<br />
<br />
\begin{align}<br />
y_t = \varphi(h_{t}^{cat} W^y + b^y) \hspace{2cm} (4)<br />
\end{align}<br />
<br />
where <math>W^y∈R^{M×S}</math> and <math>b^y∈R^S</math>, and <math>\varphi(·)</math> can be any differentiable function. Note that the <math>\phi</math> is a non-linear, element-wise function which generates hidden output, while <math>\varphi</math> generates the final network output.<br />
<br />
[[File:StdRNN.png|650px|center||Figure 1: Recurrent Neural Network]]<br />
<br />
One shortfall of RNN is the problem of vanishing/exploding gradients. This shortfall is significant, especially when modeling long-range dependencies. One alternative is to instead use LSTM (Long Short-Term Memory), which alleviates these problems by employing several gates to selectively modulate the information flow across each neuron. Since LSTMs have been successfully used in sequence models, it is natural to consider them for accommodating more complex analytical needs.<br />
<br />
[[File:LSTM_Gated.png|650px|center||Figure 2: LSTM]]<br />
<br />
= Structural Measurement of Sequential Model =<br />
<br />
We can consider the capacity of a network consists of two components: the '''width''' (the amount of information handled in parallel) and the depth (the number of computation steps). <br />
<br />
A way to '''widen''' the LSTM is to increase the number of units in a hidden layer; however, the parameter number scales quadratically with the number of units. To deepen the LSTM, the popular Stacked LSTM (sLSTM) stacks multiple LSTM layers. The drawback of sLSTM, however, is that runtime is proportional to the number of layers and information from the input is potentially lost (due to gradient vanishing/explosion) as it propagates vertically through the layers. This paper introduced a way to both widen and deepen the LSTM whilst keeping the parameter number and runtime largely unchanged. In summary, we make the following contributions:<br />
<br />
'''(a)''' Tensorize RNN hidden state vectors into higher-dimensional tensors, to enable more flexible parameter sharing and can be widened more efficiently without additional parameters.<br />
<br />
'''(b)''' Based on (a), merge RNN deep computations into its temporal computations so that the network can be deepened with little additional runtime, resulting in a Tensorized RNN (tRNN).<br />
<br />
'''(c)''' We extend the tRNN to an LSTM, namely the Tensorized LSTM (tLSTM), which integrates a novel memory cell convolution to help to prevent the vanishing/exploding gradients.<br />
<br />
= Method =<br />
<br />
== Part 1: Tensorize RNN hidden State vectors ==<br />
<br />
'''Definition:''' Tensorization is defined as the transformation or mapping of lower-order data to higher-order data. For example, the low-order data can be a vector, and the tensorized result is a matrix, a third-order tensor or a higher-order tensor. The ‘low-order’ data can also be a matrix or a third-order tensor, for example. In the latter case, tensorization can take place along one or multiple modes.<br />
<br />
[[File:VecTsor.png|320px|center||Figure 3: Vector Third-order tensorization of a vector]]<br />
<br />
'''Optimization Methodology Part 1:''' It can be seen that in an RNN, the parameter number scales quadratically with the size of the hidden state. A popular way to limit the parameter number when widening the network is to organize parameters as higher-dimensional tensors which can be factorized into lower-rank sub-tensors that contain significantly fewer elements, which is is known as tensor factorization. <br />
<br />
'''Optimization Methodology Part 2:''' Another common way to reduce the parameter number is to share a small set of parameters across different locations in the hidden state, similar to Convolutional Neural Networks (CNNs).<br />
<br />
'''Effects:''' This '''widens''' the network since the hidden state vectors are in fact broadcast to interact with the tensorized parameters. <br />
<br />
<br />
<br />
We adopt parameter sharing to cutdown the parameter number for RNNs, since compared with factorization, it has the following advantages: <br />
<br />
(i) '''Scalability,''' the number of shared parameters can be set independent of the hidden state size<br />
<br />
(ii) '''Separability,''' the information flow can be carefully managed by controlling the receptive field, allowing one to shift RNN deep computations to the temporal domain<br />
<br />
<br />
<br />
We also explicitly tensorize the RNN hidden state vectors, since compared with vectors, tensors have a better: <br />
<br />
(i) '''Flexibility,''' one can specify which dimensions to share parameters and then can just increase the size of those dimensions without introducing additional parameters<br />
<br />
(ii) '''Efficiency,''' with higher-dimensional tensors, the network can be widened faster w.r.t. its depth when fixing the parameter number (explained later). <br />
<br />
<br />
'''Illustration:''' For ease of exposition, we first consider 2D tensors (matrices): we tensorize the hidden state <math>h_t∈R^{M}</math> to become <math>Ht∈R^{P×M}</math>, '''where P is the tensor size,''' and '''M the channel size'''. We locally-connect the first dimension of <math>H_t</math> (which is P - the tensor size) in order to share parameters, and fully-connect the second dimension of <math>H_t</math> (which is M - the channel size) to allow global interactions. This is analogous to the CNN which fully-connects one dimension (e.g., the RGB channel for input images) to globally fuse different feature planes. Also, if one compares <math>H_t</math> to the hidden state of a Stacked RNN (sRNN) (see Figure Blow). <br />
<br />
[[File:Screen_Shot_2018-03-26_at_11.28.37_AM.png|160px|center||Figure 4: Stacked RNN]]<br />
<br />
[[File:ind.png|60px|center||Figure 4: Stacked RNN]]<br />
<br />
Then P is akin to the number of stacked hidden layers (vertical length in the graph), and M the size of each hidden layer (each white node in the graph). We start to describe our model based on 2D tensors, and finally show how to strengthen the model with higher-dimensional tensors.<br />
<br />
== Part 2: Merging Deep Computations ==<br />
<br />
Since an RNN is already deep in its temporal direction, we can deepen an input-to-output computation by associating the input <math>x_t</math> with a (delayed) future output. In doing this, we need to ensure that the output <math>y_t</math> is separable, i.e., not influenced by any future input <math>x_{t^{'}}</math> <math>(t^{'}>t)</math>. Thus, we concatenate the projection of <math>x_t</math> to the top of the previous hidden state <math>H_{t−1}</math>, then gradually shift the input information down when the temporal computation proceeds, and finally generate <math>y_t</math> from the bottom of <math>H_{t+L−1}</math>, where L−1 is the number of delayed time-steps for computations of depth L. <br />
<br />
An example with L= 3 is shown in Figure.<br />
<br />
[[File:tRNN.png|160px|center||Figure 5: skewed sRNN]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: skewed sRNN]]<br />
<br />
<br />
This is in fact a skewed sRNN (or tRNN without feedback). However, the method does not need to change the network structure and also allows different kinds of interactions as long as the output is separable; for example, one can increase the local connections and '''use feedback''' (shown in figure below), which can be beneficial for sRNNs (or tRNN). <br />
<br />
[[File:tRNN_wF.png|160px|center||Figure 5: skewed sRNN with F]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: skewed sRNN with F]]<br />
<br />
'''In order to share parameters, we update <math>H_t</math> using a convolution with a learnable kernel.''' In this manner we increase the complexity of the input-to-output mapping (by delaying outputs) and limit parameter growth (by sharing transition parameters using convolutions).<br />
<br />
To examine the resulting model mathematically, let <math>H^{cat}_{t−1}∈R^{(P+1)×M}</math> be the concatenated hidden state, and <math>p∈Z_+</math> the location at a tensor. The channel vector <math>h^{cat}_{t−1, p }∈R^M</math> at location p of <math>H^{cat}_{t−1}</math> (the p-th channel of H) is defined as:<br />
<br />
\begin{align}<br />
h^{cat}_{t-1, p} = x_t W^x + b^x \hspace{1cm} if p = 1 \hspace{1cm} (5)<br />
\end{align}<br />
<br />
\begin{align}<br />
h^{cat}_{t-1, p} = h_{t-1, p-1} \hspace{1cm} if p > 1 \hspace{1cm} (6)<br />
\end{align}<br />
<br />
where <math>W^x ∈ R^{R×M}</math> and <math>b^x ∈ R^M</math> (recall the dimension of input x is R). Then, the update of tensor <math>H_t</math> is implemented via a convolution:<br />
<br />
\begin{align}<br />
A_t = H^{cat}_{t-1} \circledast \{W^h, b^h \} \hspace{2cm} (7)<br />
\end{align}<br />
<br />
\begin{align}<br />
H_t = \Phi{A_t} \hspace{2cm} (8)<br />
\end{align}<br />
<br />
where <math>W^h∈R^{K×M^i×M^o}</math> is the kernel weight of size K, with <math>M^i =M</math> input channels and <math>M^o =M</math> output channels, <math>b^h ∈ R^{M^o}</math> is the kernel bias, <math>A_t ∈ R^{P×M^o}</math> is the hidden activation, and <math>\circledast</math> is the convolution operator. Since the kernel convolves across different hidden layers, we call it the cross-layer convolution. The kernel enables interaction, both bottom-up and top-down across layers. Finally, we generate <math>y_t</math> from the channel vector <math>h_{t+L−1,P}∈R^M</math> which is located at the bottom of <math>H_{t+L−1}</math>:<br />
<br />
\begin{align}<br />
y_t = \varphi(h_{t+L−1}, _PW^y + b^y) \hspace{2cm} (9)<br />
\end{align}<br />
<br />
Where <math>W^y ∈R^{M×S}</math> and <math>b^y ∈R^S</math>. To guarantee that the receptive field of <math>y_t</math> only covers the current and previous inputs x1:t. (Check the Skewed sRNN again below):<br />
<br />
[[File:tRNN_wF.png|160px|center||Figure 5: skewed sRNN with F]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: skewed sRNN with F]]<br />
<br />
=== Quick Summary of Set of Parameters ===<br />
<br />
'''1. <math> W^x</math> and <math>b_x</math>''' connect input to the first hidden node<br />
<br />
'''2. <math> W^h</math> and <math>b_h</math>''' convolute between layers<br />
<br />
'''3. <math> W^y</math> and <math>b_y</math>''' produce output of each stages<br />
<br />
<br />
== Part 3: Extending to LSTMs==<br />
<br />
Similar to standard RNN, to allow the tRNN (skewed sRNN) to capture long-range temporal dependencies, one can straightforwardly extend it<br />
to a tLSTM by replacing the tRNN tensors:<br />
<br />
\begin{align}<br />
[A^g_t, A^i_t, A^f_t, A^o_t] = H^{cat}_{t-1} \circledast \{W^h, b^h \} \hspace{2cm} (10)<br />
\end{align}<br />
<br />
\begin{align}<br />
[G_t, I_t, F_t, O_t]= [\Phi{(A^g_t)}, σ(A^i_t), σ(A^f_t), σ(A^o_t)] \hspace{2cm} (11)<br />
\end{align}<br />
<br />
Which are pretty similar to tRNN case, the main differences can be observes for memory cells of tLSTM (Ct):<br />
<br />
\begin{align}<br />
C_t= G_t \odot I_t + C_{t-1} \odot F_t \hspace{2cm} (12)<br />
\end{align}<br />
<br />
\begin{align}<br />
H_t= \Phi{(C_t )} \odot O_t \hspace{2cm} (13)<br />
\end{align}<br />
<br />
Note that since the previous memory cell <math>C_{t-1}</math> is only gated along the temporal direction, increasing the tensor size ''P'' might result in the loss of long-range dependencies from the input to the output.<br />
<br />
Summary of the terms: <br />
<br />
1. '''<math>\{W^h, b^h \}</math>:''' Kernel of size K <br />
<br />
2. '''<math>A^g_t, A^i_t, A^f_t, A^o_t \in \mathbb{R}^{P\times M}</math>:''' Activations for the new content <math>G_t</math><br />
<br />
3. '''<math>I_t</math>:''' Input gate<br />
<br />
4. '''<math>F_t</math>:''' Forget gate<br />
<br />
5. '''<math>O_t</math>:''' Output gate<br />
<br />
6. '''<math>C_t \in \mathbb{R}^{P\times M}</math>:''' Memory cell<br />
<br />
Then, see graph below for illustration:<br />
<br />
[[File:tLSTM_wo_MC.png |160px|center||Figure 5: tLSTM wo MC]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: tLSTM wo MC]]<br />
<br />
To further evolve tLSTM, we invoke the '''Memory Cell Convolution''' to capture long-range dependencies from multiple directions, we additionally introduce a novel memory cell convolution, by which the memory cells can have a larger receptive field (figure provided below). <br />
<br />
[[File:tLSTM_w_MC.png |160px|center||Figure 5: tLSTM w MC]]<br />
<br />
[[File:ind.png|60px|center||Figure 5: tLSTM w MC]]<br />
<br />
One can also dynamically generate this convolution kernel so that it is both time - and location-dependent, allowing for flexible control over long-range dependencies from different directions. Mathematically, it can be represented in with the following formulas:<br />
<br />
\begin{align}<br />
[A^g_t, A^i_t, A^f_t, A^o_t, A^q_t] = H^{cat}_{t-1} \circledast \{W^h, b^h \} \hspace{2cm} (14)<br />
\end{align}<br />
<br />
\begin{align}<br />
[G_t, I_t, F_t, O_t, Q_t]= [\Phi{(A^g_t)}, σ(A^i_t), σ(A^f_t), σ(A^o_t), ς(A^q_t)] \hspace{2cm} (15)<br />
\end{align}<br />
<br />
\begin{align}<br />
W_t^c(p) = reshape(q_{t,p}, [K, 1, 1]) \hspace{2cm} (16)<br />
\end{align}<br />
<br />
\begin{align}<br />
C_{t-1}^{conv}= C_{t-1} \circledast W_t^c(p) \hspace{2cm} (17)<br />
\end{align}<br />
<br />
\begin{align}<br />
C_t= G_t \odot I_t + C_{t-1}^{conv} \odot F_t \hspace{2cm} (18)<br />
\end{align}<br />
<br />
\begin{align}<br />
H_t= \Phi{(C_t )} \odot O_t \hspace{2cm} (19)<br />
\end{align}<br />
<br />
where the kernel <math>{W^h, b^h}</math> has additional <K> output channels to generate the activation <math>A^q_t ∈ R^{P×<K>}</math> for the dynamic kernel bank <math>Q_t∈R^{P × <K>}</math>, <math>q_{t,p}∈R^{<K>}</math> is the vectorized adaptive kernel at the location p of <math>Q_t</math>, and <math>W^c_t(p) ∈ R^{K×1×1}</math> is the dynamic kernel of size K with a single input/output channel, which is reshaped from <math>q_{t,p}</math>. Each channel of the previous memory cell <math>C_{t-1}</math> is convolved with <math>W_t^c(p)</math> whose values vary with <math>p</math>, to form a memory cell convolution, which produces a convolved memory cell <math>C_{t-1}^{conv} \in \mathbb{R}^{P\times M}</math>. This convolution is defined by:<br />
<br />
\begin{align}<br />
C_{t-1,p,m}^{conv} = \sum\limits_{k=1}^K C_{t-1,p-\frac{K-1}{2}+k,m} · W_{t,k,1,1}^c(p) \hspace{2cm} (30)<br />
\end{align}<br />
<br />
where <math>C_{t-1}</math> is padded with the boundary values to retain the stored information.<br />
<br />
Note the paper also employed a softmax function ς(·) to normalize the channel dimension of <math>Q_t</math>. which can also stabilize the value of memory cells and help to prevent the vanishing/exploding gradients. An illustration is provided below to better illustrate the process:<br />
<br />
[[File:MCC.png |240px|center||Figure 5: MCC]]<br />
<br />
Theorem 17-18 of Leifert et al. [3] proves the prevention of vanishing/exploding gradients for the lambda gate, which is very similar to the proposed memory cell convolution kernel. The only major differences between the two are the use of softmax for normalization and the sharing the of the kernal for all channels. Since these changes to not affect the assertions made in Theorem 17-18, it can be established that the prevention of vanishing/exploding gradients is a feature of the memory cell convolution kernel as well.<br />
<br />
To improve training, the authors introduced a new normalization technique for ''t''LSTM termed channel normalization (adapted from layer normalization), in which the channel vector are normalized at different locations with their own statistics. Note that layer normalization does not work well with ''t''LSTM, because lower level information is near the input and higher level information is near the output. Channel normalization (CN) is defined as: <br />
<br />
\begin{align}<br />
\mathrm{CN}(\mathbf{Z}; \mathbf{\Gamma}, \mathbf{B}) = \mathbf{\hat{Z}} \odot \mathbf{\Gamma} + \mathbf{B} \hspace{2cm} (20)<br />
\end{align}<br />
<br />
where <math>\mathbf{Z}</math>, <math>\mathbf{\hat{Z}}</math>, <math>\mathbf{\Gamma}</math>, <math>\mathbf{B} \in \mathbb{R}^{P \times M^z}</math> are the orig