stat946F18/differentiableplasticity: Difference between revisions
Revision as of 12:57, 18 October 2018
Differentiable Plasticity
Presented by
1. Ganapathi Subramanian, Sriram
Motivation
1. Neural networks, the basis of modern artificial intelligence techniques, are static in terms of architecture. Once a neural network is trained, its architectural components (e.g., network connections) cannot change, so learning effectively stops at the end of training. If a different task needs to be considered, the agent must be trained again from scratch.
2. Plasticity is a characteristic of biological systems such as humans: their neural connections can change over time. This enables lifelong learning and lets biological systems adapt to dynamic changes in the environment with great sample efficiency. The mechanism is called synaptic plasticity and is based on Hebb's rule: if a neuron repeatedly takes part in making another neuron fire, the connection between them is strengthened.
3. Differentiable plasticity is a step in this direction. The behavior of the plastic connections is trained using gradient descent so that previously trained networks can adapt to changing conditions.
Example: Using current state-of-the-art supervised learning, we can train a neural network to recognize the specific letters that it has seen during training. With lifelong learning, the agent can learn about any alphabet, including those it was never exposed to during training.
Objectives
The paper has the following objectives:
1. To tackle the problem of meta-learning (learning to learn).
2. To design neural networks with plastic connections, with special emphasis on keeping them trainable by gradient descent through backpropagation.
3. To use backpropagation to optimize both the base weights and the amount of plasticity in each connection.
4. To demonstrate the performance of such networks on three different and complex domains: pattern memorization, one-shot classification, and reinforcement learning.
Related Work
Previous Approaches to solving this problem:
1) Train standard recurrent neural networks to incorporate past experience in their future responses within each episode. To support learning, the RNN is attached to an external content-addressable memory bank. An attention mechanism within the controller network reads from and writes to the memory bank and thus enables fast memorization.
2) Augment each weight with a plastic component that automatically grows and decays as a function of inputs and outputs. All connections have the same non-trainable plasticity, and only the corresponding weights are trained. Recent approaches have tried fast weights, which augment recurrent networks with fast-changing Hebbian weights and compute activations at each step. The resulting network is strongly biased towards recently seen patterns.
3) Optimize the learning rule itself instead of the connections. A parametrized learning rule is used, and the structure of the network is fixed beforehand.
4) Have all weight updates computed on the fly by the network itself, or by a separate network, at each time step. The advantage is flexibility; the disadvantage is the large learning burden placed on the network.
5) Perform gradient descent via backpropagation during the episode. Meta-learning then consists of training the base network so that it can be fine-tuned using additional gradient descent.
6) For classification tasks, train a separate embedding to discriminate between different classes. Classification is then a comparison between the embeddings of the test and example instances.
Advantages of the trainable synaptic plasticity approach to meta-learning proposed in the paper:
1) Great potential for flexibility. For example, Memory Networks enforce a specific memory storage model in which memories must be embedded in fixed-size vectors and retrieved through some attentional mechanism. In contrast, trainable synaptic plasticity translates into very different forms of memory, the exact implementation of which can be determined by the (trainable) network structure.
2) Fixed-weight recurrent networks require neurons to be used for both storage and computation, which increases the computational burden on the neurons. The approach suggested in the paper avoids this.
3) Non-trainable plasticity networks can exploit network connectivity to store short-term information, but their uniform, non-trainable plasticity imposes a stereotypical behavior on these memories. With trainable synaptic plasticity, the amount and rate of plasticity are actively molded by the mechanism itself, which also allows for more sustained memory.
Model
The formulation proposed in the paper keeps the plastic and non-plastic components of each connection separate, which makes it easy to define multiple Hebbian rules.
Model Components:
1) A connection between any two neurons $i$ and $j$ has both a fixed component and a plastic component.
2) The fixed part is just a traditional connection weight $w_{i,j}$. The plastic part is stored in a Hebbian trace $H_{i,j}$, which varies during a lifetime according to ongoing inputs and outputs.
3) The relative importance of plastic and fixed components in the connection is structurally determined by the plasticity coefficient $\alpha_{i,j}$, which multiplies the Hebbian trace to form the full plastic component of the connection.
The network equations are as follows:
$$ x_j(t) = \sigma\left\{ \sum_{i \in \text{inputs}} \left[ w_{i,j}\, x_i(t-1) + \alpha_{i,j} H_{i,j}(t)\, x_i(t-1) \right] \right\} \qquad (1) $$
$$ H_{i,j}(t+1) = \eta\, x_i(t-1)\, x_j(t) + (1 - \eta)\, H_{i,j}(t) \qquad (2) $$
Here $x_j(t)$ is the output of neuron $j$ at time $t$. Equation 1 gives the activation of neuron $j$: the term $w_{i,j}\, x_i(t-1)$ is the contribution of the fixed component, and the remaining term, $\alpha_{i,j} H_{i,j}(t)\, x_i(t-1)$, is the contribution of the plastic component. $\sigma$ is a nonlinear function. In Equation 2, the Hebbian trace $H_{i,j}$ is updated as a function of ongoing inputs and outputs.
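The two equations can be sketched in plain Python as a single time step of a small fully connected layer. This is only a minimal illustration, not the paper's implementation: the function name `plastic_step`, the nested-list layout, and the choice of `tanh` for $\sigma$ are assumptions made here.

```python
import math

def plastic_step(x_prev, w, alpha, hebb, eta):
    """One time step of a plastic layer (hypothetical helper, for illustration).

    x_prev : list of pre-synaptic activations x_i(t-1)
    w, alpha, hebb : n_in x n_out nested lists holding w_ij, alpha_ij, H_ij(t)
    eta : plasticity learning rate
    Returns (x_new, hebb_new), i.e. x_j(t) and H_ij(t+1).
    """
    n_in, n_out = len(w), len(w[0])
    # Equation 1: each input is weighted by w_ij + alpha_ij * H_ij(t).
    # tanh stands in for the nonlinearity sigma (an assumption).
    x_new = [
        math.tanh(sum((w[i][j] + alpha[i][j] * hebb[i][j]) * x_prev[i]
                      for i in range(n_in)))
        for j in range(n_out)
    ]
    # Equation 2: running average of the product of pre- and post-synaptic activity.
    hebb_new = [
        [eta * x_prev[i] * x_new[j] + (1.0 - eta) * hebb[i][j]
         for j in range(n_out)]
        for i in range(n_in)
    ]
    return x_new, hebb_new

# Usage: two inputs, one output, Hebbian traces starting at zero.
x, hebb = plastic_step(
    x_prev=[1.0, 0.0],
    w=[[0.5], [0.5]],
    alpha=[[1.0], [1.0]],
    hebb=[[0.0], [0.0]],
    eta=0.1,
)
```

In this usage example only the first input is active, so only its trace grows; the trace of the silent input stays at zero.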
From Equation 1, a connection can be fully fixed (if $\alpha_{i,j} = 0$), fully plastic (if $w_{i,j} = 0$), or have both fixed and plastic components.
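One way to see these cases is to read the connection as having a time-varying effective weight. The symbol $w^{\text{eff}}_{i,j}$ is introduced here only for illustration and does not appear in the paper:

$$ w^{\text{eff}}_{i,j}(t) = w_{i,j} + \alpha_{i,j} H_{i,j}(t) $$

Setting $\alpha_{i,j} = 0$ gives $w^{\text{eff}}_{i,j}(t) = w_{i,j}$, which is constant over the lifetime. Setting $w_{i,j} = 0$ gives $w^{\text{eff}}_{i,j}(t) = \alpha_{i,j} H_{i,j}(t)$, which is driven entirely by the Hebbian trace.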
The terms $w_{i,j}$ and $\alpha_{i,j}$ are the structural parameters trained by gradient descent. The parameter $\eta$, which acts as the learning rate of the plasticity, is also optimized as a parameter of the network. After this training, the agent can learn automatically from ongoing experience. In Equation 2, the $(1 - \eta)$ factor makes the Hebbian traces decay to 0 in the absence of input. To avoid this decay, Equation 2 can be replaced by the following rule (Oja's rule):
$$ H_{i,j}(t+1) = H_{i,j}(t) + \eta\, x_j(t) \left( x_i(t-1) - x_j(t)\, H_{i,j}(t) \right) $$
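The difference between the two update rules in the absence of input can be checked numerically for a single connection. The sketch below is illustrative only: the function names, the starting trace of 0.8, and $\eta = 0.1$ are assumptions, and "no input" is modelled as both the pre- and post-synaptic activity being zero.

```python
def decay_update(h, x_pre, x_post, eta):
    """Equation 2: Hebbian term plus decay of the old trace."""
    return eta * x_pre * x_post + (1.0 - eta) * h

def oja_update(h, x_pre, x_post, eta):
    """Replacement rule: the change is gated by post-synaptic activity."""
    return h + eta * x_post * (x_pre - x_post * h)

# Start from the same stored trace, then present 50 silent time steps.
h_decay = h_oja = 0.8
for _ in range(50):
    h_decay = decay_update(h_decay, x_pre=0.0, x_post=0.0, eta=0.1)
    h_oja = oja_update(h_oja, x_pre=0.0, x_post=0.0, eta=0.1)

print(h_decay, h_oja)
```

Under the decay rule the trace shrinks by a factor of $(1 - \eta)$ at every silent step, while under the replacement rule the update term vanishes whenever $x_j(t) = 0$, so the stored trace is left unchanged.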