stat946F18/differentiableplasticity: Difference between revisions

Revision as of 17:49, 18 October 2018

Differentiable Plasticity

Presented by

1. Ganapathi Subramanian, Sriram

Motivation

1. Neural networks, which are the basis of modern artificial intelligence techniques, are static in architecture. Once a neural network is trained, its architectural components (e.g. network connections) cannot be changed, so learning effectively stops after the training step. If a different task needs to be considered, the agent must be trained again from scratch.

2. Plasticity is a characteristic of biological systems such as humans: their network connections can change over time. This enables lifelong learning, so biological systems can adapt to dynamic changes in the environment with great sample efficiency. The mechanism is called synaptic plasticity and is based on Hebb's rule: if a neuron repeatedly takes part in making another neuron fire, the connection between them is strengthened.


3. Differentiable plasticity is a step in this direction. The behavior of the plastic connections is trained using gradient descent so that previously trained networks can adapt to changing conditions.

Example: With current state-of-the-art supervised learning, we can train a neural network to recognize the specific letters that it has seen during training. With lifelong learning, the agent can learn about any alphabet, including those that it was never exposed to during training.

Objectives

The paper has the following objectives:

1. To tackle the problem of meta-learning (learning to learn).

2. To design neural networks with plastic connections, with special emphasis on keeping them trainable by gradient descent through backpropagation.

3. To use backpropagation to optimize both the base weights and the amount of plasticity in each connection.

4. To demonstrate the performance of such networks on three complex and different domains, namely complex pattern memorization, one-shot classification, and reinforcement learning.


Related Work

Previous approaches to solving this problem:

1. Train standard recurrent neural networks to incorporate past experience in their future responses within each episode. To strengthen its learning abilities, the RNN is attached to an external content-addressable memory bank. An attention mechanism within the controller network reads from and writes to the memory bank, which enables fast memorization.

2. Augment each weight with a plastic component that automatically grows and decays as a function of inputs and outputs. All connections have the same non-trainable plasticity, and only the corresponding weights are trained. Recent approaches have tried fast weights, which augment recurrent networks with fast-changing Hebbian weights and compute activations at each step. Such a network is strongly biased towards recently seen patterns.

3. Optimize the learning rule itself instead of the connections. A parametrized learning rule is used, and the structure of the network is fixed beforehand.

4. Have all weight updates computed on the fly by the network itself, or by a separate network, at each time step. The advantage is flexibility; the drawback is the large learning burden placed on the network.

5. Perform gradient descent via backpropagation during the episode. Meta-learning then consists of training the base network so that it can be fine-tuned with additional gradient descent.

6. For classification tasks, train a separate embedding to discriminate between different classes. Classification is then a comparison between the embeddings of the test and example instances.

Advantages of trainable synaptic plasticity as a meta-learning approach, as argued in the paper:

1. Great potential for flexibility. For example, Memory Networks enforce a specific memory storage model in which memories must be embedded in fixed-size vectors and retrieved through some attentional mechanism. In contrast, trainable synaptic plasticity translates into very different forms of memory, the exact implementation of which can be determined by the (trainable) network structure.

2. Fixed-weight recurrent networks require neurons to be used for both storage and computation, which increases the computational burden on the neurons. The approach suggested in the paper avoids this.

3. Non-trainable plasticity networks can exploit network connectivity to store short-term information, but their uniform, non-trainable plasticity imposes a stereotypical behavior on these memories. With trainable synaptic plasticity, the amount and rate of plasticity are actively molded by the mechanism itself, which also allows for more sustained memory.

Model

The formulation proposed in the paper keeps the plastic and non-plastic components of each connection separate, while allowing multiple Hebbian rules to be defined easily.

Model Components:

1. A connection between any two neurons [math]\displaystyle{ i }[/math] and [math]\displaystyle{ j }[/math] has both a fixed component and a plastic component.

2. The fixed part is just a traditional connection weight [math]\displaystyle{ w_{i,j} }[/math]. The plastic part is stored in a Hebbian trace [math]\displaystyle{ H_{i,j} }[/math], which varies during a lifetime according to ongoing inputs and outputs.

3. The relative importance of plastic and fixed components in the connection is structurally determined by the plasticity coefficient [math]\displaystyle{ \alpha_{i,j} }[/math], which multiplies the Hebbian trace to form the full plastic component of the connection.

The network equations are as follows:


[math]\displaystyle{ x_j(t) = \sigma\left\{ \sum_{i \in \text{inputs}} \left[ w_{i,j} x_i(t-1) + \alpha_{i,j} H_{i,j}(t) x_i(t-1) \right] \right\} }[/math]


[math]\displaystyle{ H_{i,j}(t+1) = \eta x_i(t-1) x_j(t) + (1 - \eta) H_{i,j}(t) }[/math]

Here [math]\displaystyle{ x_j(t) }[/math] is the output of neuron [math]\displaystyle{ j }[/math]. The first equation gives the activation of the neuron, where the term with [math]\displaystyle{ w_{i,j} }[/math] is the fixed component and the remaining term ([math]\displaystyle{ \alpha_{i,j} H_{i,j}(t) x_i(t-1) }[/math]) is the plastic component. The function [math]\displaystyle{ \sigma }[/math] is a nonlinearity, which is always chosen to be tanh in this paper. The trace [math]\displaystyle{ H_{i,j} }[/math] in the second equation is updated as a function of ongoing inputs and outputs.

From the first equation above, a connection can be fully fixed if [math]\displaystyle{ \alpha = 0 }[/math], fully plastic if [math]\displaystyle{ w = 0 }[/math], or have both fixed and plastic components.
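The two equations above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the paper's implementation: it reads the activation as tanh applied to the sum over inputs of [w + alpha * H] * x_i(t-1), which matches the description of the fixed and plastic components, and the function name plastic_step, the array shapes, and the scalar eta are assumptions made for this example.

```python
import numpy as np


def plastic_step(x_prev, w, alpha, hebb, eta):
    """One time step of a layer with plastic connections (illustrative sketch).

    x_prev : shape (n_in,), activations x_i(t-1)
    w      : shape (n_in, n_out), fixed weights w_{i,j}
    alpha  : shape (n_in, n_out), plasticity coefficients alpha_{i,j}
    hebb   : shape (n_in, n_out), Hebbian traces H_{i,j}(t)
    eta    : scalar learning rate of the plastic connections
    """
    # Effective weight of each connection: fixed part plus plastic part.
    effective = w + alpha * hebb
    # First equation: x_j(t) = tanh(sum_i effective[i, j] * x_i(t-1)).
    x_new = np.tanh(x_prev @ effective)
    # Second equation: decaying Hebbian trace,
    # H(t+1) = eta * x_i(t-1) * x_j(t) + (1 - eta) * H(t).
    hebb_new = eta * np.outer(x_prev, x_new) + (1.0 - eta) * hebb
    return x_new, hebb_new


# Hypothetical usage with random parameters; traces start at zero.
rng = np.random.default_rng(0)
x_prev = rng.standard_normal(4)
w = rng.standard_normal((4, 3))
alpha = rng.standard_normal((4, 3))
hebb = np.zeros((4, 3))
x_new, hebb = plastic_step(x_prev, w, alpha, hebb, eta=0.1)
```

Setting alpha to zero in this sketch reduces the step to an ordinary tanh layer, and setting w to zero leaves only the plastic term, which mirrors the fully fixed and fully plastic cases described above.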


The terms [math]\displaystyle{ w_{i,j} }[/math] and [math]\displaystyle{ \alpha_{i,j} }[/math] are the structural parameters trained by gradient descent. The learning rate [math]\displaystyle{ \eta }[/math] of the plastic connections is also an optimized parameter of the network. After this training, the agent can learn automatically from ongoing experience. In the second equation above, the [math]\displaystyle{ (1 - \eta) }[/math] decay term makes the Hebbian traces decay to 0 in the absence of input. Another form of the update, known as Oja's rule, avoids this decay:


[math]\displaystyle{ H_{i,j}(t+1) = H_{i,j}(t) + \eta x_j(t)\left[x_i(t-1) - x_j(t)H_{i,j}(t)\right] }[/math]
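The difference between the two trace updates is easiest to see when the network is silent. With zero activity, the decaying update multiplies each trace by (1 - eta) at every step, so after k steps the trace is (1 - eta)^k times its starting value, while the alternative update leaves it unchanged. The sketch below compares the two; the function names decay_update and oja_update and the values used are assumptions chosen for illustration.

```python
import numpy as np


def decay_update(hebb, x_prev, x_new, eta):
    # H(t+1) = eta * x_i(t-1) * x_j(t) + (1 - eta) * H(t)
    return eta * np.outer(x_prev, x_new) + (1.0 - eta) * hebb


def oja_update(hebb, x_prev, x_new, eta):
    # H(t+1) = H(t) + eta * x_j(t) * (x_i(t-1) - x_j(t) * H(t)),
    # broadcast so that rows index i and columns index j.
    return hebb + eta * x_new[None, :] * (x_prev[:, None] - x_new[None, :] * hebb)


# Hypothetical example: a stored trace followed by 10 steps without activity.
eta = 0.1
silent = np.zeros(2)
h_decay = np.full((2, 2), 0.5)
h_oja = np.full((2, 2), 0.5)
for _ in range(10):
    h_decay = decay_update(h_decay, silent, silent, eta)
    h_oja = oja_update(h_oja, silent, silent, eta)

print(h_decay[0, 0])  # shrinks towards 0 as 0.5 * 0.9**10
print(h_oja[0, 0])    # stays at 0.5
```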