Difference between revisions of "STAT946F17/ Learning Important Features Through Propagating Activation Differences"

Revision as of 11:05, 28 October 2017

This is a summary of the ICML 2017 paper [1].


Deep neural networks are often criticized for their "black box" nature, which is a barrier to adoption in applications where interpretability is essential. The "black box" nature also makes it difficult to analyze and improve the structure of the model. This paper presents DeepLIFT, a method that decomposes the output of a neural network on a specific input by backpropagating the contributions of all neurons in the network to every feature of the input. This is a form of sensitivity analysis, and it helps people understand the model better.

Sensitivity Analysis

Sensitivity analysis is a concept from risk management and actuarial science. According to [Investopedia], a sensitivity analysis is a technique used to determine how changes in an independent variable influence a particular dependent variable under given assumptions. This technique is used within specific boundaries that depend on one or more input variables, such as the effect that changes in interest rates have on bond prices.

In our setting, we have a well-trained deep neural network, two high-dimensional input vectors $x_0, x_1$, and the outputs $y_0=f(x_0), y_1=f(x_1)$. Given that $x_1$ is a perturbation of $x_0$, we want to know which element of $x_1 - x_0$ contributes the most to $y_1 - y_0$.

As one can imagine, if $\left| x_1 - x_0 \right|$ is small, the crudest approximation is to calculate

$\left . \frac{\partial y}{\partial x} \right|_{x = x_0} $

and take its largest element in absolute value. This is feasible because back-propagation lets us compute the derivatives layer by layer. However, this method doesn't always work well.

Failures of traditional methods

Failures of traditional (derivative-based) methods fall into several categories:

First, derivatives are local and therefore of limited use for comparative analysis. If the input being compared is far from the reference input, it may not be appropriate to assign contributions based on derivatives. An example is image recognition, where a change in a single pixel has little meaning. When the values in an input are discrete rather than continuous, derivatives are not available at all.

Second, derivatives can mislead the analysis when they behave completely differently in different regions of the output function. For example, on ReLU and (hard) max layers, taking derivatives may lead to the conclusion that changes in the input values make no difference as long as they don't cross the boundary.
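The second failure can be illustrated with a small sketch. The saturating unit `y` below and the chosen reference and input points are illustrative assumptions, not taken from the summary above. The gradient at the actual input is zero for both features, even though the output differs from its reference value by 1.

```python
def relu(s):
    return max(0.0, s)


def y(i1, i2):
    # Hypothetical saturating unit: the output grows with i1 + i2
    # until the sum reaches 1, then stays flat at 1.
    return 1.0 - relu(1.0 - (i1 + i2))


def numerical_gradient(f, point, eps=1e-6):
    # Central finite differences, one coordinate at a time.
    grads = []
    for k in range(len(point)):
        up = list(point)
        down = list(point)
        up[k] += eps
        down[k] -= eps
        grads.append((f(*up) - f(*down)) / (2 * eps))
    return grads


reference = (0.0, 0.0)   # assumed reference input
actual = (1.0, 1.0)      # assumed actual input, deep in the flat region

grad_at_actual = numerical_gradient(y, actual)
delta_y = y(*actual) - y(*reference)

print("gradient at actual input:", grad_at_actual)
print("difference from reference:", delta_y)
```

A gradient-based score would assign zero importance to both inputs here, while a difference-from-reference method has a nonzero $\Delta y$ to distribute.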

DeepLIFT scheme

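DeepLIFT assigns a contribution score $C_{\Delta y / \Delta x_i} = m_{\Delta y / \Delta x_i} \Delta x_i$ to each element $x_i$, where $\Delta x_i$ and $\Delta y$ are the differences of the input and the output from their values at a reference input, so that the contribution scores sum to the output difference: $\sum_{i=1}^N C_{\Delta y / \Delta x_i} = \Delta y$.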
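As a minimal sketch of this summation property, consider a purely linear model $y = \sum_i w_i x_i + b$, for which taking the multiplier of $x_i$ to be $w_i$ is the natural choice. The weights, bias, and inputs below are made-up illustration values.

```python
def linear_model(weights, bias, x):
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def linear_contributions(weights, x, x_ref):
    # Contribution C_i = m_i * Δx_i with multiplier m_i = w_i.
    return [w * (xi - ri) for w, xi, ri in zip(weights, x, x_ref)]


# Illustrative values only.
weights = [2.0, -1.0, 0.5]
bias = 0.25
x_ref = [0.0, 1.0, 2.0]
x = [1.0, 3.0, 4.0]

contribs = linear_contributions(weights, x, x_ref)
delta_y = linear_model(weights, bias, x) - linear_model(weights, bias, x_ref)

print("contributions:", contribs)
print("sum of contributions:", sum(contribs), "delta y:", delta_y)
```

The bias cancels in $\Delta y$, so the contributions of the inputs account for the whole output difference.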

Chain Rule

First, by the appendix of [1], the DeepLIFT multiplier $m$ behaves like a derivative and satisfies the chain rule: if $z = z\left( y(x_1,...,x_N) \right)$ then

$m_{\Delta z / \Delta x_i} = m_{\Delta z / \Delta y} m_{\Delta y / \Delta x_i} $
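The chain rule can be checked numerically on a two-stage toy model. In this sketch the inner stage is linear (multipliers $w_i$) and the outer stage has a single input $y$, so the summation property forces its multiplier to be the ratio $\Delta z / \Delta y$. The outer function `square` and all numbers are hypothetical choices for illustration.

```python
def multiplier_single_input(f, y, y_ref):
    # With one input, summation-to-delta forces m = Δz / Δy (requires Δy != 0).
    return (f(y) - f(y_ref)) / (y - y_ref)


def square(y):
    # Hypothetical outer function z(y).
    return y * y


weights = [3.0, 2.0]
x_ref = [0.0, 0.0]
x = [1.0, 0.5]


def linear(v):
    return sum(w * vi for w, vi in zip(weights, v))


y_ref, y_val = linear(x_ref), linear(x)

m_z_y = multiplier_single_input(square, y_val, y_ref)   # m_{Δz/Δy}
m_z_x = [m_z_y * w for w in weights]                    # chain rule
contribs = [m * (xi - ri) for m, xi, ri in zip(m_z_x, x, x_ref)]
delta_z = square(y_val) - square(y_ref)

print("multipliers:", m_z_x)
print("sum of contributions:", sum(contribs), "delta z:", delta_z)
```

The contributions obtained by chaining the two multipliers still add up to $\Delta z$.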

Linear Rule

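Second, consider a simple neuron $y = f(s)$ where $s = \sum_{i=1}^N w_i x_i$. We separate the positive and negative contributions of each $x_i$ and assign $m_{\Delta s / \Delta x_i}$ by the following rule, called the linear rule. Here $\Delta x_i^{+}$ and $\Delta x_i^{-}$ denote the positive and negative parts of $\Delta x_i$, with $\Delta x_i = \Delta x_i^{+} + \Delta x_i^{-}$. The total positive contribution of the inputs to $s$ is defined as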

$ \begin{align} \Delta s^{+} & = \sum_{i=1}^N 1_{\left\{ w_i \Delta x_i > 0 \right\}} w_i \Delta x_i \\ & = \sum_{i=1}^N \left( 1_{\left\{ w_i > 0 \right\}} w_i \Delta x_{i}^{+} + 1_{\left\{ w_i < 0 \right\}} w_i \Delta x_{i}^{-} \right) \\ \end{align} $

We then assign the multipliers as $ m_{\Delta s^{+} / \Delta x_{i}^{+} } = 1_{\left\{ w_i > 0 \right\}} w_i $ and $ m_{\Delta s^{+} / \Delta x_{i}^{-} } = 1_{\left\{ w_i < 0 \right\}} w_i $. Similarly, with $\Delta s^{-}$ collecting the terms where $w_i \Delta x_i < 0$, we have $ m_{\Delta s^{-} / \Delta x_{i}^{+} } = 1_{\left\{ w_i < 0 \right\}} w_i $ and $ m_{\Delta s^{-} / \Delta x_{i}^{-} } = 1_{\left\{ w_i > 0 \right\}} w_i $. In the case $\Delta x_{i} = 0$, we set $ m_{\Delta s^{\pm} / \Delta x_{i}^{\pm} } = \frac{1}{2} w_i $.
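The linear rule can be sketched in a few lines. The function names, the dictionary keys, and the example weights are assumptions made for this illustration. The multiplier values follow the indicator formulas above.

```python
def split_delta_s(weights, delta_x):
    """Split Δs = Σ w_i Δx_i into its positive part Δs+ and negative part Δs-."""
    ds_pos, ds_neg = 0.0, 0.0
    for w, dx in zip(weights, delta_x):
        term = w * dx
        if term > 0:
            ds_pos += term
        elif term < 0:
            ds_neg += term
    return ds_pos, ds_neg


def linear_rule_multipliers(w, dx):
    """Multipliers of Δs+ and Δs- with respect to Δx_i^+ and Δx_i^- for one input."""
    if dx == 0:
        half = 0.5 * w
        return {"s+/x+": half, "s+/x-": half, "s-/x+": half, "s-/x-": half}
    pos_w = w if w > 0 else 0.0   # 1{w > 0} * w
    neg_w = w if w < 0 else 0.0   # 1{w < 0} * w
    return {"s+/x+": pos_w, "s+/x-": neg_w, "s-/x+": neg_w, "s-/x-": pos_w}


# Illustrative values only.
weights = [2.0, -1.0, 3.0, -4.0]
delta_x = [1.0, 2.0, -1.0, -0.5]

ds_pos, ds_neg = split_delta_s(weights, delta_x)
delta_s = sum(w * dx for w, dx in zip(weights, delta_x))

print("Δs+ =", ds_pos, " Δs- =", ds_neg, " Δs =", delta_s)
print(linear_rule_multipliers(2.0, 1.0))
```

By construction $\Delta s^{+} + \Delta s^{-} = \Delta s$, which is what the later rules rely on.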

Rescale Rule

For the nonlinearity $f(s)$, the simplest option is the rescale rule, which applies the same ratio $\Delta y / \Delta s$ to the positive and negative parts of $\Delta s$:

$ \Delta y = \frac{\Delta y}{\Delta s} \Delta s = \frac{\Delta y}{\Delta s} \left( \Delta s^{+} + \Delta s^{-} \right) $
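A minimal sketch of the rescale rule for a ReLU follows. The helper name `rescale_rule`, the reference value, and the example numbers are assumptions. The sketch simply raises an error when $\Delta s = 0$ instead of handling that case.

```python
def relu(s):
    return max(0.0, s)


def rescale_rule(f, s_ref, ds_pos, ds_neg):
    """Return (m, Δy+, Δy-) where m = Δy / Δs is shared by both parts."""
    ds = ds_pos + ds_neg
    if ds == 0:
        raise ValueError("Δs is zero; the ratio Δy/Δs is undefined in this sketch")
    m = (f(s_ref + ds) - f(s_ref)) / ds
    return m, m * ds_pos, m * ds_neg


# Illustrative values: reference pre-activation 1, Δs+ = 1, Δs- = -3.
m, dy_pos, dy_neg = rescale_rule(relu, 1.0, 1.0, -3.0)
delta_y = relu(1.0 + 1.0 - 3.0) - relu(1.0)

print("multiplier:", m)
print("Δy+ =", dy_pos, " Δy- =", dy_neg, " Δy =", delta_y)
```

Because both parts are scaled by the same ratio, $\Delta y^{+} + \Delta y^{-} = \Delta y$ holds automatically.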

Reveal-Cancel Rule

The rescale rule works for simple functions such as ReLU, but it does not always work well, for example in pooling layers. To address this, the reveal-cancel rule is introduced:

Suppose our reference input is $x_0$ and $s_0 = \sum_{i=1}^N w_i x_{0,i}$ is the sum in $y_0 = f(s_0)$. We define:

$ \Delta y^{+} = \frac{1}{2} \left[ f(s_0 + \Delta s^{+}) - f(s_0) \right] + \frac{1}{2} \left[ f(s_0 + \Delta s^{+} + \Delta s^{-}) - f(s_0 + \Delta s^{-}) \right] $

$ \Delta y^{-} = \frac{1}{2} \left[ f(s_0 + \Delta s^{-}) - f(s_0) \right] + \frac{1}{2} \left[ f(s_0 + \Delta s^{+} + \Delta s^{-}) - f(s_0 + \Delta s^{+}) \right] $

$ m_{\Delta y / \Delta s^{+} } = \Delta y^{+} / \Delta s^{+} , m_{\Delta y / \Delta s^{-} } = \Delta y^{-} / \Delta s^{-} $
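The two formulas above translate directly into code. In this sketch the ReLU, the reference value $s_0 = 0$, and the example numbers are illustrative assumptions. The case is chosen so that $\Delta s^{+}$ and $\Delta s^{-}$ cancel exactly, in which case a single shared ratio $\Delta y / \Delta s$ is not even defined.

```python
def relu(s):
    return max(0.0, s)


def reveal_cancel(f, s_ref, ds_pos, ds_neg):
    """Return (Δy+, Δy-) by averaging the effect of each part over both orderings."""
    dy_pos = 0.5 * (f(s_ref + ds_pos) - f(s_ref)) \
        + 0.5 * (f(s_ref + ds_pos + ds_neg) - f(s_ref + ds_neg))
    dy_neg = 0.5 * (f(s_ref + ds_neg) - f(s_ref)) \
        + 0.5 * (f(s_ref + ds_pos + ds_neg) - f(s_ref + ds_pos))
    return dy_pos, dy_neg


# Illustrative values: positive and negative parts cancel, so Δs = 0.
s_ref, ds_pos, ds_neg = 0.0, 2.0, -2.0
dy_pos, dy_neg = reveal_cancel(relu, s_ref, ds_pos, ds_neg)

m_pos = dy_pos / ds_pos   # m_{Δy/Δs+}
m_neg = dy_neg / ds_neg   # m_{Δy/Δs-}
delta_y = relu(s_ref + ds_pos + ds_neg) - relu(s_ref)

print("Δy+ =", dy_pos, " Δy- =", dy_neg, " Δy =", delta_y)
print("multipliers:", m_pos, m_neg)
```

Although $\Delta y = 0$ here, the rule still reports that the positive part pushed the output up and the negative part pulled it back down, and the two parts sum to $\Delta y$.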

Adjustments for softmax layers

Since a softmax layer normalizes its input, we can define the contribution to the softmax layer $y=\mathrm{softmax}(z)$ as the contribution to its preceding layer $z = z(x)$ minus the average contribution to that preceding layer, where $n$ is the number of softmax inputs $z_1, ..., z_n$:

$ C^{\prime}_{\Delta z_i / \Delta x} = C_{\Delta z_i / \Delta x} - \frac{1}{n} \sum_{j=1}^{n} C_{\Delta z_j / \Delta x} $
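As a small sketch, the adjustment amounts to mean-centering the contributions of one input across the $n$ pre-softmax units. The function name and the example scores are made up for illustration.

```python
def mean_normalize(contribs_to_logits):
    """Subtract the average contribution across the n pre-softmax units."""
    n = len(contribs_to_logits)
    mean = sum(contribs_to_logits) / n
    return [c - mean for c in contribs_to_logits]


# Hypothetical contributions of one input x to three pre-softmax units z_1..z_3.
raw = [3.0, 1.0, -1.0]
adjusted = mean_normalize(raw)

print("adjusted contributions:", adjusted)
```

After the adjustment the contributions of an input sum to zero across classes, so an input that raises every unit equally receives no credit for any single class.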

Given the rules above, it is straightforward to calculate $ m_{\Delta y / \Delta x_i^{\pm} } $ for each $x_i$, and thus the DeepLIFT multiplier $m$ and contribution $C$ of a given input and output relative to a given reference input $x_0$. The authors suggest that the reference input should be chosen case by case; no general rule for choosing $x_0$ is currently available.

Numerical results

The authors performed two main experiments to test whether the importance scores produced by DeepLIFT reflect a correct "understanding" of what the model is doing.

MNIST handwriting re-morphing

Suppose we have a well-trained MNIST handwriting recognition model which can identify 0-9 correctly. Now we have a hand-written 8, and we want to know "how can we erase a part of this handwritten 8 to change it, say, to 3?"

In this test, we take the all-zeros (black) image as the reference input, since black is the background of the images. We subtract the contribution scores of the pixels for class 3 from those for class 8, and erase the 20% of pixels with the highest resulting score. DeepLIFT identifies the left side of the 8 as the part to erase, which is far better than the gradient-based methods it is compared with.
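The erasing step can be sketched as follows, assuming per-pixel contribution scores for the two classes are already available. The function name, the flattened 10-pixel "image", and the score values are all made up for illustration and are not results from [1].

```python
def erase_top_pixels(image, scores_source, scores_target, fraction=0.2):
    """Set to 0 the given fraction of pixels with the largest score difference."""
    diff = [s - t for s, t in zip(scores_source, scores_target)]
    k = int(len(image) * fraction)
    # Pixel indices ordered from the largest to the smallest score difference.
    order = sorted(range(len(image)), key=lambda i: diff[i], reverse=True)
    erased = list(image)
    for i in order[:k]:
        erased[i] = 0.0   # 0 is the background (reference) value
    return erased


# Hypothetical flattened image and per-pixel scores for classes 8 and 3.
image = [0.0, 0.9, 0.8, 0.0, 0.7, 0.6, 0.0, 0.5, 0.4, 0.0]
scores_8 = [0.0, 0.5, 0.1, 0.0, 0.4, 0.1, 0.0, 0.2, 0.1, 0.0]
scores_3 = [0.0, 0.1, 0.2, 0.0, 0.1, 0.1, 0.0, 0.2, 0.3, 0.0]

erased = erase_top_pixels(image, scores_8, scores_3)
print(erased)
```

In this toy example the two pixels that support class 8 much more strongly than class 3 (indices 1 and 4) are the ones set to the background value.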

DNA motif detection

Suppose we have a long DNA sequence $\left\{ x_n \right\}$, and a neural network that detects whether the sequence contains the GATA motif (a short DNA sub-sequence) or the TAL motif. The training dataset is a mixture of randomly generated DNA sequences and real-world DNA sequences containing GATA and TAL, and the network is well-trained. We know the behavior of the neural network:

  • The network assigns the tag (0,0) to those sequences which are believed to be randomly generated.
  • The network assigns the tag (1,0) to those sequences which are believed to be real-world and contain GATA but not TAL, and (0,1) to those sequences which contain TAL but not GATA.
  • The network assigns the tag (1,1) to those sequences which are believed to be real-world and contain both GATA and TAL.

The question is: "What makes sequences with tag (1,0) different from those with tag (0,0)?" The answer should be "whether the sequence contains the GATA motif". We sample a sequence $\left\{ a_n \right\}$ with tag (1,0), take a randomly generated sequence $\left\{ b_n \right\}$ with tag (0,0) as the reference, and assign contribution scores to $\left\{ a_n \right\}$ using different schemes. We expect a good scheme to reveal the answer above by highlighting the GATA motif in $\left\{ a_n \right\}$. The authors of [1] show that DeepLIFT performs much better than the other schemes at highlighting the requested motif.


[1] Shrikumar, A., Greenside, P., and Kundaje, A. Learning Important Features Through Propagating Activation Differences. arXiv:1704.02685

[2] Video tutorial to DeepLIFT: http://goo.gl/qKb7pL