STAT946F17/ Learning Important Features Through Propagating Activation Differences
This is a summary of the ICML 2017 paper [1].
Introduction
Deep neural networks are often criticized for their "black box" nature, which is a barrier to adoption in applications where interpretability is essential. The "black box" nature also makes it difficult to analyze and improve the structure of the model. The paper summarized here presents DeepLIFT, a method that decomposes the output of a neural network on a specific input by backpropagating the contributions of all neurons in the network to every feature of the input. This is a form of sensitivity analysis and helps us understand the model better.
Sensitivity Analysis
Sensitivity analysis is a concept from risk management and actuarial science. According to [Investopedia], a sensitivity analysis is a technique used to determine how changes in an independent variable influence a particular dependent variable under given assumptions. This technique is used within specific boundaries that depend on one or more input variables, such as the effect that changes in interest rates have on bond prices.
In our setting, we have a well-trained deep neural network, two high-dimensional input vectors $x_0, x_1$, and the corresponding outputs $y_0=f(x_0), y_1=f(x_1)$. Given that $x_1$ is a perturbation of $x_0$, we want to know which elements of $x_1 - x_0$ contribute the most to $y_1 - y_0$.
As one can imagine, if $\left| x_1 - x_0 \right|$ is small, the most "crude" approximation is to calculate
$\left . \frac{\partial y}{\partial x} \right|_{x = x_0} $
and take its largest element in absolute value. This is feasible because back-propagation lets us calculate the derivatives layer by layer. However, this method doesn't always work well.
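The gradient-based approach above can be sketched in a few lines. This is only an illustration under stated assumptions: the one-neuron `model` with a tanh activation, the weights `w`, and the input `x0` are all hypothetical stand-ins for a trained network, and the gradient is written out analytically rather than obtained by back-propagation.

```python
import math

def model(x, w):
    """Toy stand-in for a network: y = tanh(sum_i w_i * x_i)."""
    return math.tanh(sum(wi * xi for wi, xi in zip(w, x)))

def gradient(x, w):
    """Analytic gradient: dy/dx_i = (1 - tanh(s)^2) * w_i, with s = w . x."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    scale = 1.0 - math.tanh(s) ** 2
    return [scale * wi for wi in w]

def most_sensitive_feature(x, w):
    """Index of the gradient entry with the largest absolute value."""
    g = gradient(x, w)
    return max(range(len(g)), key=lambda i: abs(g[i]))

# Hypothetical weights and input, chosen only for illustration.
w = [0.5, -2.0, 1.0]
x0 = [0.1, 0.2, 0.3]
print(most_sensitive_feature(x0, w))  # prints 1: the feature with the largest |w_i|
```

For this toy model the ranking of features is simply the ranking of $|w_i|$, because every gradient entry shares the same factor $1 - \tanh^2(s)$.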
Failure of traditional methods
to be done
DeepLIFT scheme
DeepLIFT assigns a contribution score $C_{\Delta y / \Delta x_i} = m_{\Delta y / \Delta x_i} \Delta x_i$ to each element $x_i$, where $\Delta$ denotes the difference from the value at a reference input, so that the contribution scores sum to the output difference: $\sum_{i=1}^N C_{\Delta y / \Delta x_i} = \Delta y$.
First, by the appendix of [1], the DeepLIFT multiplier $m$ behaves like a derivative and satisfies the chain rule: if $z = z\left( y(x_1,...,x_N) \right)$ then
$m_{\Delta z / \Delta x_i} = m_{\Delta z / \Delta y} m_{\Delta y / \Delta x_i} $
Second, consider a simple neuron $y = f(s)$ where $s = \sum_{i=1}^N w_i x_i$. We separate the positive and negative contributions of each $x_i$ and assign $m_{\Delta s / \Delta x_i}$ by the following rule, called the linear rule. The total positive contribution of the inputs to $s$ is defined as
$ \begin{align} \Delta s^{+} & = \sum_{i=1}^N 1_{\left\{ w_i \Delta x_i > 0 \right\}} w_i \Delta x_i \\ & = \sum_{i=1}^N \left( 1_{\left\{ w_i > 0 \right\}} w_i \Delta x_{i}^{+} + 1_{\left\{ w_i < 0 \right\}} w_i \Delta x_{i}^{-} \right) \\ \end{align} $
We then assign the multipliers $ m_{\Delta s^{+} / \Delta x_{i}^{+} } = 1_{\left\{ w_i > 0 \right\}} w_i $ and $ m_{\Delta s^{+} / \Delta x_{i}^{-} } = 1_{\left\{ w_i < 0 \right\}} w_i $. Similarly, we have $ m_{\Delta s^{-} / \Delta x_{i}^{+} } = 1_{\left\{ w_i < 0 \right\}} w_i $ and $ m_{\Delta s^{-} / \Delta x_{i}^{-} } = 1_{\left\{ w_i > 0 \right\}} w_i $. In the case $\Delta x_{i} = 0$, we let $ m_{\Delta s^{\pm} / \Delta x_{i}^{\pm} } = \frac{1}{2} w_i $.
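As a rough check of the linear rule, the sketch below computes $\Delta s^{+}$ and $\Delta s^{-}$ in the two equivalent ways given by the formula above and confirms that the contributions $w_i \Delta x_i$ sum to $\Delta s$. It assumes the $x_i$ are raw inputs, so that $\Delta x_i^{+} = \max(\Delta x_i, 0)$ and $\Delta x_i^{-} = \min(\Delta x_i, 0)$; the function names and the numbers are hypothetical.

```python
def linear_rule(w, x, x_ref):
    """Return per-input contributions w_i * dx_i and the sign split of delta_s."""
    delta_x = [xi - ri for xi, ri in zip(x, x_ref)]
    contributions = [wi * di for wi, di in zip(w, delta_x)]
    delta_s_pos = sum(c for c in contributions if c > 0)
    delta_s_neg = sum(c for c in contributions if c < 0)
    return contributions, delta_s_pos, delta_s_neg

def split_by_sign(w, x, x_ref):
    """Recompute delta_s^+ and delta_s^- from the signs of w_i and dx_i.

    Assumption: dx_i^+ = max(dx_i, 0) and dx_i^- = min(dx_i, 0).
    """
    pos = neg = 0.0
    for wi, xi, ri in zip(w, x, x_ref):
        d = xi - ri
        d_pos, d_neg = max(d, 0.0), min(d, 0.0)
        w_pos = wi if wi > 0 else 0.0  # 1{w_i > 0} * w_i
        w_neg = wi if wi < 0 else 0.0  # 1{w_i < 0} * w_i
        pos += w_pos * d_pos + w_neg * d_neg
        neg += w_neg * d_pos + w_pos * d_neg
    return pos, neg

# Hypothetical weights, input, and reference input.
w = [2.0, -1.0, 0.5]
x = [1.0, 3.0, -2.0]
x_ref = [0.0, 1.0, 0.0]

contributions, ds_pos, ds_neg = linear_rule(w, x, x_ref)
print(contributions)    # [2.0, -2.0, -1.0]
print(ds_pos, ds_neg)   # 2.0 -3.0
```

Here $\Delta s = s(x) - s(x_0) = -2 - (-1) = -1$, which matches both the sum of the contributions and $\Delta s^{+} + \Delta s^{-}$.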
For the function $f(s)$, the simplest option is the rescale rule, which uses the same multiplier $\frac{\Delta y}{\Delta s}$ for both $\Delta s^{+}$ and $\Delta s^{-}$:
$ \Delta y = \frac{\Delta y}{\Delta s} \Delta s = \frac{\Delta y}{\Delta s} \left( \Delta s^{+} + \Delta s^{-} \right) $
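A minimal sketch of the rescale rule for a ReLU neuron follows. The helper names and the example values are hypothetical. The handling of $\Delta s \approx 0$ (returning a zero multiplier below the threshold `eps`) is an assumption of this sketch, chosen only to avoid dividing by zero.

```python
def relu(s):
    return max(0.0, s)

def rescale_rule(f, s_ref, delta_s_pos, delta_s_neg, eps=1e-7):
    """Return the shared multiplier and the shares of delta_y given to each part."""
    delta_s = delta_s_pos + delta_s_neg
    delta_y = f(s_ref + delta_s) - f(s_ref)
    # Assumption of this sketch: use a zero multiplier when delta_s is ~0,
    # since the ratio delta_y / delta_s is numerically unstable there.
    m = delta_y / delta_s if abs(delta_s) > eps else 0.0
    return m, m * delta_s_pos, m * delta_s_neg

# Hypothetical values: reference pre-activation -1, positive part 5, negative part -2.
m, dy_pos, dy_neg = rescale_rule(relu, -1.0, 5.0, -2.0)
print(m)                # the shared multiplier delta_y / delta_s
print(dy_pos + dy_neg)  # equals delta_y = relu(2) - relu(-1) up to rounding
```

Because both parts share one multiplier, the two shares always add up to $\Delta y$, but a positive and a negative part that cancel exactly receive no credit at all.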
The rescale rule works for simple functions such as ReLU, but it does not always work well, for example in cases such as pooling layers. To address this, we introduce the reveal-cancel rule:
Suppose our reference input is $x_0$ and $s_0 = \sum_{i=1}^N w_i x_{0,i}$ is the corresponding weighted sum, so that $y_0 = f(s_0)$. We define:
$ \Delta y^{+} = \frac{1}{2} \left[ f(s_0 + \Delta s^{+}) - f(s_0) \right] + \frac{1}{2} \left[ f(s_0 + \Delta s^{+} + \Delta s^{-}) - f(s_0 + \Delta s^{-}) \right] $
$ \Delta y^{-} = \frac{1}{2} \left[ f(s_0 + \Delta s^{-}) - f(s_0) \right] + \frac{1}{2} \left[ f(s_0 + \Delta s^{+} + \Delta s^{-}) - f(s_0 + \Delta s^{+}) \right] $
$ m_{\Delta y / \Delta s^{+} } = \Delta y^{+} / \Delta s^{+} , m_{\Delta y / \Delta s^{-} } = \Delta y^{-} / \Delta s^{-} $
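The definitions above translate directly into code. The sketch below evaluates $\Delta y^{+}$ and $\Delta y^{-}$ for a ReLU neuron; the function names and the example values are hypothetical, and returning a zero multiplier when a part is exactly zero is an assumption made only to avoid division by zero.

```python
def relu(s):
    return max(0.0, s)

def reveal_cancel(f, s_ref, ds_pos, ds_neg):
    """Return (delta_y_pos, delta_y_neg) following the two definitions above."""
    dy_pos = 0.5 * (f(s_ref + ds_pos) - f(s_ref)) \
           + 0.5 * (f(s_ref + ds_pos + ds_neg) - f(s_ref + ds_neg))
    dy_neg = 0.5 * (f(s_ref + ds_neg) - f(s_ref)) \
           + 0.5 * (f(s_ref + ds_pos + ds_neg) - f(s_ref + ds_pos))
    return dy_pos, dy_neg

def multipliers(dy_pos, dy_neg, ds_pos, ds_neg):
    """m = delta_y^{+/-} / delta_s^{+/-}; zero when the part is zero (sketch assumption)."""
    m_pos = dy_pos / ds_pos if ds_pos != 0 else 0.0
    m_neg = dy_neg / ds_neg if ds_neg != 0 else 0.0
    return m_pos, m_neg

# Positive and negative parts cancel exactly, so delta_y = 0 for ReLU at s_ref = 0.
dy_pos, dy_neg = reveal_cancel(relu, 0.0, 3.0, -3.0)
print(dy_pos, dy_neg)                          # 1.5 -1.5
print(multipliers(dy_pos, dy_neg, 3.0, -3.0))  # (0.5, 0.5)
```

In this example $\Delta y^{+} + \Delta y^{-} = 0 = \Delta y$, yet each part still receives a nonzero share, whereas a single multiplier $\Delta y / \Delta s$ would be undefined or zero here.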
Given the rules above, it is straightforward to calculate $ m_{\Delta y / \Delta x_i^{\pm} } $ for each $x_i$, and thus the DeepLIFT multiplier $m$ and contribution $C$ of a given input and output relative to a reference input $x_0$. The authors suggest that the reference input should be chosen case by case; no general rule for choosing $x_0$ is currently available.
Numerical results
to be done
References
[1] Shrikumar, A., Greenside, P., and Kundaje, A. Learning Important Features Through Propagating Activation Differences. ICML 2017. arXiv:1704.02685