Beyond Word Importance: Contextual Decomposition to Extract Interactions from LSTMs

Introduction

The main reason behind the recent success of LSTMs (Long Short-Term Memory networks), and of deep neural networks in general, has been their ability to model complex and non-linear interactions. Our inability to fully comprehend these relationships has led to these state-of-the-art models being regarded as black boxes. The paper "Beyond Word Importance: Contextual Decomposition to Extract Interactions from LSTMs" by W. James Murdoch, Peter J. Liu, and Bin Yu proposes an interpretation algorithm called Contextual Decomposition (CD) for analyzing individual predictions made by an LSTM without any change to the underlying model. The problem of sentiment analysis is chosen for evaluating the method.

Overview of previous work

The authors offer two motivations for their work:

  1. Prior interpretation methods only assign importance scores to individual words, and so cannot capture how words combine within a phrase (for example, negation and other compositional effects)
  2. To extract such phrase-level interpretations directly from a trained LSTM, without modifying or retraining the underlying model


Long Short-Term Memory Networks

Over the past few years LSTMs have become a core component of neural NLP systems and of sequence modelling systems in general. LSTMs are a special kind of Recurrent Neural Network (RNN) which in many cases work better than standard RNNs by mitigating the vanishing gradient problem. To put it simply, they are much more effective at learning long-term dependencies. Like a standard RNN, an LSTM is made up of a chain of repeating modules. The difference is that the modules are a little more complicated: instead of having one neural network layer as in an RNN, they have four (called gates), interacting in a particular way. Additionally, they have a cell state which runs through the entire chain of the network and helps manage the information passed along from previous cells in the chain.

Let's now define this more formally. Given a sequence of word embeddings [math]\displaystyle{ x_1, ..., x_T \in R^{d_1} }[/math], a cell and state vector [math]\displaystyle{ c_t, h_t \in R^{d_2} }[/math] are computed for each element by iteratively applying the equations below, with initializations [math]\displaystyle{ h_0 = c_0 = 0 }[/math].

\begin{align}
o_t &= \sigma(W_ox_t + V_oh_{t-1} + b_o) && (1) \\
f_t &= \sigma(W_fx_t + V_fh_{t-1} + b_f) && (2) \\
i_t &= \sigma(W_ix_t + V_ih_{t-1} + b_i) && (3) \\
g_t &= tanh(W_gx_t + V_gh_{t-1} + b_g) && (4) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t && (5) \\
h_t &= o_t \odot tanh(c_t) && (6)
\end{align}

Where [math]\displaystyle{ W_o, W_i, W_f, W_g \in R^{{d_2}×{d_1}}, V_o, V_f, V_i, V_g \in R^{{d_2}×{d_2}}, b_o, b_f, b_i, b_g \in R^{d_2} }[/math] and [math]\displaystyle{ \odot }[/math] denotes element-wise multiplication. [math]\displaystyle{ o_t, f_t }[/math] and [math]\displaystyle{ i_t }[/math] are often referred to as output, forget and input gates, respectively, due to the fact that their values are bounded between 0 and 1, and that they are used in element-wise multiplication.

After processing the full sequence of words, the final state [math]\displaystyle{ h_T }[/math] is fed to a multinomial logistic regression, to return a probability distribution over C classes.

\begin{align} p_j = SoftMax(Wh_T)_j = \frac{\exp(W_jh_T)}{\sum_{k=1}^C\exp(W_kh_T) } \end{align}
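
To make these updates concrete, below is a minimal NumPy sketch of the forward pass just described. The names W, V, b and W_out mirror the notation above; this is an illustration, not the authors' implementation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forward(x, W, V, b, W_out):
    # x: list of T word embeddings, each of dimension d1
    # W, V, b: dicts keyed by gate name ('o', 'f', 'i', 'g'), shapes (d2, d1), (d2, d2), (d2,)
    # W_out: (C, d2) weights of the final multinomial logistic regression
    d2 = W['o'].shape[0]
    h, c = np.zeros(d2), np.zeros(d2)
    for x_t in x:
        o = sigmoid(W['o'] @ x_t + V['o'] @ h + b['o'])   # output gate, eq. (1)
        f = sigmoid(W['f'] @ x_t + V['f'] @ h + b['f'])   # forget gate, eq. (2)
        i = sigmoid(W['i'] @ x_t + V['i'] @ h + b['i'])   # input gate,  eq. (3)
        g = np.tanh(W['g'] @ x_t + V['g'] @ h + b['g'])   # candidate,   eq. (4)
        c = f * c + i * g                                  # cell state,  eq. (5)
        h = o * np.tanh(c)                                 # hidden state, eq. (6)
    scores = W_out @ h                                     # Wh_T
    p = np.exp(scores - scores.max())                      # numerically stable SoftMax
    return p / p.sum()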


Contextual Decomposition (CD) of LSTM

CD decomposes the output of the LSTM into a sum of two contributions:

  1. those resulting solely from the given phrase
  2. those involving other factors

It is crucial to understand that this method does not affect the architecture or the predictive accuracy of the model in any way. It takes the trained model, together with a particular phrase the user wants to understand (or the entire sentence), and breaks the model's output down into the two components mentioned above.

Now let's define this more formally. Let the arbitrary input phrase be [math]\displaystyle{ x_q, ..., x_r }[/math], where [math]\displaystyle{ 1 \leq q \leq r \leq T }[/math] and T represents the length of the sentence. CD decomposes the state and cell vectors ([math]\displaystyle{ h_t, c_t }[/math]) of each cell into a sum of two contributions, as shown in the equations below.

\begin{align} h_t = \beta_t + \gamma_t \end{align} \begin{align} c_t = \beta_t^c + \gamma_t^c \end{align}

In this decomposition, [math]\displaystyle{ \beta_t }[/math] and [math]\displaystyle{ \gamma_t }[/math] correspond to the contributions to [math]\displaystyle{ h_t }[/math] coming solely from the given phrase and from the other factors, respectively. Similarly, [math]\displaystyle{ \beta_t^c }[/math] and [math]\displaystyle{ \gamma_t^c }[/math] represent the contributions to [math]\displaystyle{ c_t }[/math] coming solely from the given phrase and from the other factors, respectively.

Using this decomposition, the final output [math]\displaystyle{ Wh_T = W\beta_T + W\gamma_T }[/math] yields the prediction \begin{align} p = SoftMax(W\beta_T + W\gamma_T) \end{align}

As the phrase's contribution [math]\displaystyle{ W\beta_T }[/math] corresponds to an input to the final logistic regression, it may be interpreted in the same way as a standard logistic regression coefficient.
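
In code, once [math]\displaystyle{ \beta_T }[/math] and [math]\displaystyle{ \gamma_T }[/math] are available (see the recursion sketched after the [math]\displaystyle{ h_t }[/math] decomposition below), the phrase-level score can be read off as follows. This is a minimal sketch with hypothetical variable names.

import numpy as np

def phrase_score(W, beta_T, gamma_T):
    # W: (C, d2) classifier weights; beta_T, gamma_T: CD decomposition of h_T
    logits_phrase = W @ beta_T          # contribution of the phrase alone
    logits_other = W @ gamma_T          # contribution of everything else
    p = np.exp(logits_phrase + logits_other)
    return logits_phrase, p / p.sum()   # CD score and the model's prediction p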

Disambiguating interactions between gates

Intuition: in the equations for the calculation of [math]\displaystyle{ i_t }[/math] and [math]\displaystyle{ g_t }[/math] in the LSTM, we use the input at that time step, [math]\displaystyle{ x_t }[/math], as well as the output of the previous time step, [math]\displaystyle{ h_{t-1} }[/math]. Therefore, when [math]\displaystyle{ i_t \odot g_t }[/math] is calculated, the contributions made by [math]\displaystyle{ x_t }[/math] to [math]\displaystyle{ i_t }[/math] interact with the contributions made by [math]\displaystyle{ h_{t-1} }[/math] to [math]\displaystyle{ g_t }[/math], and vice versa. This insight is used to construct the decomposition.

At this stage we need to assume that the non-linear operations at the gates can be represented in a linear fashion; how this is done is explained in a later part of the summary. Writing equation (3) as a linear sum of contributions from its inputs, we have \begin{align} i_t &= \sigma(W_ix_t + V_ih_{t-1} + b_i) \\ & = L_\sigma(W_ix_t) + L_\sigma(V_ih_{t-1}) + L_\sigma(b_i) \end{align}

The important thing to notice is that, after this linearization, the products between gates become sums of cross terms involving the two factors mentioned above. Expanding the product, we can tell whether a term results solely from the phrase ([math]\displaystyle{ L_\sigma(V_i\beta_{t-1}) \odot L_{tanh}(V_g\beta_{t-1}) }[/math]), solely from the other factors ([math]\displaystyle{ L_\sigma(b_i) \odot L_{tanh}(V_g\gamma_{t-1}) }[/math]), or from an interaction between the phrase and the other factors ([math]\displaystyle{ L_\sigma(V_i\beta_{t-1}) \odot L_{tanh}(V_g\gamma_{t-1}) }[/math]).

Just as the LSTM updates [math]\displaystyle{ c_t }[/math] and [math]\displaystyle{ h_t }[/math] recursively, we compute the decompositions recursively, with the initializations [math]\displaystyle{ \beta_0 = \beta_0^c = \gamma_0 = \gamma_0^c = 0 }[/math]. The derivations vary slightly depending on whether the current time step is contained within the phrase ([math]\displaystyle{ q \leq t \leq r }[/math]) or not ([math]\displaystyle{ t \lt q }[/math] or [math]\displaystyle{ t \gt r }[/math]).

So essentially what we do now is linearize each of the gates, expand the product of the resulting sums, and finally group the terms depending on which type of interaction they represent (solely from the phrase, solely from other factors, or a combination of both).

Terms are determined to derive solely from the specified phrase if they involve products of some combination of [math]\displaystyle{ \beta_{t-1}, \beta_{t-1}^c, x_t }[/math] and [math]\displaystyle{ b_i }[/math] or [math]\displaystyle{ b_g }[/math] (but not both biases). In the other case, when t is not within the phrase, products involving [math]\displaystyle{ x_t }[/math] are treated as not deriving from the phrase.
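
The rule above can be made explicit with a small (hypothetical) helper that decides whether a cross term is attributed to the phrase, given the sources of its factors.

def from_phrase(sources, t_in_phrase):
    # sources: labels of the factors in a cross term, drawn from
    # {'x', 'beta', 'gamma', 'bias'}; t_in_phrase: whether q <= t <= r.
    if 'gamma' in sources:                  # anything touching gamma is not purely phrase
        return False
    if not t_in_phrase and 'x' in sources:  # x_t outside the phrase does not count
        return False
    return sources.count('bias') <= 1       # at most one of b_i, b_g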

\begin{align} f_t\odot c_{t-1} &= (L_\sigma(W_fx_t) + L_\sigma(V_f\beta_{t-1}) + L_\sigma(V_f\gamma_{t-1}) + L_\sigma(b_f)) \odot (\beta_{t-1}^c + \gamma_{t-1}^c) \\ & = ([L_\sigma(W_fx_t) + L_\sigma(V_f\beta_{t-1}) + L_\sigma(b_f)] \odot \beta_{t-1}^c) + (L_\sigma(V_f\gamma_{t-1}) \odot \beta_{t-1}^c + f_t \odot \gamma_{t-1}^c) \\ & = \beta_t^f + \gamma_t^f \end{align}

Similarly

\begin{align} i_t \odot g_t = \beta_t^u + \gamma_t^u \end{align}

Thus we can represent [math]\displaystyle{ c_t }[/math] as

\begin{align} c_t &= \beta_t^f + \gamma_t^f + \beta_t^u + \gamma_t^u \\ & = \beta_t^f + \beta_t^u + \gamma_t^f + \gamma_t^u \\ & = \beta_t^c + \gamma_t^c \end{align}

Once we have the decomposition of [math]\displaystyle{ c_t }[/math], we can compute the decomposition of [math]\displaystyle{ h_t }[/math] rather simply by linearizing the [math]\displaystyle{ tanh }[/math] function. Again, at this point we just assume that a linearizing function for [math]\displaystyle{ tanh }[/math] exists. Similar to the decomposition of the forget and input gates, the output gate could be decomposed as well, but empirically this was found not to produce improved results. So finally [math]\displaystyle{ h_t }[/math] can be written as

\begin{align} h_t &= o_t \odot tanh(c_t) \\ & = o_t \odot [L_{tanh}(\beta_t^c) + L_{tanh}(\gamma_t^c)] \\ & = o_t \odot L_{tanh}(\beta_t^c) + o_t \odot L_{tanh}(\gamma_t^c) \\ & = \beta_t + \gamma_t \end{align}
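
Putting the pieces together, here is a NumPy sketch of the full CD recursion. It follows the groupings above; time steps outside the phrase are handled by directly applying the attribution rule stated earlier, so small details may differ from the paper's exact equations, and the function linearize (which returns one additive contribution per term) is sketched in the next section. The function name and argument layout are assumptions of this sketch.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contextual_decomposition(x, q, r, W, V, b, linearize):
    # x: list of T word embeddings; the phrase is x[q..r] (0-indexed, inclusive).
    # W, V, b: trained LSTM parameters as dicts keyed by 'i', 'f', 'o', 'g'.
    # linearize(f, terms): one additive contribution per term (see next section).
    d2 = W['i'].shape[0]
    beta, gamma = np.zeros(d2), np.zeros(d2)
    beta_c, gamma_c = np.zeros(d2), np.zeros(d2)

    for t, x_t in enumerate(x):
        in_phrase = q <= t <= r
        h_prev = beta + gamma

        # Linearized contributions to each gate: [from x_t, beta_{t-1}, gamma_{t-1}, bias]
        Lx_i, Lb_i, Lg_i, Lbias_i = linearize(
            sigmoid, [W['i'] @ x_t, V['i'] @ beta, V['i'] @ gamma, b['i']])
        Lx_g, Lb_g, Lg_g, Lbias_g = linearize(
            np.tanh, [W['g'] @ x_t, V['g'] @ beta, V['g'] @ gamma, b['g']])
        Lx_f, Lb_f, Lg_f, Lbias_f = linearize(
            sigmoid, [W['f'] @ x_t, V['f'] @ beta, V['f'] @ gamma, b['f']])

        # Full gate values (the decomposition never changes the model itself)
        i_t = sigmoid(W['i'] @ x_t + V['i'] @ h_prev + b['i'])
        g_t = np.tanh(W['g'] @ x_t + V['g'] @ h_prev + b['g'])
        f_t = sigmoid(W['f'] @ x_t + V['f'] @ h_prev + b['f'])
        o_t = sigmoid(W['o'] @ x_t + V['o'] @ h_prev + b['o'])

        # Pieces that count as "from the phrase" (x_t only if t lies in the phrase)
        rel_i = (Lx_i if in_phrase else 0) + Lb_i
        rel_g = (Lx_g if in_phrase else 0) + Lb_g
        rel_f = (Lx_f if in_phrase else 0) + Lb_f

        # f_t ⊙ c_{t-1}: phrase pieces act on beta^c_{t-1}; the remainder goes to gamma
        beta_f = (rel_f + Lbias_f) * beta_c
        gamma_f = f_t * (beta_c + gamma_c) - beta_f

        # i_t ⊙ g_t: cross terms of phrase pieces, plus a single bias with a phrase piece
        beta_u = rel_i * rel_g + Lbias_i * rel_g + rel_i * Lbias_g
        gamma_u = i_t * g_t - beta_u

        beta_c, gamma_c = beta_f + beta_u, gamma_f + gamma_u

        # h_t = o_t ⊙ tanh(c_t), with tanh linearized over (beta^c_t, gamma^c_t)
        tanh_rel, tanh_irrel = linearize(np.tanh, [beta_c, gamma_c])
        beta, gamma = o_t * tanh_rel, o_t * tanh_irrel

    return beta, gamma   # beta_T, gamma_T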

Linearizing activation functions

We now explain the assumption made earlier about the linearizing functions [math]\displaystyle{ L_{\sigma} }[/math] and [math]\displaystyle{ L_{tanh} }[/math]. For arbitrary [math]\displaystyle{ y_1, ..., y_N \in R }[/math], the problem we intend to solve is essentially to find [math]\displaystyle{ L_{tanh} }[/math] such that \begin{align} tanh(\sum_{i=1}^Ny_i) = \sum_{i=1}^NL_{tanh}(y_i) \end{align}

In cases where {[math]\displaystyle{ y_i }[/math]} follow a natural ordering, Murdoch & Szlam (2017) used the difference of partial sums as a linearization technique, as shown in the equation below \begin{align} L^{'}_{tanh}(y_k) = tanh(\sum_{j=1}^ky_j) - tanh(\sum_{j=1}^{k-1}y_j) \end{align}

But in our case the terms do not follow any particular ordering; for example, when calculating [math]\displaystyle{ o_t }[/math] we could write its pre-activation as the sum of [math]\displaystyle{ W_ox_t, V_oh_{t-1}, b_o }[/math] or of [math]\displaystyle{ b_o, V_oh_{t-1}, W_ox_t }[/math]. Thus, we average over all possible orderings. Letting [math]\displaystyle{ \pi_1, ..., \pi_{M_N} }[/math] denote the set of all permutations of [math]\displaystyle{ 1, ..., N }[/math], the score is given below

\begin{align} L_{tanh}(y_k) = \frac{1}{M_N}\sum_{i=1}^{M_N}[tanh(\sum_{j=1}^{\pi_i^{-1}(k)}y_{\pi_i(j)}) - tanh(\sum_{j=1}^{\pi_i^{-1}(k) - 1}y_{\pi_i(j)})] \end{align}

We can similarly derive [math]\displaystyle{ L_{\sigma} }[/math]. An important empirical observation is that when one of the terms of the decomposition is a bias, improvements were seen when restricting to permutations where the bias is the first term. In our case the value of N only ranges from 2 to 4, so the linearization takes very simple forms. An example for N = 2 is shown below.

\begin{align} L_{tanh}(y_1) = \frac{1}{2}([tanh(y_1) - tanh(0)] + [tanh(y_2 + y_1) - tanh(y_2)]) \end{align}
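
A straightforward implementation of this permutation-averaged linearization might look as follows; the bias_first_index argument reflects the empirical observation above, and the exact interface is an assumption of this sketch.

import itertools
import numpy as np

def linearize(f, terms, bias_first_index=None):
    # Returns one additive contribution per term; the contributions sum to
    # f(sum(terms)) - f(0), which equals f(sum(terms)) exactly for tanh.
    N = len(terms)
    terms = [np.asarray(y, dtype=float) for y in terms]
    contribs = [np.zeros_like(terms[0]) for _ in range(N)]
    perms = [p for p in itertools.permutations(range(N))
             if bias_first_index is None or p[0] == bias_first_index]
    for perm in perms:
        running = np.zeros_like(terms[0])
        prev = f(running)
        for idx in perm:                                   # add terms in this order
            running = running + terms[idx]
            cur = f(running)
            contribs[idx] = contribs[idx] + (cur - prev)   # marginal contribution of terms[idx]
            prev = cur
    return [c / len(perms) for c in contribs]

For instance, linearize(np.tanh, [y1, y2])[0] reproduces the N = 2 expression for L_tanh(y_1) shown above.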


Experiments

As mentioned earlier, the empirical validation of CD is done on the task of sentiment analysis. The experiments aim to demonstrate the following three properties:

  1. It should work on the standard problem of word-level importance scores
  2. It should behave sensibly for words as well as phrases, especially in situations involving compositionality.
  3. It should be able to extract instances of positive and negative negation.

A fact worth mentioning again is that the primary objective of the paper is to produce meaningful interpretations of a pre-trained LSTM model, rather than to achieve state-of-the-art results on sentiment analysis. Therefore, standard practices are used for tuning the models. The models are implemented in Torch using default parameters for weight initialization. The model was trained using Adam with a learning rate of 0.001 and early stopping on the validation set. Additionally, a bag-of-words linear model was used as a baseline.
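
For reference, here is a rough sketch of such a training setup in modern PyTorch; the original was implemented in (Lua) Torch, and the dimensions, vocabulary size and data loaders below are placeholders rather than values from the paper.

import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    # Embedding -> single-layer LSTM -> logistic regression on the final state h_T
    def __init__(self, vocab_size, d1=300, d2=128, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d1)
        self.lstm = nn.LSTM(d1, d2, batch_first=True)
        self.out = nn.Linear(d2, num_classes)

    def forward(self, tokens):                    # tokens: (batch, T)
        _, (h_T, _) = self.lstm(self.embed(tokens))
        return self.out(h_T[-1])                  # logits = W h_T

model = LSTMClassifier(vocab_size=20000)          # placeholder vocabulary size
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

best_val, bad_epochs, patience = float('inf'), 0, 3
for epoch in range(50):
    model.train()
    for tokens, labels in train_loader:           # train_loader/val_loader assumed to exist
        optimizer.zero_grad()
        criterion(model(tokens), labels).backward()
        optimizer.step()
    model.eval()                                  # early stopping on validation loss
    with torch.no_grad():
        val_loss = sum(criterion(model(t), y).item() for t, y in val_loader)
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break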

All the experiments were performed on the Stanford Sentiment Treebank (SST) dataset and the Yelp Polarity (YP) dataset. SST is a standard NLP benchmark which consists of movie reviews ranging from 2 to 52 words long. The LSTM model attained an accuracy of 87.2%, whereas the logistic regression model with bag-of-words features attained an accuracy of 83.2%. In the case of YP, the task is binary sentiment classification; only reviews of at most 40 words were considered. The LSTM model attained a 4.6% error, compared to 5.7% for the regression model.

Baselines

The interpretations are compared with four state-of-the-art baselines for interpretability:

  1. Cell Decomposition (Murdoch & Szlam, 2017),
  2. Integrated Gradients (Sundararajan et al., 2017),
  3. Leave One Out (Li et al., 2016),
  4. Gradient times input


Unigram (Word) Scores

Logistic regression coefficients, though coming from a less accurate model, are considered the gold standard for interpretability. Applied to a task like sentiment analysis, the coefficient of each word directly gives its importance. Thus we would expect the CD scores extracted from an LSTM to have a meaningful relationship with the logistic regression coefficients. Details about visualization.
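
As a concrete way to make that comparison, one could correlate the two sets of scores. A minimal sketch, assuming dictionaries cd_scores (CD score of each word on its own, from the LSTM) and lr_coefs (bag-of-words logistic regression coefficients) have already been computed:

from scipy.stats import pearsonr

def score_agreement(cd_scores, lr_coefs):
    # cd_scores, lr_coefs: dicts mapping each word to its CD score / LR coefficient
    common = sorted(set(cd_scores) & set(lr_coefs))
    r, _ = pearsonr([cd_scores[w] for w in common],
                    [lr_coefs[w] for w in common])
    return r   # Pearson correlation between the two notions of word importance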

Identifying Dissenting Subphrases

First, the paper shows that the existing methods are not able to recognize sub-phrases within a phrase that carry a different sentiment. The phrases considered are of at most 5 words. For example, consider the phrase "used to be my favorite". The word "favorite" is strongly positive, which is also reflected in its high logistic regression coefficient.