Beyond Word Importance: Contextual Decomposition to Extract Interactions from LSTMs
 
== Introduction ==
The main reason behind the recent success of LSTMs (Long Short-Term Memory networks), and of deep neural networks in general, has been their ability to model complex and non-linear interactions. Our inability to fully comprehend these relationships has led to these state-of-the-art models being regarded as black boxes. The paper "Beyond Word Importance: Contextual Decomposition to Extract Interactions from LSTMs" by W. James Murdoch, Peter J. Liu, and Bin Yu proposes an interpretation algorithm called Contextual Decomposition (CD) for analyzing individual predictions made by an LSTM without any change to the underlying model. Sentiment analysis is chosen as the task for evaluating the method.
 
==Overview of previous work==
The authors offer two motivations for their work:
# To translate between languages for which large parallel corpora do not exist
# To provide a strong lower bound that any semi-supervised machine translation system should be expected to exceed
 
 
==Long Short-Term Memory Networks==
Over the past few years, LSTMs have become a core component of neural NLP systems and of sequence modelling systems in general. LSTMs are a special kind of Recurrent Neural Network (RNN) which in many cases works better than a standard RNN by mitigating the vanishing gradient problem. Put simply, they are much more effective at learning long-term dependencies. Like a standard RNN, an LSTM is made up of a chain of repeating modules; the difference is that each module is a little more complicated. Instead of having one neural network layer as in an RNN, it has four (called gates), interacting in a special way. Additionally, an LSTM maintains a cell state which runs through the entire chain and helps manage the information passed along from previous cells.
 
Let's now define this more formally. Given a sequence of word embeddings <math>x_1, ..., x_T \in R^{d_1}</math>, a cell vector and a state vector <math>c_t, h_t \in R^{d_2}</math> are computed for each element by iteratively applying the equations below, with initialization <math>h_0 = c_0 = 0</math>.
 
\begin{align}
o_t &= \sigma(W_o x_t + V_o h_{t-1} + b_o) && (1) \\
f_t &= \sigma(W_f x_t + V_f h_{t-1} + b_f) && (2) \\
i_t &= \sigma(W_i x_t + V_i h_{t-1} + b_i) && (3) \\
g_t &= \tanh(W_g x_t + V_g h_{t-1} + b_g) && (4) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t && (5) \\
h_t &= o_t \odot \tanh(c_t) && (6)
\end{align}
 
Here <math>W_o, W_i, W_f, W_g \in R^{{d_2} \times {d_1}}</math>, <math>V_o, V_f, V_i, V_g \in R^{{d_2} \times {d_2}}</math>, <math>b_o, b_f, b_i, b_g \in R^{d_2}</math>, and <math>\odot</math> denotes element-wise multiplication. <math>o_t, f_t</math> and <math>i_t</math> are often referred to as the output, forget and input gates, respectively, because their values are bounded between 0 and 1 and they are used in element-wise multiplications.
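To make these update equations concrete, below is a minimal NumPy sketch of a single LSTM time step following equations (1)-(6). The dimensions <math>d_1, d_2</math> and the randomly initialized parameters are purely illustrative, not trained weights.

<pre>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step, following equations (1)-(6) above."""
    o_t = sigmoid(params["W_o"] @ x_t + params["V_o"] @ h_prev + params["b_o"])  # output gate, eq. (1)
    f_t = sigmoid(params["W_f"] @ x_t + params["V_f"] @ h_prev + params["b_f"])  # forget gate, eq. (2)
    i_t = sigmoid(params["W_i"] @ x_t + params["V_i"] @ h_prev + params["b_i"])  # input gate,  eq. (3)
    g_t = np.tanh(params["W_g"] @ x_t + params["V_g"] @ h_prev + params["b_g"])  # candidate,   eq. (4)
    c_t = f_t * c_prev + i_t * g_t                                               # cell state,  eq. (5)
    h_t = o_t * np.tanh(c_t)                                                     # hidden state, eq. (6)
    return h_t, c_t

# Illustrative dimensions and randomly initialized (untrained) parameters.
d1, d2 = 4, 3
rng = np.random.default_rng(0)
params = {}
for name in ["o", "f", "i", "g"]:
    params["W_" + name] = rng.normal(size=(d2, d1))   # W_* in R^{d2 x d1}
    params["V_" + name] = rng.normal(size=(d2, d2))   # V_* in R^{d2 x d2}
    params["b_" + name] = np.zeros(d2)                # b_* in R^{d2}

# Run the cell over a short sequence of (random) word embeddings.
h_t = c_t = np.zeros(d2)
for x_t in rng.normal(size=(5, d1)):
    h_t, c_t = lstm_step(x_t, h_t, c_t, params)
</pre>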
 
After processing the full sequence of words, the final state <math>h_T</math> is fed to a multinomial logistic regression, which returns a probability distribution over <math>C</math> classes:
 
\begin{align}
p_j = \text{SoftMax}(Wh_T)_j = \frac{\exp(W_jh_T)}{\sum_{k=1}^C \exp(W_kh_T)}
\end{align}
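As a small, self-contained illustration of this classification step (the final state <math>h_T</math> and the weight matrix <math>W</math> below are random placeholders, not values from a trained model):

<pre>
import numpy as np

def class_probabilities(W, h_T):
    """p_j = SoftMax(W h_T)_j, with the usual max-subtraction for numerical stability."""
    logits = W @ h_T
    logits = logits - logits.max()
    exp = np.exp(logits)
    return exp / exp.sum()

rng = np.random.default_rng(0)
C, d2 = 2, 3                      # e.g. binary sentiment with a 3-dimensional hidden state
W = rng.normal(size=(C, d2))      # illustrative, untrained weights
h_T = rng.normal(size=d2)         # stands in for the final LSTM state
p = class_probabilities(W, h_T)   # non-negative, sums to 1 over the C classes
</pre>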
==Contextual Decomposition (CD) of LSTM==
CD decomposes the output of the LSTM into a sum of two contributions:
# those resulting solely from the given phrase
# those involving other factors
 
It is crucial to understand that this method does not affect the architecture or the predictive accuracy of the model in any way. It takes the trained model and decomposes its prediction into the two components mentioned above, for any phrase the user wants to understand, up to and including the entire sentence.
 
Now let's define this more formally. Let the arbitrary input phrase be <math>x_q, ..., x_r</math>, where <math>1 \leq q \leq r \leq T</math> and <math>T</math> is the length of the sentence. CD decomposes the output and the cell state (<math>h_t, c_t</math>) of each cell into a sum of two contributions, as shown in the equations below.
 
\begin{align}
h_t = \beta_t + \gamma_t
\end{align}
\begin{align}
c_t = \beta_t^c + \gamma_t^c
\end{align}
 
In this decomposition, <math>\beta_t</math> and <math>\gamma_t</math> correspond to the contributions to <math>h_t</math> coming solely from the given phrase and from the other factors, respectively. Similarly, <math>\beta_t^c</math> and <math>\gamma_t^c</math> represent the contributions to <math>c_t</math> from the given phrase and from the other factors, respectively.
 
Using this decomposition, the final logit <math>Wh_T</math> equals <math>W\beta_T + W\gamma_T</math>, so the predicted probability distribution is given by
\begin{align}
p = \text{SoftMax}(W\beta_T + W\gamma_T)
\end{align}
 
As <math>W\beta_T</math> corresponds to the phrase's contribution to the input of this logistic regression, it may be interpreted in the same way as a standard logistic regression coefficient.
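Because the softmax input is linear in <math>h_T = \beta_T + \gamma_T</math>, the logit splits exactly into a phrase term <math>W\beta_T</math> and a remainder <math>W\gamma_T</math>. The sketch below illustrates this bookkeeping with placeholder vectors for <math>\beta_T</math> and <math>\gamma_T</math>; how CD actually computes these two components is the subject of the following sections, and the quantity used to score the phrase is <math>W\beta_T</math>.

<pre>
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
C, d2 = 2, 3
W = rng.normal(size=(C, d2))      # illustrative classifier weights

# Placeholder decomposition of the final hidden state; in CD these two pieces
# come from propagating the phrase / non-phrase split through the LSTM recursion.
beta_T = rng.normal(size=d2)      # contribution of the chosen phrase
gamma_T = rng.normal(size=d2)     # contribution of everything else
h_T = beta_T + gamma_T

phrase_score = W @ beta_T         # interpreted like logistic regression coefficients
p = softmax(W @ beta_T + W @ gamma_T)

# The decomposition leaves the model's prediction unchanged.
assert np.allclose(p, softmax(W @ h_T))
</pre>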
 
==Disambiguating Interactions between Gates==
 
===Intuition===
In the equations for <math>i_t</math> and <math>g_t</math> above, the LSTM uses both the input at the current time step, <math>x_t</math>, and the output of the previous time step, <math>h_{t-1}</math>. Therefore, when <math>i_t \odot g_t</math> is computed, the contributions made by <math>x_t</math> to <math>i_t</math> interact with the contributions made by <math>h_{t-1}</math> to <math>g_t</math>, and vice versa. This insight is used to construct the decomposition.
 
At this stage we need to assume that the non-linear operations at the gates can be decomposed into a linear sum of contributions from their inputs; how this linearization is carried out is explained in a later part of the summary. Writing equation (3) as such a linear sum, we have
\begin{align}
i_t &= \sigma(W_ix_t + V_ih_{t-1} + b_i) \\
& = L_\sigma(W_ix_t) + L_\sigma(V_ih_{t-1}) + L_\sigma(b_i)
\end{align}
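As a hedged illustration of what such a linearization <math>L_\sigma</math> might look like (this is not necessarily the authors' exact construction, which is deferred to a later part of the summary), the sketch below assigns to each summand its average marginal effect on the sigmoid over all orderings of the summands, in the spirit of Shapley values. These contributions sum to <math>\sigma(W_ix_t + V_ih_{t-1} + b_i) - \sigma(0)</math>; the constant <math>\sigma(0)</math> still has to be assigned to one of the terms (for example the bias) to recover <math>i_t</math> exactly.

<pre>
import math
from itertools import permutations

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def linearize_sigmoid(terms):
    """Average marginal effect of each summand on the sigmoid, over all orderings.

    A hedged sketch of one possible linearization L_sigma; not necessarily the
    authors' exact scheme.
    """
    n = len(terms)
    contribs = np.zeros(n)
    for order in permutations(range(n)):
        running = 0.0
        for k in order:
            contribs[k] += sigmoid(running + terms[k]) - sigmoid(running)
            running += terms[k]
    return contribs / math.factorial(n)

# One coordinate each of W_i x_t, V_i h_{t-1} and b_i (illustrative numbers only).
terms = [0.7, -0.3, 0.1]
L = linearize_sigmoid(terms)

# For every ordering the marginal effects telescope to sigma(sum) - sigma(0),
# so their average does too; assigning sigma(0) to one of the terms (e.g. the
# bias) recovers i_t = sigmoid(sum of the three terms) exactly.
assert np.isclose(L.sum(), sigmoid(sum(terms)) - sigmoid(0.0))
</pre>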
