# DCN plus: Mixed Objective And Deep Residual Coattention for Question Answering

## Introduction

Question answering (QA) is a challenging task in computer science that requires both an understanding of natural language and the ability to reason efficiently. To answer a question accurately, a model must first have a detailed understanding of the context from which the question is asked. Because questions are usually very specific, a shallow understanding of the context leads to poor performance. Moreover, the model should gather all the information provided in the question and match it against its knowledge of the context. Generating the answer is a separate challenge: depending on the dataset the model is designed for, the output may take a completely different form.

QA datasets have improved significantly in recent years. Earlier datasets were simple and usually did not resemble real-world question-answer pairs. For example, the Children's Book Test was a popular QA dataset for a long time, but its actual task was only to fill blanks in given sentences with the appropriate words. The importance of QA and its practical uses have since encouraged many groups to gather and crowdsource more realistic datasets. The Stanford Question Answering Dataset (SQuAD), the Microsoft MAchine Reading COmprehension dataset (MS MARCO), and the Visual Question Answering dataset (VQA) are only a few examples. As a result, many researchers now focus on improving the performance of QA models on these datasets. Deep neural networks have outperformed human accuracy on a few of them, but in many cases there is still a gap between the state of the art and human performance.

Previously, Dynamic Coattention Networks (DCN) proved effective on SQuAD, achieving state-of-the-art performance at the time. This work modifies DCN further and improves its accuracy by proposing a mixed objective that combines cross-entropy loss with self-critical policy learning.

## Overview of previous work

Most current QA models are built from separate modules stacked on top of each other, and improving one module improves the overall performance of the model. Thus, to evaluate an improvement, researchers usually take a previously published model and replace the corresponding module with their improved one. Because QA is an active research area with practical uses, many such modular improvements have been proposed. A typical model consists of the following layers:

- Embedding layer: This layer maps each word (or image, in the case of visual QA) to a vector space. There are many options for the embedding layer. While pre-trained GloVe or Word2Vec embeddings have shown promising results on many tasks, most models use a combination of GloVe and character-level embeddings. Character-level embeddings are especially useful for out-of-vocabulary words. For images, the embeddings are usually generated with pre-trained ResNets. Using different embedding layers for images has been shown to change the overall performance of the model drastically.
- Contextual layer: The purpose of this layer is to add features to each word embedding based on the surrounding words and the context. This layer is present in many models, including the DCN.
- Attention layer: Attention mechanisms have been investigated extensively in recent years. These works, mostly inspired by Bahdanau et al. (2014), either modify the basic matrix-based attention mechanism or develop new ones. The purpose of the attention mechanism is to let the model interpret one input based on information gathered from another. For example, in image-based QA, the attention layer helps the model understand the question based on information provided in the image, such as object classes. This way, the model can determine which parts of the question are more important.
- Output layer: This is the final layer of every model; it generates the answer to the question based on the information provided by all the previous layers.

## DCN+ structure

DCN+ is an improvement on the previous DCN model, and the overall structure of the model is the same as before. The first improvement is to the coattention module: by introducing a deep residual coattention encoder, the output of the attention layer becomes more feature-rich. The second improvement is achieved by mixing the previous cross-entropy loss with reinforcement learning rewards from self-critical policy learning. The decoder module of DCN+ only predicts an answer span from the given context, so it is only applicable to datasets such as SQuAD, where the answer is a span of the context.

### Deep residual coattention encoder

The coattention module in the original DCN was unable to capture complex information from the context and the question. Recent studies have shown that stacked attention mechanisms outperform single-layer attention modules. In DCN+, the coattention layers are stacked so that the model can self-attend to the context and capture more information. The second modification is to use residual connections when merging the coattention outputs from each layer.

Let [math]L^D \in R^{d \times m}[/math] and [math]L^Q \in R^{d \times n}[/math] denote the word embeddings for the context (document) and the question respectively. Here, [math]d, m, n[/math] are the embedding vector size, document word count, and question word count respectively. The model uses a bidirectional LSTM with shared weights as the contextual layer for both inputs. An additional sentinel token is appended to the end of both the document and the question, which allows the coattention to not focus on any particular word of the input; this accounts for the [math]+1[/math] in the dimensions below. [math]E_1^D[/math] and [math]E_1^Q[/math] are the outputs of the encoder (contextual) layer.

\begin{align} E_1^D = \text{BiLSTM}_1(L^D) \in R^{h \times (m+1)} \end{align} \begin{align} E_1^Q = \tanh(W \, \text{BiLSTM}_1(L^Q)) \in R^{h \times (n+1)} \end{align}

Here [math]h[/math] is the hidden size of the LSTM. The affinity matrix [math]A[/math] is computed from the outputs of the encoder; it is the same kind of matrix that attention modules have used since attention was first introduced. Each entry scores the similarity between one document word and one question token. Applying a column-wise softmax to [math]A[/math] normalizes the scores over the document words for each question token, which represents the importance of each context word based on the question. Applying the same softmax to [math]A^T[/math] (a row-wise softmax of [math]A[/math]) normalizes the scores over the question tokens for each document word, which represents the importance of each question token based on the context. Multiplying the encoder outputs by these attention weights produces the question-aware context representation [math]S_1^D[/math] and the context-aware question representation [math]S_1^Q[/math].

\begin{align} A = {(E_1^D)}^T E_1^Q \in R^{(m+1) \times (n+1)} \end{align} \begin{align} {S_1^D} = E_1^Q \, \text{softmax}(A^T) \in R^{h \times (m+1)} \end{align} \begin{align} {S_1^Q} = E_1^D \, \text{softmax}(A) \in R^{h \times (n+1)} \end{align}

To make the question-aware context representation deeper and more feature-rich, the model defines the final context representation (the coattention context) as follows:

\begin{align} {C_1^D} = S_1^Q \, \text{softmax}(A^T) \in R^{h \times m} \end{align}
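The three products above can be checked on a toy example. The following is a minimal sketch in plain Python, assuming that `softmax` normalizes each column and that the sentinel is the last column of each encoding; the helper names (`transpose`, `matmul`, `softmax_columns`, `coattention`) and the toy numbers are illustrative and do not come from the paper.

```python
import math

def transpose(M):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*M)]

def matmul(X, Y):
    """Plain matrix product of two list-of-lists matrices."""
    Yt = transpose(Y)
    return [[sum(a * b for a, b in zip(row, col)) for col in Yt] for row in X]

def softmax_columns(M):
    """Normalize every column of M so that it sums to 1."""
    cols = []
    for col in zip(*M):
        peak = max(col)  # subtract the maximum for numerical stability
        exps = [math.exp(v - peak) for v in col]
        total = sum(exps)
        cols.append([v / total for v in exps])
    return transpose(cols)

def coattention(E_D, E_Q):
    """E_D is h x (m+1), E_Q is h x (n+1); the last column is the sentinel."""
    A = matmul(transpose(E_D), E_Q)                    # (m+1) x (n+1)
    S_D = matmul(E_Q, softmax_columns(transpose(A)))   # h x (m+1)
    S_Q = matmul(E_D, softmax_columns(A))              # h x (n+1)
    C_D = matmul(S_Q, softmax_columns(transpose(A)))   # h x (m+1)
    C_D = [row[:-1] for row in C_D]                    # drop sentinel: h x m
    return A, S_D, S_Q, C_D

# Toy encodings with h = 2, m = 3 document words, n = 2 question words.
E_D = [[0.1, 0.4, -0.2, 0.0],
       [0.3, -0.1, 0.5, 0.0]]
E_Q = [[0.2, -0.3, 0.0],
       [0.1, 0.6, 0.0]]

A, S_D, S_Q, C_D = coattention(E_D, E_Q)
print(len(A), len(A[0]))      # rows and columns of the affinity matrix
print(len(C_D), len(C_D[0]))  # rows and columns of the coattention context
```

Printing the shapes is a quick way to confirm that each product lines up with the dimensions stated in the equations.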

Note that the model drops the dimension corresponding to the sentinel vector. After this stage, the summaries are also encoded, using a second bidirectional LSTM whose weights are shared between the document and the question.

\begin{align} {E_2^D} = \text{BiLSTM}_2(S_1^D) \in R^{2h \times m} \end{align} \begin{align} {E_2^Q} = \text{BiLSTM}_2(S_1^Q) \in R^{2h \times n} \end{align}

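Finally, [math]E_2^D[/math] and [math]E_2^Q[/math] are the encoded outputs of the first coattention layer. Coattention layers can easily be stacked to create a deeper attention mechanism: feeding [math]E_2^D[/math] and [math]E_2^Q[/math] into a second coattention layer produces [math]S_2^D[/math], [math]S_2^Q[/math], and [math]C_2^D[/math] in the same way. The final output of the stacked coattention units is obtained as: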

\begin{align} U = \text{BiLSTM}(\text{concat}(E_1^D;E_2^D;S_1^D;S_2^D;C_1^D;C_2^D)) \in R^{2h \times m} \end{align}

### Mixed objective using self-critical policy learning

DCN produces a distribution over the start and end positions of the answer span. Because the decoder module is dynamic, it estimates separate distributions over the start and end positions at every decoding step. The cross-entropy loss sums over these steps:

\begin{align} l_{ce}(\theta) = - \sum_{t} (log \ p_t^{start}(s|s_{t-1},e_{t-1};\theta) + log \ p_t^{end}(e|s_{t-1},e_{t-1};\theta)) \end{align}

In the above equation, [math]s[/math] and [math]e[/math] denote the start and end positions of the ground truth answer. [math]s_t[/math] and [math]e_t[/math] denote the greedy estimates of the start and end positions at the [math]t[/math]-th decoding time step. Similarly, [math]p_t^{start} \in R^m[/math] and [math]p_t^{end} \in R^m[/math] denote the distributions over the start and end positions respectively.

The problem with the above loss function is that it does not take into account the F1 metric used to evaluate the model. Two metrics are used to measure the accuracy of QA models. The first is exact match, a binary score: if the answer string differs from the ground truth answer by even a single character, the exact match score is zero. The second is the F1 score, which measures the degree of overlap between the predicted answer and the ground truth. For example, suppose the model considers two answer spans in a context, [math]A[/math] and [math]B[/math], neither of which matches the ground truth positions. If A is an exact string match for the ground truth answer but B is not, the cross-entropy loss still penalizes both of them equally. If F1 scores are included in the calculation, however, the loss penalizes B and not A. To deal with this problem, DCN+ uses a self-critical reinforcement learning objective.
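The difference between the two metrics can be made concrete with a small sketch. The code below assumes whitespace tokenization and skips the text normalization (lowercasing, stripping punctuation and articles) that the official SQuAD evaluation applies; the function names and example strings are illustrative only.

```python
from collections import Counter

def exact_match(prediction, truth):
    """Binary score: 1.0 only if the two strings are identical."""
    return float(prediction == truth)

def f1_score(prediction, truth):
    """Token-level F1: harmonic mean of precision and recall of the overlap."""
    pred_tokens = prediction.split()
    true_tokens = truth.split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)

truth = "Denver Broncos"
span_a = "Denver Broncos"      # same words, found at a different position
span_b = "Santa Clara"         # unrelated span
span_c = "the Denver Broncos"  # partial overlap

for span in (span_a, span_b, span_c):
    print(span, exact_match(span, truth), f1_score(span, truth))
```

A position-based cross-entropy loss would treat `span_a` and `span_b` identically if neither sits at the ground truth positions, whereas F1 gives full credit to `span_a`, none to `span_b`, and partial credit (about 0.8) to `span_c`.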

\begin{align} l_{rl}(\theta) = -E_{\hat{\tau} \sim p_\tau} [R(s,e,\hat{s}_T,\hat{e}_T;\theta)] \end{align}

\begin{align} \approx -E_{\hat{\tau} \sim p_\tau} [F_1 (ans(\hat{s}_T, \hat{e}_T), ans(s, e)) - F_1(ans(s_T, e_T), ans(s, e))] \end{align}

Here [math]\hat{s}_t \sim p_t^{start}[/math] and [math]\hat{e}_t \sim p_t^{end}[/math] denote the start and end positions sampled from the estimated distributions at the [math]t[/math]-th decoding step. [math]\hat{\tau}[/math] is a trajectory, that is, the sequence of sampled start and end positions over all [math]T[/math] decoder steps, and [math]R[/math] is the reward function. Previous studies show that using a baseline for the reward reduces the variance of gradient estimates and facilitates convergence. The second term in the above equation is the baseline: DCN+ uses a self-critic, namely the F1 score produced by the current model during greedy inference.
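The reward with its self-critical baseline can be sketched as follows. This is an illustration under several assumptions, not the authors' implementation: it uses a single decoding step instead of a full trajectory, token-level F1 without normalization, and the standard REINFORCE surrogate (negative reward times the log-probability of the sampled span), which the text above does not spell out. All function names and probabilities are hypothetical.

```python
import math
import random
from collections import Counter

def f1_score(pred_tokens, true_tokens):
    """Token-level F1 between two token lists."""
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)

def span(tokens, start, end):
    """Tokens from start to end inclusive (empty if end < start)."""
    return tokens[start:end + 1]

def self_critical_reward(tokens, sampled, greedy, truth):
    """F1 of the sampled span minus F1 of the greedy span (the baseline)."""
    gold = span(tokens, *truth)
    return (f1_score(span(tokens, *sampled), gold)
            - f1_score(span(tokens, *greedy), gold))

def greedy_index(probs):
    """Position with the highest probability."""
    return max(range(len(probs)), key=probs.__getitem__)

def sample_index(probs, rng):
    """Position drawn at random according to probs."""
    return rng.choices(range(len(probs)), weights=probs)[0]

def rl_loss(tokens, p_start, p_end, truth, rng):
    """Single-step REINFORCE surrogate loss (assumed form) and its reward."""
    greedy = (greedy_index(p_start), greedy_index(p_end))
    sampled = (sample_index(p_start, rng), sample_index(p_end, rng))
    reward = self_critical_reward(tokens, sampled, greedy, truth)
    log_prob = math.log(p_start[sampled[0]]) + math.log(p_end[sampled[1]])
    return -reward * log_prob, reward

tokens = "the game was won by the Denver Broncos".split()
p_start = [0.05, 0.05, 0.05, 0.05, 0.05, 0.40, 0.30, 0.05]
p_end = [0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.10, 0.60]
loss, reward = rl_loss(tokens, p_start, p_end, (6, 7), random.Random(0))
print(loss, reward)
```

A sampled span that beats the greedy span gets a positive reward and its probability is pushed up; a sampled span that does worse gets a negative reward. When the sample equals the greedy prediction the reward is zero and the step contributes no gradient.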

## Experiments

To achieve optimal performance, the hyperparameters and training environment were tuned. The documents were tokenized with the reversible tokenizer from Stanford CoreNLP. For word embeddings, pre-trained GloVe vectors (trained on the 840B-token Common Crawl corpus) were used. The optimizer was Adam, and dropout was applied to the word embeddings, zeroing each word embedding with a probability of 0.075.
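Word-level dropout of this kind can be sketched as below. The sketch assumes that a dropped word has its whole embedding vector set to zero, that surviving vectors are left unscaled, and that the operation is applied only during training; the source states only the probability, so these details and the function name `word_dropout` are assumptions.

```python
import random

def word_dropout(embeddings, drop_prob=0.075, rng=None):
    """Zero out entire word vectors independently with probability drop_prob."""
    rng = rng or random.Random()
    return [
        [0.0] * len(vec) if rng.random() < drop_prob else list(vec)
        for vec in embeddings
    ]

# Toy sentence of 2000 identical 2-dimensional word vectors.
sentence = [[1.0, 2.0] for _ in range(2000)]
dropped = word_dropout(sentence, rng=random.Random(0))
zeroed = sum(1 for vec in dropped if vec == [0.0, 0.0])
print(zeroed / len(sentence))  # fraction of zeroed words, close to 0.075
```

Unlike element-wise dropout, this removes a word's embedding as a whole, so the model cannot rely on any single input word being available.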

## Results

At the time of submission, the model achieved state-of-the-art results on SQuAD, outperforming the second-best model on the leaderboard by 2.0% on both exact match and F1. It is worth mentioning that a 5% improvement was also achieved with respect to the original DCN model.

In general, DCN+ achieved a consistent performance improvement in almost every question category.