Hierarchical Question-Image Co-Attention for Visual Question Answering

From statwiki
Revision as of 08:12, 22 November 2017 by S6kalra (talk | contribs)

Paper Summary

Conference
  • NIPS 2016
  • Presented as spotlight oral: Youtube link
  • 85 citations so far
Authors Jiasen Lu, Jianwei Yang, Dhruv Batra, Devi Parikh
Abstract A number of recent works have proposed attention models for Visual Question Answering (VQA) that generate spatial maps highlighting image regions relevant to answering the question. In this paper, we argue that in addition to modeling "where to look" or visual attention, it is equally important to model "what words to listen to" or question attention. We present a novel co-attention model for VQA that jointly reasons about image and question attention. In addition, our model reasons about the question (and consequently the image via the co-attention mechanism) in a hierarchical fashion via a novel 1-dimensional convolution neural networks (CNN). Our model improves the state-of-the-art on the VQA dataset from 60.3% to 60.5%, and from 61.6% to 63.3% on the COCO-QA dataset. By using ResNet, the performance is further improved to 62.1% for VQA and 65.4% for COCO-QA.

Introduction

Visual Question Answering (VQA) is a recent problem in computer vision and natural language processing that has garnered a large amount of interest from the deep learning, computer vision, and natural language processing communities. In VQA, an algorithm needs to answer text-based questions about images in natural language as illustrated in Figure 1.

Figure 1: A VQA system, in which a machine learning algorithm answers, in natural language, a visual question asked by the user about a given image (ref: http://www.visualqa.org/static/img/challenge.png)

Recently, visual-attention based models have gained traction for VQA tasks, where the attention mechanism typically produces a spatial map highlighting image regions relevant for answering the visual question about the image. However, to answer the question correctly, a machine needs to understand or "attend" to not only regions in the image but also parts of the question. In this paper, the authors propose a novel co-attention technique that combines "where to look" (visual attention) with "what words to listen to" (question attention) for VQA, allowing their model to jointly reason about the image and the question and thus improve upon existing state-of-the-art results.

"attention" models

You may skip this section if you already know about "attention" in the context of deep learning. Since this paper talks about "attention" almost everywhere, I decided to include this section to give a very informal and brief introduction to the concept of the "attention" mechanism, specifically visual "attention"; the same idea can be extended to any other type of "attention".

Visual attention in CNNs is inspired by the biological visual system. As humans, we have the ability to focus our cognitive processing on the subset of the environment that is most relevant to the given situation. Imagine you witness a bank robbery where the robbers are trying to escape in a car. As a good citizen, you will immediately focus your attention on the number plate and other physical features of the car and the robbers in order to give your testimony later. Such selective visual attention for a given context can also be implemented on traditional CNNs, making them better suited to certain tasks, and it even helps the algorithm designer visualize which localized features were more important than others.

Role of "visual-attention" in VQA

This section is not part of the actual paper being summarized. However, it gives an overview of how visual attention can be incorporated into the training of a network for VQA tasks, which should help readers understand the ideas proposed in the paper more easily.

To implement attention, a network generally tries to learn the conditional distribution [math]\displaystyle{ P_{i \in [1,n]}(L_i|c) }[/math], which represents the individual importance of the features extracted from each of the [math]\displaystyle{ n }[/math] discrete locations within the image, conditioned on some context vector [math]\displaystyle{ c }[/math]. In other words, given [math]\displaystyle{ n }[/math] features [math]\displaystyle{ L = [L_1, L_2, ..., L_n] }[/math] from [math]\displaystyle{ n }[/math] different regions within the image (top-left, top-middle, top-right, and so on), the "attention" module learns a parametric function [math]\displaystyle{ F(c;\theta) }[/math] that outputs the importance of each individual feature for a given context vector [math]\displaystyle{ c }[/math], that is, a discrete probability distribution of size [math]\displaystyle{ n }[/math]. This can be achieved by applying a [math]\displaystyle{ softmax }[/math] over the [math]\displaystyle{ n }[/math] location scores.
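As a rough illustration of the idea above, the following plain-Python sketch scores each location's feature against a context vector and normalizes the scores with a softmax. The bilinear scoring rule, the function names `softmax` and `attention_weights`, and all numeric values are assumptions made for this example; they are not taken from the paper.

```python
import math

def softmax(scores):
    """Turn arbitrary real scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(features, context, theta):
    """Score each location's feature against the context, then normalize.

    Assumed (hypothetical) scoring rule: score_i = L_i . (theta @ c).
    """
    d = len(features[0])
    projected = [sum(theta[j][k] * context[k] for k in range(len(context)))
                 for j in range(d)]
    scores = [sum(f[j] * projected[j] for j in range(d)) for f in features]
    return softmax(scores)

# n = 3 image locations, each with a 2-dimensional feature vector
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context = [2.0, 0.0]               # stand-in for an encoded question
theta = [[1.0, 0.0], [0.0, 1.0]]   # identity weights, for illustration only

weights = attention_weights(features, context, theta)
```

Here `weights` has one entry per location and sums to 1. Locations whose features align with the context (the first and third) receive more weight than the second.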

To incorporate visual attention into a VQA task, one can define the context vector [math]\displaystyle{ c }[/math] as a representation of the visual question asked by a user (obtained with an RNN, perhaps an LSTM) and generate a localized attention map, which can then be used for end-to-end training as shown in Figure 2. Most existing work in the literature on visual attention for VQA consists of further specializations of this idea.

Figure 2: Different attention maps generated based on the given visual question

Motivation

So far, all attention models for VQA in the literature have focused on the problem of identifying "where to look", or visual attention. In this paper, the authors argue that the problem of identifying "which words to listen to", or question attention, is equally important. Consider the questions "how many horses are in this image?" and "how many horses can you see in this image?". They have the same meaning, essentially captured by the first three words. A machine that attends to the first three words would arguably be more robust to linguistic variations irrelevant to the meaning and answer of the question. Motivated by this observation, in addition to reasoning about visual attention, the authors also address the problem of question attention.

Main Contributions

  • A novel co-attention mechanism for VQA that jointly performs question-guided visual attention and image-guided question attention.
  • A hierarchical architecture to represent the question, and consequently construct image-question co-attention maps at 3 different levels: word level, phrase level and question level.
  • A novel convolution-pooling strategy at phrase-level to adaptively select the phrase sizes whose representations are passed to the question level representation.
  • Results on VQA and COCO-QA and ablation studies to quantify the roles of different components in our model

Method

This section is broken down into three parts: (i) notations used within the paper, (ii) hierarchical representation of the visual question and (iii) the proposed co-attention mechanism.

Notations

Notation Explanation
$Q = \{q_1,...,q_T\}$: One-hot encoding of a visual question with $T$ words. The paper uses three different representations of the visual question, one for each level of the hierarchy:
  1. $Q^w = \{q^w_1,...q^w_T\}$: Word level representation of visual question
  2. $Q^p = \{q^p_1,...q^p_T\}$: Phrase level representation of visual question
  3. $Q^s = \{q^s_1,...q^s_T\}$: Question level representation of visual question

Each of $Q^w$, $Q^p$ and $Q^s$ contains exactly $T$ embeddings, regardless of its position in the hierarchy, i.e. word, phrase or question.

$V = \{v_1,...,v_N\}$: $V$ represents the feature vectors from $N$ different locations within the given image. Therefore, $v_n$ is the feature vector from the image at location $n$. One can extract these location-sensitive features from a convolution layer of a CNN.
$\hat{v}^r$ and $\hat{q}^r$: The co-attention features of the image and question at each level of the hierarchy, where $r \in \{w,p,s\}$. Each is an attention-weighted sum: the vectors in $Q^r$ or $V$ are multiplied by the attention weights $a^q$ or $a^v$ and summed, at each level of the hierarchy.

For example, at the word level, $a^q$ and $a^v$ give the importance of each word in the visual question and each location within the image respectively, whereas $\hat{q}^w$ and $\hat{v}^w$ are the final feature vectors representing the question and image with the attention maps applied at the word level.

Note: Throughout the paper, $W$ denotes learnable weights. Bias terms are omitted from the equations for simplicity (the reader should assume they exist).
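To make the attended features $\hat{q}^r$ and $\hat{v}^r$ from the notation above concrete, here is a minimal sketch of the attention-weighted sum at the word level. The helper name `attended_feature` and all numeric values are made up for illustration; in the model, the attention weights $a^q$ and $a^v$ would come from the co-attention mechanism rather than being fixed by hand.

```python
def attended_feature(weights, vectors):
    """Return sum_i weights[i] * vectors[i] as a single vector."""
    dim = len(vectors[0])
    return [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(dim)]

# Word level example with T = 3 words and N = 2 image locations.
Q_w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # q^w_1 .. q^w_3 (toy values)
V = [[2.0, 0.0], [0.0, 4.0]]                 # v_1, v_2 (toy values)
a_q = [0.5, 0.25, 0.25]                      # assumed attention over words
a_v = [0.75, 0.25]                           # assumed attention over locations

q_hat_w = attended_feature(a_q, Q_w)   # [0.75, 0.5]
v_hat_w = attended_feature(a_v, V)     # [1.5, 1.0]
```

Both results are single vectors with the same dimension as the inputs, regardless of $T$ or $N$.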

Question Hierarchy

There are three levels of hierarchy to represent a visual question: (i) word, (ii) phrase and (iii) question level, as discussed in the following sub-sections. It is important to note that each level of the hierarchy depends on the previous one: phrase level representations are extracted from the word level, and the question level representation comes from the phrase level.

Figure 3: Hierarchical question encoding (source: Figure 3 (a) of original paper on page #5)
Figure 4: Another figure illustrating hierarchical question encoding in detail

Word Level

The one-hot encodings of the question's words $Q = \{q_1,...,q_T\}$ are transformed into a vector space, which gives the word level embedding of the visual question, i.e. $Q^w = \{q^w_1,...q^w_T\}$. This transformation is learned end-to-end rather than taken from a pretrained model such as word2vec.
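The word level step can be sketched as a multiplication of each one-hot vector by an embedding matrix, which amounts to selecting one row of that matrix. The toy vocabulary, the embedding size, the random initialization and the names `E`, `one_hot` and `embed` are assumptions for this example. In the model the matrix would be trained with the rest of the network.

```python
import random

vocab = ["how", "many", "horses", "are", "in", "this", "image"]
word_to_index = {w: i for i, w in enumerate(vocab)}
embed_dim = 4  # assumed embedding size, chosen only for the example

random.seed(0)
# Embedding matrix with one row per vocabulary word (randomly initialized here).
E = [[random.uniform(-0.1, 0.1) for _ in range(embed_dim)] for _ in vocab]

def one_hot(index, size):
    """One-hot row vector with a 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def embed(q_t, E):
    """Multiply a one-hot row vector by E, which picks out one row of E."""
    return [sum(q_t[i] * E[i][j] for i in range(len(E)))
            for j in range(len(E[0]))]

question = ["how", "many", "horses"]
Q = [one_hot(word_to_index[w], len(vocab)) for w in question]   # T one-hot vectors
Q_w = [embed(q_t, E) for q_t in Q]                              # T word embeddings
```

In practice a framework would index the row directly instead of multiplying, but the result is the same.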

Phrase Level

Phrase level embedding vectors are calculated by using 1-D convolutions on the word level embedding vectors. Concretely, at each word location, we compute the inner product of the word vectors with filters of three window sizes: unigram, bigram and trigram as illustrated by Figure 4. For the $t^{th}$ word, the convolution output with window size $s$ is given by

$$ \hat{q}^p_{s,t} = \tanh(W_c^s q^w_{t:t+s-1}), \quad s \in \{1,2,3\} $$

where $W_c^s$ denotes the weight parameters for window size $s$. These different n-gram features are combined using a $maxpool$ operator across the three window sizes at each word location to obtain the phrase-level features.

$$ q_t^p = \max(\hat{q}^p_{1,t}, \hat{q}^p_{2,t}, \hat{q}^p_{3,t}), \quad t \in \{1,2,...,T\} $$
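The two phrase level equations can be sketched in plain Python as follows. The zero-padding on the right (so that every position $t$ has a full window), the filter values, and the function names `conv_ngram` and `phrase_features` are assumptions made for this example, and the max is taken element-wise over the three n-gram outputs.

```python
import math

def conv_ngram(Q_w, W, s):
    """Compute tanh(W @ concat(q_t, ..., q_{t+s-1})) at every word position t."""
    T, d = len(Q_w), len(Q_w[0])
    padded = Q_w + [[0.0] * d for _ in range(s - 1)]  # assumed right zero-padding
    out = []
    for t in range(T):
        window = [x for vec in padded[t:t + s] for x in vec]  # length s * d
        out.append([math.tanh(sum(w * x for w, x in zip(row, window)))
                    for row in W])
    return out

def phrase_features(Q_w, filters):
    """Element-wise max over the unigram, bigram and trigram outputs."""
    per_size = [conv_ngram(Q_w, filters[s], s) for s in (1, 2, 3)]
    T, d = len(Q_w), len(filters[1])
    return [[max(per_size[k][t][j] for k in range(3)) for j in range(d)]
            for t in range(T)]

Q_w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]  # T = 4 words, d = 2
# Hypothetical filter weights W_c^s, each of shape d x (s * d)
filters = {
    1: [[0.1, 0.0], [0.0, 0.1]],
    2: [[0.1, 0.0, 0.2, 0.0], [0.0, 0.1, 0.0, 0.2]],
    3: [[0.1, 0.0, 0.2, 0.0, 0.3, 0.0], [0.0, 0.1, 0.0, 0.2, 0.0, 0.3]],
}
Q_p = phrase_features(Q_w, filters)
```

The output `Q_p` keeps the same length $T$ as the input, which matches the statement that every level of the hierarchy has exactly $T$ embeddings.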