Hierarchical Question-Image Co-Attention for Visual Question Answering

From statwiki
Revision as of 03:36, 21 November 2017 by S6kalra (talk | contribs)

Introduction

Visual Question Answering (VQA) is a recent problem at the intersection of computer vision and natural language processing that has attracted considerable interest from the deep learning, computer vision, and natural language processing communities. In VQA, an algorithm must answer natural-language questions about images, as illustrated in Figure 1.

Figure 1: A VQA system, in which a machine learning algorithm gives a natural-language answer to a question asked by a user about a given image (ref: http://www.visualqa.org/static/img/challenge.png)

Recently, visual-attention-based models have gained traction for the VQA task. In these models, the attention mechanism typically produces a spatial map highlighting the image regions relevant to answering the question. However, to answer correctly, the machine needs to understand or "attend" to not only regions of the image but also parts of the question. In this paper, the authors describe a novel co-attention technique that combines "where to look" (visual attention) with "what words to listen to" (question attention). The proposed co-attention mechanism allows the model to reason jointly about the image and the question, thus improving upon the state-of-the-art results.

What are "attention" models?

Feel free to skip this section if you already know about "attention" in the context of deep learning. Since this paper refers to "attention" almost everywhere, I have included this section to give a brief, informal introduction to the concept of an "attention" mechanism, especially visual "attention"; the idea extends to other types of "attention" as well.

Visual attention in CNNs is inspired by the biological visual system. As humans, we have the ability to focus our cognitive processing on the subset of the environment that is most relevant to the given situation. Imagine you witness a bank robbery in which the robbers are trying to escape in a car; as a good citizen, you will immediately focus your attention on the number plate and other physical features of the car and the robbers in order to give your testimony later. Such selective, context-dependent visual attention can also be implemented on traditional CNNs, making them better suited to VQA tasks.

Generally speaking, the most common and easiest-to-implement form of "attention" mechanism is called "soft attention". In soft attention, the network tries to learn the conditional distribution [math]\displaystyle{ P(L_i \mid c),\ i \in [1,n] }[/math], which represents the importance of the features extracted from each of the [math]\displaystyle{ n }[/math] discrete locations within the image, conditioned on some given context vector [math]\displaystyle{ c }[/math].
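
To make the idea concrete, here is a minimal NumPy sketch of soft attention over [math]\displaystyle{ n }[/math] image locations. It is only an illustration under stated assumptions: the function name soft_attention, the dot-product scoring between each location's features and the context vector, and the toy shapes are all hypothetical choices made here, not details taken from the paper. In a trained model the scores would typically come from learned parameters.

```python
import numpy as np

def soft_attention(features, context):
    """Toy soft attention (illustrative assumption, not the paper's model).

    features: (n, d) array, one d-dimensional feature vector per location.
    context:  (d,) context vector c.
    Returns the attention weights (n,) and the attended feature vector (d,).
    """
    # Assumed scoring function: dot product between each location and c.
    scores = features @ context
    # Softmax turns scores into a distribution over the n locations;
    # subtracting the max keeps the exponentials numerically stable.
    scores = scores - scores.max()
    weights = np.exp(scores)
    weights = weights / weights.sum()
    # Attended feature: importance-weighted average of location features.
    attended = weights @ features
    return weights, attended

# Toy data: n = 4 locations, d = 3 feature dimensions.
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 3))
context = rng.normal(size=3)

weights, attended = soft_attention(features, context)
print("attention weights:", weights)
print("attended feature:", attended)
```

The weights are non-negative and sum to one, so they can be read as the distribution over locations described above. Because every step is differentiable, a mechanism of this kind can be trained end to end with ordinary backpropagation.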