Hierarchical Question-Image Co-Attention for Visual Question Answering

== Introduction ==

Visual Question Answering (VQA) is a recent problem at the intersection of computer vision and natural language processing that has garnered a large amount of interest from the deep learning community. In VQA, an algorithm needs to answer text-based questions about images in natural language, as illustrated in Figure 1.

Figure 1: Illustration of a VQA system, in which a machine learning algorithm produces a natural-language answer to a visual question asked by the user about a given image (ref: http://www.visualqa.org/static/img/challenge.png)

Recently, visual-attention based models have gained traction for VQA tasks, where the attention mechanism typically produces a spatial map highlighting image regions relevant to answering the visual question about the image. However, to answer the question correctly, the machine needs to attend not only to regions in the image but also to parts of the question. In this paper, the authors propose a novel co-attention technique that combines "where to look", or visual attention, with "what words to listen to", or question attention, allowing their model to jointly reason about the image and the question and thereby improve upon existing state-of-the-art results.
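To give a flavour of what such joint reasoning over image regions and question words can look like in code, below is a rough PyTorch sketch of a parallel-style co-attention module that builds an affinity matrix between question words and image regions and derives an attention distribution over each. The class name, layer names, and dimensions are illustrative assumptions for this summary, not the authors' exact formulation.

<pre>
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelCoAttention(nn.Module):
    """Rough sketch of parallel-style co-attention: an affinity matrix between
    question words and image regions yields an attention distribution over
    regions and over words. Names and sizes are illustrative assumptions."""

    def __init__(self, dim, k=256):
        super().__init__()
        # assumes image and question features share a common dimension `dim`
        self.W_b = nn.Linear(dim, dim, bias=False)   # affinity transform
        self.W_v = nn.Linear(dim, k, bias=False)     # image projection
        self.W_q = nn.Linear(dim, k, bias=False)     # question projection
        self.w_hv = nn.Linear(k, 1, bias=False)      # image attention scorer
        self.w_hq = nn.Linear(k, 1, bias=False)      # question attention scorer

    def forward(self, V, Q):
        # V: (batch, n_regions, dim) image features; Q: (batch, n_words, dim) word features
        C = torch.tanh(Q @ self.W_b(V).transpose(1, 2))          # (batch, n_words, n_regions) affinity
        H_v = torch.tanh(self.W_v(V) + C.transpose(1, 2) @ self.W_q(Q))
        H_q = torch.tanh(self.W_q(Q) + C @ self.W_v(V))
        a_v = F.softmax(self.w_hv(H_v).squeeze(-1), dim=-1)      # attention over image regions
        a_q = F.softmax(self.w_hq(H_q).squeeze(-1), dim=-1)      # attention over question words
        v_hat = (a_v.unsqueeze(-1) * V).sum(dim=1)               # attended image feature
        q_hat = (a_q.unsqueeze(-1) * Q).sum(dim=1)               # attended question feature
        return v_hat, q_hat, a_v, a_q
</pre>

Here the affinity matrix scores every (word, region) pair, and the two softmaxes give the question attention and the image attention, which is the intuition behind attending to both modalities at once.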

"attention" models

Please feel free to skip this section if you are already familiar with "attention" in the context of deep learning. Since this paper talks about "attention" almost everywhere, this section gives a very informal and brief introduction to the concept of the "attention" mechanism, especially visual "attention"; however, the idea can be extended to any other type of "attention".

Visual attention in CNNs is inspired by the biological visual system. As humans, we have the ability to focus our cognitive processing onto the subset of the environment that is most relevant for the given situation. Imagine you witness a bank robbery where the robbers are trying to escape in a car; as a good citizen, you will immediately focus your attention on the number plate and other physical features of the car and the robbers in order to give your testimony later. Such selective visual attention for a given context can also be implemented on top of traditional CNNs, making them better suited for certain tasks, and it even helps the algorithm designer visualize which localized features were more important than others.

Role of "visual-attention" in VQA

This section is not a part of the actual paper being summarized; however, it gives an overview of how visual attention can be incorporated into training a network for VQA tasks, helping readers absorb and understand the ideas proposed in the paper more easily.

Generally speaking, the most common and easiest to implement form of the "attention" mechanism is called "soft attention". In soft attention, the network tries to learn a conditional distribution <math>P_{i \in [1,n]}(L_i|c)</math> representing the individual importance of each of the features extracted from the <math>n</math> discrete locations within the image, conditioned on some context vector <math>c</math>. In other words, given <math>n</math> features <math>L = [L_1, L_2, ..., L_n]</math> from <math>n</math> different regions within the image (top-left, top-middle, top-right, and so on), the "attention" module learns a parametric function <math>F(c;\theta)</math> that outputs the importance of each individual feature for a given context vector <math>c</math>, i.e., a discrete probability distribution of size <math>n</math> (a <math>softmax</math> over the <math>n</math> locations).
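As a minimal illustration of soft attention, the following PyTorch sketch scores <math>n</math> region features against a context vector <math>c</math> and returns the softmax distribution <math>P(L_i|c)</math> together with the attention-weighted feature. The module name, layer sizes, and scoring function are assumptions made for this example, not taken from the paper.

<pre>
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    """Minimal soft-attention sketch: scores n region features against a
    context vector c and returns a softmax distribution over the n regions
    together with the attention-weighted feature."""

    def __init__(self, feature_dim, context_dim, hidden_dim=256):
        super().__init__()
        # F(c; theta): a small MLP that scores each region given the context
        self.proj_feat = nn.Linear(feature_dim, hidden_dim)
        self.proj_ctx = nn.Linear(context_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, regions, context):
        # regions: (batch, n, feature_dim) -- features L_1 .. L_n from n image sub-regions
        # context: (batch, context_dim)    -- context vector c (e.g. a question encoding)
        h = torch.tanh(self.proj_feat(regions) + self.proj_ctx(context).unsqueeze(1))
        scores = self.score(h).squeeze(-1)                    # (batch, n) unnormalized importances
        attn = F.softmax(scores, dim=-1)                      # discrete distribution P(L_i | c)
        attended = (attn.unsqueeze(-1) * regions).sum(dim=1)  # attention-weighted feature
        return attn, attended
</pre>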

To incorporate visual attention into the VQA task, one can define the context vector <math>c</math> as a representation of the visual question asked by a user (obtained with an RNN, for example an LSTM) and generate a localized attention map, which can then be used for end-to-end training.
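Putting the pieces together, a toy VQA model along these lines (again a hedged sketch under assumed names and sizes, not the paper's architecture) might encode the question with an LSTM to obtain <math>c</math>, attend over CNN region features conditioned on <math>c</math>, and feed the attended feature to an answer classifier, with everything trained end-to-end:

<pre>
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVQAWithAttention(nn.Module):
    """Toy sketch (not the paper's architecture): an LSTM encodes the question
    into a context vector c, a dot-product soft attention pools CNN region
    features conditioned on c, and a linear classifier predicts the answer."""

    def __init__(self, vocab_size, num_answers, feature_dim=512,
                 embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(feature_dim, hidden_dim)   # project regions into the question space
        self.classifier = nn.Linear(feature_dim + hidden_dim, num_answers)

    def forward(self, question_tokens, region_features):
        # question_tokens: (batch, seq_len) word indices of the question
        # region_features: (batch, n, feature_dim) CNN features of n image sub-regions
        _, (h, _) = self.lstm(self.embed(question_tokens))
        c = h[-1]                                                         # context vector c
        scores = (self.proj(region_features) * c.unsqueeze(1)).sum(-1)    # (batch, n) importances
        attn_map = F.softmax(scores, dim=-1)                              # localized attention map P(L_i | c)
        attended = (attn_map.unsqueeze(-1) * region_features).sum(dim=1)  # attended image feature
        logits = self.classifier(torch.cat([attended, c], dim=-1))
        return logits, attn_map


# Example usage with random inputs (shapes are illustrative):
model = ToyVQAWithAttention(vocab_size=10000, num_answers=1000)
q = torch.randint(0, 10000, (8, 12))     # a batch of 8 questions, 12 tokens each
v = torch.randn(8, 14 * 14, 512)         # a 14x14 grid of 512-d CNN region features
logits, attn_map = model(q, v)           # train end-to-end with cross-entropy on logits
</pre>

Because the attention map is produced inside the network, it can be trained jointly with the rest of the model and later visualized to see which image regions mattered for a given question.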