Hierarchical Question-Image Co-Attention for Visual Question Answering
[[File:vqa-overview.png|thumb|800px|center|Figure 1: A VQA system, in which a machine learning algorithm takes an image and a text-based question about the image as input and outputs the answer to the question in natural language (ref: http://www.visualqa.org/static/img/challenge.png)]]
Revision as of 02:00, 21 November 2017
Introduction
Visual Question Answering (VQA) is a recent problem in computer vision and natural language processing that has garnered a large amount of interest from the deep learning, computer vision, and natural language processing communities. In VQA, an algorithm needs to answer text-based questions about images in natural language as illustrated in Figure 1.
Recently, visual-attention based models have gained traction for the VQA task. In these models, the attention mechanism typically produces a spatial map highlighting the image regions relevant to answering the question about the image. However, to answer the question correctly, the machine needs to understand, or "attend" to, not only regions in the image but also parts of the question. In this paper, the authors describe a novel co-attention technique that combines "where to look" (visual-attention) with "what words to listen to" (question-attention). The proposed co-attention mechanism for VQA allows the model to reason jointly about the image and the question, thus improving upon the state-of-the-art results.
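To make "where to look" and "what words to listen to" concrete, here is a minimal toy sketch in plain Python. It is not the paper's formulation: the feature vectors, the dot-product scoring, and the helper names (`softmax`, `dot`, `attend`) are all assumptions chosen for illustration. It shows only the general idea that attention turns relevance scores into weights summing to one, and that those weights can be computed over question words as well as over image regions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def attend(query, items):
    """Score each item against the query, normalize the scores with
    softmax, and return the weights plus the weighted sum of items."""
    weights = softmax([dot(query, item) for item in items])
    dim = len(items[0])
    summary = [sum(w * item[d] for w, item in zip(weights, items))
               for d in range(dim)]
    return weights, summary

# Hypothetical toy features (assumed, not from the paper):
# three image regions and two question words, each a 2-D vector.
regions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
words = [[2.0, 0.0], [0.0, 0.1]]

# Question-attention ("what words to listen to"): the mean image
# feature serves as the query over the question words.
image_mean = [sum(r[d] for r in regions) / len(regions) for d in range(2)]
word_weights, question_summary = attend(image_mean, words)

# Visual-attention ("where to look"): the attended question summary
# serves as the query over the image regions.
region_weights, image_summary = attend(question_summary, regions)
```

In this sketch `region_weights` plays the role of the spatial attention map over regions, and `word_weights` is its counterpart over the question. Using each modality to guide attention over the other is the intuition behind co-attention; a real model would learn the features and the scoring function.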