Hierarchical Question-Image Co-Attention for Visual Question Answering
Introduction
Visual Question Answering (VQA) is a recent problem at the intersection of computer vision and natural language processing that has attracted considerable interest from both fields and from the deep learning community. In VQA, an algorithm must answer text-based questions about images, giving its answers in natural language, as illustrated in Figure <xr="fig:vqa-overview"/>.
<figure id="fig:vqa-overview">
![](/statwiki/images/thumb/1/1e/vqa-overview.png/800px-vqa-overview.png)
</figure>