stat441w18/Image Question Answering using CNN with Dynamic Parameter Prediction

From statwiki
Revision as of 01:55, 15 March 2018 by Y53zou (talk | contribs) (.)
Jump to navigation Jump to search

Image Question Answering using CNN with Dynamic Parameter Prediction

Presented by

Rosie Zou, Kye Wei, Glen Chalatov, Ameer Dharamshi

Introduction

Problem Setup (in words)

Historically, computer vision has been an incredibly challenging, yet equally important task for researchers. The ultimate goal of computer vision is to develop a system that can achieve a holistic understanding of an image to extract meaningful insights. After many years with limited progress towards this goal, the field had fallen out of the spotlight. However, the past decade has seen a resurgence of interest in computer vision research due to breakthroughs in deep learning and advancements in computational capabilities. Recent advances in the recognition capabilities of such systems has paved the way for further advances towards our ultimate objective.

Image question answering (Image Q&A) is the next stage. Image Q&A is the task of asking a system various questions about an image. This extends beyond simple recognition and requires detecting many objects within an image as well as finding some level of context in order to answer the question. The core challenge of Image Q&A is that the set of objects to detect and the level of understanding required is dependent on the question being asked. If the question was static, the model could be optimized based on the question. A broader Image Q&A model must be flexible in order to adapt to different questions.

From a model perspective, the question being posed must be incorporated into the model and processed at some point. Previous methods tend to extract image features from the image and descriptors from the question and present the intersection of these two sets as the answer. Previous models incorporate the question directly as an independent variable when determining the answer. This is written as:

Where ‘ahat’ is the best predicted answer, ‘omega’ is the set of all possible answers ‘a’, ‘I’ is the image, ‘q’ is the question, and ‘theta’ is the model parameter vector.

The authors of this paper propose using the question to set certain model parameters and thus use the question directly when extracting information from the picture. This is written as

Where ‘ahat’, ‘a’, ‘I’ are the same as above but ‘thetas’ is the vector of static parameters and ‘thetad(q)’ is the vector of dynamic parameters determined by the question.

This approach hopes to adapt the model to the specifics of the question. Intuitively, it is determining the answer based on the question instead of seeking the common features of both the question and the answer.

Previous and Related Works

As mentioned in the earlier section, one of the major goals in computer vision is to achieve holistic understanding. While relatively new interest in the computer vision community, Image Question Answering already has a growing number of researchers working on this problem. There has been many past and recent efforts on this front, for instance this non-exhaustive list of papers published between 2015 and 2016 (NIPS 2015 paper, ICCV 2015 paper, AAAI 2016 paper). One key commonality in these papers is that most, if not all, of the recognition problems are defined in a simple, controlled environment with a finite set of objectives. While the question-handling strategies differ from paper to paper, a general problem-solving strategy in these papers is to use CNNs for feature extraction from images prior to handling question sentences.

In contrast, there has been less efforts in solving various recognition problems simultaneously, which is what researchers in Image Question Answering is trying to achieve. As mentioned previously, other than one paper (NIPS 2014 paper) which utilizes a Bayesian framework, the majority of the papers listed above generally propose an overall deep learning network structure, which performs very well on public bench marks but tends to fall apart when question complexity increases. The reason lies fundamentally in the complexity of the English language:


CNN + bag-of-words -- challenges from language representation

This strategy first uses CNNs to process the given image and extracts an array of image features along with their respective probabilities. Then, it uses semantic image segmentation and symbolic question reasoning to process the given questions. This would extract another set of features from the given question. After the question sentence is tokenized and processed, it then finds the intersection between the features present in the image and the features present in the question. This approach does not scale with complexity of the image or question, since neither one of them has a finite representation.

To further explain the issue with generalization, we need to discuss more about the complexity of natural languages. Renowned linguist and philosopher Noam Chomsky introduced the Chomsky Hierarchy of Grammars in 1956, which is a hierarchical venn diagram of formal grammars. In Chomsky’s hierarchy, languages are viewed as sets of strings ordered by their complexity, where the higher the language is in the hierarchy, the more complex it is. Such languages have higher generative and expressive power than languages in the lower end of the hierarchy. Additionally, there exists an inverse relationship between the level of restriction of phrase-structure rules and a language’s expressive powers: languages with more restricted phrase-structure rules have lower expressive powers.

Context-free languages and regular languages, which are in the lowest two levels in the hierarchy, can be expressed and coded using pushdown automaton and deterministic finite automaton. Visually, these two classes of languages can be expressed using parse trees and state diagrams. This means that given grammar and state transition rules, we can express all strings generated by those rules if the language is context-free or regular.

In contrast, English is not a regular language, thus cannot be represented using finite state automaton. While some argue that English is a context-free language and that grammatically correct sentences can be generated using parse trees, those sentences can easily be nonsensical due to the other important concern that is context. For instance, “ship breathes wallet” is a grammatically correct sentences but it has no practical meaning. Hence with our current knowledge in formal language theory, English cannot be represented using a finite set of rules and grammars.


Problem Setup (in mathematical terms)

Mathematical Background

CNNs

RNNs and GRUs

Model

VGGNet

Parameter Prediction Network

Hashing

Model Summary

Training and Results

Training

Error Reduction

Pre-trained GRUs

Fine-tuning

Results

Critique