Difference between revisions of "STAT946F17/ Teaching Machines to Describe Images via Natural Language Feedback"

From statwiki

Revision as of 01:10, 2 November 2017


In the era of artificial intelligence, one should ideally be able to teach a robot about its mistakes without needing to dig into the underlying software. Reinforcement learning (RL) has become a standard way of training artificial agents that interact with an environment. Several works have explored incorporating humans into the learning process in order to help the RL agent learn faster. In most cases, the guidance comes in the form of a simple numerical (or “good”/“bad”) reward. In this work, natural language is used to guide an RL agent. The author argues that a sentence provides a much stronger learning signal than a numeric reward, because it can point to where the mistakes occur and suggest how to correct them.

Here the goal is to allow a non-expert human teacher to give feedback to an RL agent in the form of natural language, just as one would to a learning child. The author focuses on image captioning, a task in which the quality of the output can easily be judged by non-experts.

Related Works

Several works incorporate human feedback to help an RL agent learn faster.

  1. Thomaz et al. [2006] exploit humans in the loop to teach an agent to cook in a virtual kitchen. The users watch the agent learn and may intervene at any time to give a scalar reward. Reward shaping (Ng et al. [1999]) is used to incorporate this information into the Markov Decision Process (MDP).
  2. Judah et al. [2010] iterate between a “practice” session, during which the agent interacts with the real environment, and a critique session, in which a human labels any subset of the chosen actions as good or bad.
  3. Griffith et al. [2013] propose policy shaping, which incorporates right/wrong feedback by using it as direct policy labels.

The approaches above mostly assume that humans provide a numeric reward. A few attempts have been made to advise an RL agent using language.

  1. Maclin et al. [1994] translate advice into a short program, which is then implemented as a neural network. The units in this network represent Boolean concepts that recognize whether the observed state satisfies the constraints given by the program. When it does, the advice network encourages the policy to take the suggested action.
  2. Weston et al. [2016] incorporate human feedback to improve a text-based question answering agent.
  3. Kaplan et al. [2017] exploit textual advice to reduce the training time of the A3C algorithm when playing an Atari game.

The Phrase-based Image Captioning Model is similar to most image captioning models except that it exploits attention and linguistic information. Several recent approaches trained the captioning model with policy gradients in order to directly optimize for the desired performance metrics. This work follows the same line.

There are also similar efforts on dialogue-based visual representation learning and conversation modeling. These models aim to mimic human-to-human conversations, whereas in this work the human converses with and guides an artificial learning agent.


The framework consists of a new phrase-based captioning model, trained with policy gradients, that incorporates natural language feedback provided by a human teacher. The phrase-based structure of the captioning model allows natural guidance by a non-expert.

Phrase-based Image Captioning

The captioning model uses a hierarchical Recurrent Neural Network. The model is a two-level LSTM: a phrase RNN at the top level, and a word RNN that generates a sequence of words for each phrase. One can think of the phrase RNN as providing a “topic” at each time step, which tells the word RNN what to talk about. The structure of the model is shown in the following figure.

[Figure: Model hamid.jpg (structure of the phrase-based captioning model)]

A convolutional neural network is used to extract a set of feature vectors $a = (a_1, \dots, a_n)$, where $a_j$ is the feature at location $j$ of the input image. These feature vectors are given to the attention layer. The attention layer has two more inputs: the current hidden state of the phrase-RNN and the output of the label unit. The label unit predicts one of four possible phrase labels, namely noun phrase (NP), preposition phrase (PP), verb phrase (VP), or conjunction phrase (CP), or an additional <EOS> token that indicates the end of the sentence. This information can be useful for the attention layer. For example, for an NP the model may look at objects in the image, while for a VP it may focus on more global information. The computations can be expressed with the following equations:

$$ \begin{aligned} h_t &= f_{phrase}(h_{t-1}, l_{t-1}, c_{t-1}, e_{t-1}) && \small{\text{(hidden state of the phrase-RNN at time step } t \text{)}} \\ l_t &= \mathrm{softmax}(f_{phrase-label}(h_t)) && \small{\text{(output of the label unit)}} \\ c_t &= f_{att}(h_t, l_t, a) && \small{\text{(output of the attention layer)}} \end{aligned} $$
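As a rough illustration of the label and attention equations, the sketch below computes a softmax over label scores and a soft-attention context vector in plain Python. The form of $f_{att}$ is not given here, so the dot-product scoring is an assumption, as are the names `softmax` and `attend` and the toy numbers. The `query` argument stands in for a vector that would be derived from $h_t$ and $l_t$.

```python
import math

# Phrase labels predicted by the label unit, plus the end-of-sentence token.
PHRASE_LABELS = ["NP", "PP", "VP", "CP", "<EOS>"]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, features):
    """Soft attention over image features a_1..a_n.

    ASSUMPTION: dot-product scoring; `query` is a stand-in for a vector
    already computed from h_t and l_t. Returns (context c_t, weights).
    """
    scores = [sum(q * x for q, x in zip(query, a_j)) for a_j in features]
    weights = softmax(scores)
    dim = len(features[0])
    context = [
        sum(w * a_j[d] for w, a_j in zip(weights, features))
        for d in range(dim)
    ]
    return context, weights


# Toy example (made-up numbers): label scores and three 2-d image features.
label_probs = softmax([2.0, 0.5, 0.1, -1.0, -3.0])
predicted_label = PHRASE_LABELS[label_probs.index(max(label_probs))]

features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context, weights = attend([1.0, 0.0], features)
```

In this toy setting the first and third locations receive equal weight, since both have the same dot product with the query, and the context vector is the corresponding weighted average of the features.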

After the phrase-RNN has made its decisions, its outputs are passed to another LSTM, the word-RNN, which produces the words of each phrase. Here $w_{t,i}$ denotes the i-th word output by the word-RNN in the t-th phrase, and $h_{t,i}$ denotes the i-th hidden state of the word-RNN for the t-th phrase. The word-RNN’s vocabulary contains an additional <EOP> token, which signals the end of the phrase.

$$ h_{t,i} = f_{word}(h_{t,i-1}, c_t, w_{t,i}) \\ w_{t,i} = f_{out}(h_{t,i}, c_t, w_{t,i-1}) \\ e_t = f_{word-phrase}(w_{t,1}, \dots ,w_{t,n}) $$

Note that $e_t$ encodes the generated phrase via simple mean-pooling over the words, which provides additional word-level context to the next phrase.
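The two-level generation process can be sketched as a nested loop: the outer loop asks the phrase level for a label and context until <EOS> is produced, and the inner loop asks the word level for words until <EOP> is produced. This is only a control-flow sketch under stated assumptions: `phrase_step`, `word_step`, and `embed` are hypothetical stand-ins for trained networks, the phrase-level `state` bundles $h_t$, $l_t$, and $c_t$ together, and the toy functions simply replay a fixed script.

```python
EOS = "<EOS>"  # end-of-sentence label from the label unit
EOP = "<EOP>"  # end-of-phrase token in the word-RNN vocabulary


def mean_pool(vectors, dim):
    """e_t: element-wise mean of word vectors (zeros if the phrase is empty)."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def generate_caption(phrase_step, word_step, embed, dim,
                     max_phrases=10, max_words=10):
    """Hierarchical decoding loop (hypothetical interface).

    phrase_step(state, e_prev) -> (state, label, context)
    word_step(word_state, context, prev_word) -> (word_state, word)
    embed(word) -> list of floats of length dim
    """
    state, e_prev = None, [0.0] * dim
    caption = []
    for _ in range(max_phrases):
        state, label, context = phrase_step(state, e_prev)
        if label == EOS:
            break
        word_state, prev_word, words = None, None, []
        for _ in range(max_words):
            word_state, word = word_step(word_state, context, prev_word)
            if word == EOP:
                break
            words.append(word)
            prev_word = word
        caption.append((label, words))
        # Mean-pooled phrase encoding feeds the next phrase-level step.
        e_prev = mean_pool([embed(w) for w in words], dim)
    return caption


# Toy stand-ins (made up for illustration) that replay a fixed script.
script = [("NP", ["a", "cat"]), ("VP", ["sits"]), (EOS, [])]


def toy_phrase_step(state, e_prev):
    t = 0 if state is None else state + 1
    return t, script[t][0], t  # the "context" is just the phrase index


def toy_word_step(word_state, context, prev_word):
    i = 0 if word_state is None else word_state + 1
    words = script[context][1]
    return i, (words[i] if i < len(words) else EOP)


def toy_embed(word):
    return [float(len(word)), 1.0]


caption = generate_caption(toy_phrase_step, toy_word_step, toy_embed, dim=2)
```

With these toy stand-ins, `caption` is a list of (label, words) pairs, one per generated phrase, which mirrors how feedback can later be tied to individual phrases.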

Crowd-sourcing Human Feedback

The author has created a web interface for collecting feedback. Two rounds of annotation are designed. In the first round, the annotator is shown a captioned image and is asked to assess the quality of the caption by choosing one of: perfect, acceptable, grammar mistakes only, minor errors, or major errors. Annotators were asked to choose minor or major errors if the caption contained errors in semantics. They were advised to choose minor errors for small mistakes such as wrong or missing attributes or awkward prepositions, and major errors whenever any object or action is named incorrectly.

For the next (more detailed, and thus more costly) round of annotation, only captions that were not marked as perfect or acceptable in the first round are selected. Since these captions contain errors, the new annotator is required to provide detailed feedback about the mistakes. Annotators are asked to:

  1. Choose the type of required correction (something “should be replaced”, “is missing”, or “should be deleted”)
  2. Write feedback in natural language (annotators are asked to describe a single mistake at a time)
  3. Mark the type of mistake (whether the mistake corresponds to an error in object, action, attribute, preposition, counting, or grammar)
  4. Highlight the word/phrase that contains the mistake
  5. Correct the chosen word/phrase
  6. Evaluate the quality of the caption after correction (it could be bad even after one round of correction)
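One way to picture the data collected in the two rounds is as a small record type. The sketch below is an assumption about how such annotations could be stored, not the author's actual format: all class and field names are hypothetical, the example caption and feedback are invented, and reusing the first-round quality scale for the post-correction rating is also an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Quality(Enum):
    """First-round caption rating."""
    PERFECT = "perfect"
    ACCEPTABLE = "acceptable"
    GRAMMAR_ONLY = "grammar mistakes only"
    MINOR_ERROR = "minor error"
    MAJOR_ERROR = "major error"


class CorrectionType(Enum):
    REPLACE = "should be replaced"
    MISSING = "is missing"
    DELETE = "should be deleted"


class MistakeType(Enum):
    OBJECT = "object"
    ACTION = "action"
    ATTRIBUTE = "attribute"
    PREPOSITION = "preposition"
    COUNTING = "counting"
    GRAMMAR = "grammar"


@dataclass
class Feedback:
    """One second-round annotation, describing a single mistake."""
    correction_type: CorrectionType   # step 1
    feedback_text: str                # step 2
    mistake_type: MistakeType         # step 3
    highlighted_phrase: str           # step 4
    corrected_phrase: str             # step 5
    quality_after: Quality            # step 6 (ASSUMPTION: same scale as round one)


def needs_second_round(quality):
    """Only captions not rated perfect or acceptable go to round two."""
    return quality not in (Quality.PERFECT, Quality.ACCEPTABLE)


# Invented example: caption "a cat sitting on a bench" for an image of a dog.
example = Feedback(
    correction_type=CorrectionType.REPLACE,
    feedback_text="There is a dog on the bench, not a cat.",
    mistake_type=MistakeType.OBJECT,
    highlighted_phrase="a cat",
    corrected_phrase="a dog",
    quality_after=Quality.ACCEPTABLE,
)
```

Keeping the highlighted phrase and its correction as separate fields makes it straightforward to align each piece of feedback with a phrase of the generated caption.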


Feedback Network

Policy Gradient Optimization using Natural Language Feedback

Experimental Results