# Generating Image Descriptions

## Introduction and Motivation

People often say that a picture is worth a thousand words, but with the current tools available, NLP and image recognition algorithms are unable to extract nearly that much meaningful text from images. Contrarily, when a human glances at an image, they can easily identify numerous details and relationships within a detailed visual scene. Despite numerous advancements in image classification, object detection, and natural language processing, this particular task has proven challenging for existing visual recognition models. The majority of similar visual recognition models focus on labeling objects [CITATIONS], or generating image descriptions based on limited vocabularies and fixed sentence models [CITATIONS]. The authors of this paper believed that the assumptions of these types of models were too restrictive, leading to an inability to generate rich descriptions that the human mind is capable of.

In this paper, the restriction of hard-coded word templates, and sentence structures are removed. Using this assumption-free ideal, the Kapersky and Li create a model that can generate richer descriptions drawn from a wider vocabulary. Due to the lack of directly relevant training data for this task, the authors leveraged the large quantity of images with detailed captions available online. Although these images contained no explicit information about image region labels or relationships between regions, they could treat individual captions as weak labels in which contiguous segments of words correspond to some particular, but unknown location in the image. By inferring the location of alignment between the corpus and image, a generative model is created that can generate detailed descriptions as seen in Figure 1.

In particular, this paper introduces a neural network that infers latent text relationships with corresponding image region relationships. This is achieved through an optimization problem which associates probabilities related to a multimodal image embedding space to a carefully constructed objective function. Validation of this approach is conducted on image-sentence retrieval tasks.

Furthermore, a multimodal Recurrent Neural Network (RNN) structure is used to take an input image, and generate text descriptions along with their corresponding regions. Simulations show that the generated sentences outperform rigid, retrieval based models.

## Relevant Concepts

RNN(Recurrent Neural Network): The main feature of Recurrent neural network is for each step the decision of last step also become a source of input in the next step. Along with the original input, recurrent neural network have two sources of input.

Give a simple example, if you decide to cook apple pie, burger and chicken in that order. If it’s sunny then next day you cook the same dish, otherwise you cook the next dish. This is a simple recurrent neural network. The dish you cooked today is a also an input for the next day’s decision making. (detail in here)

So mathematically speaking, The decision a recurrent net reached at time step t-1 affects the decision it will reach one moment later at time step t.

BRNN(Bidirection Recurrent Neural Network) A regular recurrent neural network can only take past information into the current input, the future information cannot be reached from the current state. However, BRNN don’t have such restriction. So the BRNN have two sources of inputs from opposite directions.

## Literature Review

Before analyzing Kapersky and Li's visual recognition model in more depth, we first discuss the shortcomings of similar models.

## The Model: Alignment

### Overview

Given a picture, there are usually multiple things that can appear in the picture. Thus one of the first problems that need to be addressed is how to align the visual data with the language data. Sentences often refer to a particular location to the image, however the algorithm must be able to identify which location it is referring to. The following section assumes that we are given an input dataset of images and sentence descriptions, and it will describe how the alignment model matches the two.

There are two main parts to this task:

i) A neural network that maps sentences and images into a "common embedding"

ii) An objective which learns this "common embedding" so that semantically similar concepts will occupy similar regions within the space

### Part I: Representing Images and Sentences

Images

Given an image, a Region Convolutional Neural Network (Girshick et al) is used to detect every possible object that appears in the image. From all the possible objects detected, the top 19 are selected, along the full image.

To create a common embedding, every image is represented by a set of h-dimensional vectors $\{v_i | i = 1 ... 20\}$ where each $v_i$ is represented by a convolutional neural network.

Sentences

We want to embed the sentences into the same h-dimensional space as above. The paper explains how this can be done using a Bidirectional Recurrent Neural Network.

Given a sequence of $N$ words, each word is enriched by a variably-sized context around that word.

Let $t = 1 ... N$ representing the position of the word in the sentence.

$I_t$ is an indicator column that has 1 at the index of the t-th word.

$W_w$ is a word embedding matrix that is initialized with a 300-dimensional matrix of word2vec weights.

$h^f_t$ represents a forward pass of the neural network.

$h^b_t$ represents a backward pass of the neural network.

Note that in the paper a Relu ($f: x \rightarrow max(0, x)$) activation function is used.

### Part II: Alignment Objective

Now that we have the images and sentences in the same embedding space, we need to define an objective such that a sentence-image pair will have a high score if they are relevant.

The paper made a simplification on a previous objective function as outlined in a paper by Girshick et al.

### Part III Recurrent Neural Network

Input : Imagines and corresponding textual description

Key challenge : the design of a model that can predict variable-sized sequence of description

Previous : define a probability distribution of next word distribution given the current word and the context

Extension : Additionally condition the generative process on an image

That is when training, taking the image pixels and input vectors, computing hidden states and outputs by iterating the following

Note that the image context vector $b_v$ is only provided at the first iteration, which the author believes to be better than at each time step

Also, it can help to also pass $b_v$ , $(W_{h x} x_t)$ through the activation function

Typical size of hidden layer is 512 neurons

RNN Training

The RNN is trained to combine a word $x_t$, the previous context $h_{t-1}$ to predict the next word $y_t$

We condition the RNN’s predictions on the image information $b_v$ via bias interactions on the first step.

Traning process :

Set $h_0 = 0$, $x_1$ to a special START vector, and $y_1$ as the first word in the sequence

Set $x_2$ to the word vector of the first word and expect the network to predict the second word

Finally, on the last step when xT represents the last word, the target label is set to a special END token. The cost function is to maximize the log probability assigned to the target labels (i.e. Softmax classifier). RNN at test time. To predict a sentence, we compute the image representation bv, set h0 = 0, x1 to the START vector and compute the distribution over the first word y1. We sample a word from the distribution (or pick the argmax), set its embedding vector as x2, and repeat this process until the END token is generated. In practice we found that beam search (e.g. beam size 7) can improve results.

## Experimental Results & Limitations

Model tested on three datasets (Flickr8K, Flickr30K, MSCOCO).

First, quality of inferred text and image alignment compared against previous work using Recall@K and median r. Main findings include: (1) proposed models (specifically BRNN) outperforms previous work, (2) performance improvement from simpler cost function, and (3) model learns to modulate the magnitude of the region and word embeddings.

Second, ability of image and region description is evaluated using BLEU, METEOR, and CIDEr scores. Main findings include: (1) Multimodal RNN outperforms retrieval baseline, and (2) proposed model does not outperform Google NIC and LRCN, but it is simpler and less costly.

Last, the Multimodal RNN is trained on the aligning image regions with texts. Main finding is that region model outperforms full frame model and ranking baseline.

Though results are optimistic, there are several limitations with the Multimodal RNN model. Namely, it is limited to generating one input array of pixels at fixed resolution, it uses additive bias interaction to gather image information, and it is built on two separate models. -

## Conclusions

Throughout the entire paper, authors have introduced the following concepts:

1.Generates natural language descriptions of image regions based on weak labels.

Q: What labels exactly?

A: Dataset with images and sentences.

2. This model has few hard-coded assumptions. Novel ranking through a common, multi-modal embedding.

We have taken several generative steps underlying here:

Align visual and language data:

---Detect image using RCNN, with CNN pre-trained.

---Compute the sentenced describing the image in the same dimensional embedding layers, using BRNN.

---MRNN for generating descriptions:

  Main challenges —— variable size not fixed
Solution: based on RNN and time series, define the next series based on current variables that we have.


Experiments:

---Image-sentence alignment ranking: High among all other methods, with a top 5 median.

---Full-frame evaluation-Image: slightly outperform many of the existing models.

---Region evaluation: Though still significant lower than human opinion, but surpassed most of the existing models.

## Steps to take afterwards

Limited input properties (e.g. one input array of pixels, images need to have a fixed resolution) ***very big issue as currently a lot of images are varied in the size and resolution.***

The RNN receives the image information only through additive bias interactions, which are known to be less expressive. That being said, the interaction the model chose is case sensitive, which is hard to achieve in real life.

In this model, they divided the image-sentence data set into two parts, descriptive sentences and visual images. Is there any possibility to skip the division and solve the thing using one single model?

## Reference

1. Karpathy, A., & Li, F. (2015, April 14). Deep Visual-Semantic Alignments for Generating Image Descriptions. Retrieved March 19, 2018, from https://arxiv.org/pdf/1412.2306v2.pdf

2. A Beginner’s Guide to Recurrent Networks and LSTMs. (n.d.). Retrieved from https://deeplearning4j.org/lstm.html#recurrent