# Introduction

This paper uses Recurrent Neural Networks to find a recursive structure that is commonly found in the inputs of different modalities such as natural scene images or natural language sentences. This is the first deep learning work which learns full scene segmentation, annotation and classification. The same algorithm can be used both to provide a competitive syntactic parser for natural language sentences from the Penn Treebank and to outperform alternative approaches for semantic scene segmentation, annotation and classification.

This particular approach for NLP is different in that it handles variable sized sentences in a natural way and captures the recursive nature of natural language. Furthermore, it jointly learns parsing decisions, categories for each phrase and phrase feature embeddings which capture the semantics of their constituents.

# Core Idea

The following figure describes the recursive structure that is present in the images and the sentences.

Fig 1. Illustration of the RNN Parsing Images and Text

Images are first over segmented into regions which are later mapped to semantic feature vector using a neural network. These features are then used as an input to the RNN, which decides whether or not to merge the neighbouring images. This is decided based on a score which is higher if the neighbouring images are similar.

In total the RNN computed 3 outputs :

• Score, indicating whether the neighboring regions should be merged or not
• A new semantic feature representation for this larger region
• Class label

The same procedure is applied to parsing of words too. The semantic features are given as an input to the RNN, then they are merged into phrases in a syntactically and semantically meaningful order.

# Input Representation

Each image is divided into 78 segments, and 119 Features from each segment are extracted. These features include color and texture features , boosted pixel classifier scores (trained on the labelled training data), as well as appearance and shape features.

Each of these images are then transformed semantically by applying it to a neural network layer using a logistic function as the activation unit.

ai=f(WsemFi + bsem)

where W is the weight that we want to learn, F is the Feature Vector, b is the bias and f is the sigmoid function. In this version of experiments, the original sigmoid function $f(x)=\tfrac{1}{1 + e^{-x}}$ was used.

For the sentences each word is represented by an $n$-dimensional vector (n=100 in the paper). The values of these vectors are learned to capture co-occurrence statistics and they are stored as columns in a matrix $L \in \mathbb{R}^{n \times |V|}$ where $|V|$ is the size of the vocabulary (i.e., the total number of unique words that might occur). To extract the features or semantic representation of a word a binary vector $e_k$ with all zeros except for the $k$-th index can be used where $k$ corresponds to the word's column index in $L$. Given this vector, the semantic representation of the wod is obtained by

ai=Lek.

# Recursive Neural Networks for Structure Prediction

In our discriminative parsing architecture, the goal is to learn a function f : X → Y, where Y is the set of all possible binary parse trees. An input x consists of two parts: (i) A set of activation vectors $\{a_1 , . . . , a_{N_{segs}}\}$, which represent input elements such as image segments or words of a sentence. (ii) A symmetric adjacency matrix A, where A(i, j) = 1, if segment i neighbors j. This matrix defines which elements can be merged. For sentences, this matrix has a special form with 1’s only on the first diagonal below and above the main diagonal.

The following figure illustrates how the inputs to RNN look like and what the correct label is. For images, there will be more than one correct binary parse tree, but for sentences there will only have one correct tree.

File:pic2.png
Fig 2. Illustration of the RNN Training Inputs

The structural loss margin for RNN to predict the tree is defined as follows

$\Delta(x,l,y^{\mathrm{proposed}})=\kappa \sum_{d \in N(y^{\mathrm{proposed}})} 1\{subTree(d) \notin Y (x, l)\}$

where the summation is over all non terminal nodes and $\Kappa$ is a parameter. Y(x,l) is the set of correct trees corresponding to input x and label l. $1\{\dots\}$ will be one for a non-empty set and zero for an empty set. To express this in somewhat more natural terms, any subtree that does not occur in any of the ground truth trees will increase the loss by one.

Given the training set, the algorithm will search for a function f with small expected loss on unseen inputs, i.e. File:pic3.png where θ are all the parameters needed to compute a score s with an RNN. The score of a tree y is high if the algorithm is confident that the structure of the tree is correct.

An additional constraint imposed is that the score of the highest scoring tree should be greater than margin defined y the structural loss function so that the model output's as high score as possible on the correct tree and as low score as possible on the wrong tree. This constraint can be expressed as File:pic4.png

With these constraints minimizing the following objective function maximizes the correct tree’s score and minimizes (up to a margin) the score of the highest scoring but incorrect tree. File:pic5.png

For learning the RNN structure, the authors used activation vectors and adjacency matrix as inputs, as well as a greedy approximation since there is no efficient dynamic programming algorithms for their RNN setting.

With an adjacency matrix A, neighboring segments are found with the algorithm and their activations added to a set of potential child node pairs:

$C = \{ [a_i, a_j]: A(i, j) = 1 \}$

So for example, from the image in Fig 2. we would have the following pairs:

$C = \{[a_1, a_2], [a_1, a_3], [a_2, a_1], [a_2, a_4], [a_3, a_1], [a_3, a_4], [a_4, a_2], [a_4, a_3], [a_4, a_5], [a_5, a_4]\}$

Where these are concatenated and given as inputs into the neural network. Potential parent representations for possible child nodes are calculated with:

$p(i, j) = f(W[c_i: c_j] + b)$

And the local score with:

$s(i, j) = W^{score} p(i, j)$

Once the scores for all pairs are calculated, three steps are performed:

1. The highest scoring pair $[a_i, a_j]$ will be removed from the set of potential child node pairs $C$. As well as any other pair containing either $a_i$ or $a_j$.

2. Adjacency Matrix $A$ is updated with a new row and column that reflects new segment along with its child segments.

3. Potential new child pairs are added to $C$.

Steps 1-3 are repeated until all pairs are merged and only one parent activation is left in the set $C$. The last remaining activation is at the root of the Recursive Neural Network that represents the whole image.

The equation that determines the quality of the structure amongst other variants is simply the sum of all the local decisions:

$s(RNN(\theta,x_i,\widehat y))=\sum_{d\in N(\widehat y)}s_d$

# Learning

The objective function J is not differentiable due to hinge loss. Therefore, we must opt for the subgradient method (Ratliff et al., 2007) which computes a gradient-like method called the subgradient. Let $\theta = (W^{sem},W,W^{score},W^{label})$, then the gradient becomes:

$\frac{\partial J}{\partial \theta} = \frac{1}{x} \sum_{i}\frac{\partial s(\hat{y}_i)}{\partial \theta} - \frac{\partial s(\hat{y}_i)}{\partial \theta} + \lambda\theta,$

where $s(\hat{y}_i) = s(RNN(\theta,x_i,\hat{y}_{max(\tau(x_i))}))$ and $s(\hat{y}_i) = s(RNN(\theta,x_i,\hat{y}_{max(Y(x_i,l_i))}))$. In order to compute this gradient, we calculate the derivative by using backpropagation through structure (Goller & Küchler, 1996). L-BFGS was used over the complete training data to minimize the objective. This may cause problems in non-differentiable functions, but none was observed in practice.

# Results

The parameters to tune in this algorithm are n, the size of the hidden layer; κ, the penalization term for incorrect parsing decisions and λ, the regularization parameter. It was found that the method was quite robust, varying in performance by only a few percent for some perameter combinations. The parameter values used were n = 100, κ = 0.05 and λ = 0.001.

For training and testing the researchers opted for the Stanford Background dataset—a dataset that can roughly be categorized into three types: city, countryside and sea-side. The team labelled the images with these three labels and a SVM was trained using the average over all nodes' activations in the tree as features. With an accuracy of 88.1%, this algorithm outperforms the state-of-the art features for scene categorization, Gist descriptors, which obtained only 84.0%. The results are summarized in the following figure. File:pic6.png

In order to show the learned feature representations captured import appearance and label information, the researchers visualized nearest neighbour super segments. The team computed nearest neighbours across all images and all such subtrees. The figure below shows the results. The first image in each row is a random subtree's top node and the remaining nodes are the closest subtrees in the dataset in terms of Euclidean distance between the vector representations.