Introduction

This paper uses Recurrent Neural Networks to find a recursive structure that is commonly found in the inputs of different modalities such as natural scene images or natural language sentences. This is the first deep learning work which learns full scene segmentation, annotation and classification. The same algorithm can be used both to provide a competitive syntactic parser for natural language sentences from the Penn Treebank and to outperform alternative approaches for semantic scene segmentation, annotation and classification.

For vision applications, the approach differs from any previous works in that it uses off-the-shelf vision features of segments obtained from oversegmented images instead of learning feature from raw images. In addition, the same network can be used recursively to achieve classification instead of building a hierarchy by a convolutional neural network.

Also, this particular approach for NLP is different in that it handles variable sized sentences in a natural way and captures the recursive nature of natural language. Furthermore, it jointly learns parsing decisions, categories for each phrase and phrase feature embeddings which capture the semantics of their constituents.

Core Idea

The following figure describes the recursive structure that is present in the images and the sentences.

Fig 1. Illustration of the RNN Parsing Images and Text

Images are first over segmented into regions which are later mapped to semantic feature vector using a neural network. These features are then used as an input to the RNN, which decides whether or not to merge the neighbouring images. This is decided based on a score which is higher if the neighbouring images are similar.

In total the RNN computed 3 outputs :

• Score, indicating whether the neighboring regions should be merged or not
• A new semantic feature representation for this larger region
• Class label

The same procedure is applied to parsing of words too. The semantic features are given as an input to the RNN, then they are merged into phrases in a syntactically and semantically meaningful order.

Input Representation

Each image is divided into 78 segments, and 119 Features(described by Gould et al.<ref> Gould, Stephen, Richard Fulton, and Daphne Koller. "Decomposing a scene into geometric and semantically consistent regions." Computer Vision, 2009 IEEE 12th International Conference on. IEEE, 2009. </ref>) from each segment are extracted. These features include color and texture features , boosted pixel classifier scores (trained on the labelled training data), as well as appearance and shape features.

Each of these images are then transformed semantically by applying it to a neural network layer using a logistic function as the activation unit.

ai=f(WsemFi + bsem)

where W is the weight that we want to learn, F is the Feature Vector, b is the bias and f is the sigmoid function. In this version of experiments, the original sigmoid function $f(x)=\tfrac{1}{1 + e^{-x}}$ was used.

For the sentences each word is represented by an $n$-dimensional vector (n=100 in the paper). The values of these vectors are learned to capture co-occurrence statistics and they are stored as columns in a matrix $L \in \mathbb{R}^{n \times |V|}$ where $|V|$ is the size of the vocabulary (i.e., the total number of unique words that might occur). To extract the features or semantic representation of a word a binary vector $e_k$ with all zeros except for the $k$-th index can be used where $k$ corresponds to the word's column index in $L$. Given this vector, the semantic representation of the wod is obtained by

ai=Lek.

Recursive Neural Networks for Structure Prediction

In our discriminative parsing architecture, the goal is to learn a function f : X → Y, where Y is the set of all possible binary parse trees. An input x consists of two parts: (i) A set of activation vectors $\{a_1 , . . . , a_{N_{segs}}\}$, which represent input elements such as image segments or words of a sentence. (ii) A symmetric adjacency matrix A, where A(i, j) = 1, if segment i neighbors j. This matrix defines which elements can be merged. For sentences, this matrix has a special form with 1’s only on the first diagonal below and above the main diagonal.

The following figure illustrates how the inputs to RNN look like and what the correct label is. For images, there will be more than one correct binary parse tree, but for sentences there will only have one correct tree.

File:pic2.png
Fig 2. Illustration of the RNN Training Inputs

The structural loss margin for RNN to predict the tree is defined as follows

$\Delta(x,l,y^{\mathrm{proposed}})=\kappa \sum_{d \in N(y^{\mathrm{proposed}})} 1\{subTree(d) \notin Y (x, l)\}$

where the summation is over all non terminal nodes and $\Kappa$ is a parameter. Y(x,l) is the set of correct trees corresponding to input x and label l. $1\{\dots\}$ will be one for a non-empty set and zero for an empty set. To express this in somewhat more natural terms, any subtree that does not occur in any of the ground truth trees will increase the loss by one.

Given the training set, the algorithm will search for a function f with small expected loss on unseen inputs, i.e. File:pic3.png where θ are all the parameters needed to compute a score s with an RNN. The score of a tree y is high if the algorithm is confident that the structure of the tree is correct.

An additional constraint imposed is that the score of the highest scoring tree should be greater than margin defined y the structural loss function so that the model output's as high score as possible on the correct tree and as low score as possible on the wrong tree. This constraint can be expressed as File:pic4.png

With these constraints minimizing the following objective function maximizes the correct tree’s score and minimizes (up to a margin) the score of the highest scoring but incorrect tree. File:pic5.png

For learning the RNN structure, the authors used activation vectors and adjacency matrix as inputs, as well as a greedy approximation since there is no efficient dynamic programming algorithms for their RNN setting.

With an adjacency matrix A, neighboring segments are found with the algorithm and their activations added to a set of potential child node pairs:

$C = \{ [a_i, a_j]: A(i, j) = 1 \}$

So for example, from the image in Fig 2. we would have the following pairs:

$C = \{[a_1, a_2], [a_1, a_3], [a_2, a_1], [a_2, a_4], [a_3, a_1], [a_3, a_4], [a_4, a_2], [a_4, a_3], [a_4, a_5], [a_5, a_4]\}$

Where these are concatenated and given as inputs into the neural network. Potential parent representations for possible child nodes are calculated with:

$p(i, j) = f(W[c_i: c_j] + b)$

And the local score with:

$s(i, j) = W^{score} p(i, j)$

Once the scores for all pairs are calculated, three steps are performed:

1. The highest scoring pair $[a_i, a_j]$ will be removed from the set of potential child node pairs $C$. As well as any other pair containing either $a_i$ or $a_j$.

2. Adjacency Matrix $A$ is updated with a new row and column that reflects new segment along with its child segments.

3. Potential new child pairs are added to $C$.

Steps 1-3 are repeated until all pairs are merged and only one parent activation is left in the set $C$. The last remaining activation is at the root of the Recursive Neural Network that represents the whole image.

The equation that determines the quality of the structure amongst other variants is simply the sum of all the local decisions:

$s(RNN(\theta,x_i,\widehat y))=\sum_{d\in N(\widehat y)}s_d$

Unsupervised Recursive Autoencoer for structure Prediction<ref>http://nlp.stanford.edu/pubs/SocherPenningtonHuangNgManning_EMNLP2011.pdf</ref>.

Instead of using scores (as described above) to predict the tree structure, we can also use the reconstruction error to predict the structure.

How do we find reconstruction error?

1. p is the parent representation for children $[c_1;c_2]$ (same as before)

   $p=f(W^{(1)}[c_1;c_2]+b^{(1)})$



2. one way of assessing how well this p represents its children is to reconstruct the children in a reconstruction layer.

   $[c_1^';c_2^']=W^{(2)}p+b^{(2)}$


3. Then, the reconstruction is defined below. The goal is to minimize $E_{rec}([c_1];c_2)$.

   $E_{rec}([c_1;c_2])=\frac{1}{2}||[c_1;c_2]-[c_1^';c_2^']||^2$


How to construct the tree?

• It first takes the first pair of neighboring vecotrs $(c_1;c_2)=(x_1;x_2)$. We save the parent node and the resulting reconstruction error. The network shifted by one position and takes as input vectors $(c_1;c_2)=(x_2;x_3)$ and obtains $p, E_{rec}$. Repeat the process until it hits the last pair.
• Select the pair with lowest $E_rec$.

eg. Given sequence $(x_1,x_2,x_3,x_4)$, we get lowest $E_rec$ by the pair $(x_3, x_4)$. The new sequence then consists of $(x_1, x_2, p(3,4))$.

• The process repeats and treats the new vector $p(3,4)$ like any other vector.
• The process stops until it reaches a deterministic choice of collapsing the remaining two states into one parent. The tree is then recovered.

Learning

The objective function J is not differentiable due to hinge loss. Therefore, we must opt for the subgradient method (Ratliff et al., 2007) which computes a gradient-like method called the subgradient. Let $\theta = (W^{sem},W,W^{score},W^{label})$, then the gradient becomes:

$\frac{\partial J}{\partial \theta} = \frac{1}{x} \sum_{i}\frac{\partial s(\hat{y}_i)}{\partial \theta} - \frac{\partial s(\hat{y}_i)}{\partial \theta} + \lambda\theta,$

where $s(\hat{y}_i) = s(RNN(\theta,x_i,\hat{y}_{max(\tau(x_i))}))$ and $s(\hat{y}_i) = s(RNN(\theta,x_i,\hat{y}_{max(Y(x_i,l_i))}))$. In order to compute this gradient, we calculate the derivative by using backpropagation through structure (Goller & Küchler, 1996). L-BFGS was used over the complete training data to minimize the objective. This may cause problems in non-differentiable functions, but none was observed in practice.

L-BFGS is short for Limited-memory BFGS, which is an iterative method for solving unconstrained nonlinear optimization problems that using a limited amount of computer memory. Thus it is particularly suited to problems with very large numbers of variables (e.g., >1000)<ref> https://en.wikipedia.org/wiki/Broyden%E2%80%93Fletcher%E2%80%93Goldfarb%E2%80%93Shanno_algorithm </ref>.

Results

The parameters to tune in this algorithm are n, the size of the hidden layer; κ, the penalization term for incorrect parsing decisions and λ, the regularization parameter. It was found that the method was quite robust, varying in performance by only a few percent for some perameter combinations. The parameter values used were n = 100, κ = 0.05 and λ = 0.001.

Additional resources that are helpful for replicating and extending this work can be found here on the first author's personal website. This includes the source code for the project, as well as download links for the datasets used.

Scene understanding

For training and testing the researchers opted for the Stanford Background dataset—a dataset that can roughly be categorized into three types: city, countryside and sea-side. The team labelled the images with these three labels and a SVM was trained using the average over all nodes' activations in the tree as features. With an accuracy of 88.1%, this algorithm outperforms the state-of-the art features for scene categorization, Gist descriptors, which obtained only 84.0%. The results are summarized in the following figure. File:pic6.png

A single neural network layer followed by a softmax layer is also tested in this paper, which performed about 2% worse than the full RNN model.

In order to show the learned feature representations captured import appearance and label information, the researchers visualized nearest neighbour super segments. The team computed nearest neighbours across all images and all such subtrees. The figure below shows the results. The first image in each row is a random subtree's top node and the remaining nodes are the closest subtrees in the dataset in terms of Euclidean distance between the vector representations.

Natural language processing

The method was also tested on natural language processing with the Wall Street Journal section of the Penn Treebank and was evaluated with the F-measure (Manning & Schütz, 1999). While the widely used Berkeley parser was not outperformed, the scores are close (91.63% vs 90.29%). Interestingly, no syntactic information of the child nodes is provided by the parser to the parent nodes. All syntatic information used is encoded in the learned continuous representations.

Similar to the nearest neighbour scene subtrees, nearest neighbours for multiword phrases were collected. For example, "All the figures are adjusted for seasonal variatons" is a close neighbour to "All the numbers are adjusted for seasonal fluctuations".

<references />