Difference between revisions of "Mapping Images to Scene Graphs with Permutation-Invariant Structured Prediction"
(Added up to Related Works) |
(Added blurb about source/authors of the paper) |
||
Line 1: | Line 1: | ||
+ | The paper ''Mapping Images to Scene Graphs with Permutation-Invariant Structured Prediction'' was written by Roei Herzig* from Tel Aviv University, Moshiko Raboh* from Tel Aviv University, Gal Chechik from Google Brain, Bar-Ilan University, Jonathan Berant from Tel Aviv University, and Amir Globerson from Tel Aviv University. This paper is part of the NIPS 2018 conference to be hosted in December 2018 at Montréal, Canada. This paper summary is based on version 3 of the pre-print (as at May 2018) obtained from [https://arxiv.org/pdf/1802.05451v3.pdf arXiv] | ||
+ | |||
+ | (*) Equal contribution | ||
+ | |||
=Motivation= | =Motivation= | ||
In the field of artificial intelligence, a major goal is to enable machines to understand complex images, such as the underlying relationships between objects that exist in each scene. Although there are models today that capture both complex labels and interactions between labels, there is a disconnect for what guidelines should be used when leveraging deep learning. This paper introduces a design principle for such models that stem from the concept of permutation invariance, and proves state of the art performance on models that follow this principle. | In the field of artificial intelligence, a major goal is to enable machines to understand complex images, such as the underlying relationships between objects that exist in each scene. Although there are models today that capture both complex labels and interactions between labels, there is a disconnect for what guidelines should be used when leveraging deep learning. This paper introduces a design principle for such models that stem from the concept of permutation invariance, and proves state of the art performance on models that follow this principle. |
Revision as of 00:25, 14 November 2018
The paper Mapping Images to Scene Graphs with Permutation-Invariant Structured Prediction was written by Roei Herzig* from Tel Aviv University, Moshiko Raboh* from Tel Aviv University, Gal Chechik from Google Brain, Bar-Ilan University, Jonathan Berant from Tel Aviv University, and Amir Globerson from Tel Aviv University. This paper is part of the NIPS 2018 conference to be hosted in December 2018 at Montréal, Canada. This paper summary is based on version 3 of the pre-print (as at May 2018) obtained from arXiv
(*) Equal contribution
Contents
Motivation
In the field of artificial intelligence, a major goal is to enable machines to understand complex images, such as the underlying relationships between objects that exist in each scene. Although there are models today that capture both complex labels and interactions between labels, there is a disconnect for what guidelines should be used when leveraging deep learning. This paper introduces a design principle for such models that stem from the concept of permutation invariance, and proves state of the art performance on models that follow this principle.
The primary contributions that this paper makes include:
- Deriving sufficient and necessary conditions for respecting graph-permutation invariance in deep structured prediction architectures
- Empirically proving the benefit of graph-permutation invariance
- Developing a state-of-the-art model for scene graph predictions over large set of complex visual scenes
Introduction
In order for a machine to interpret complex visual scenes, it must recognize and understand both objects and relationships between the objects in the scene. A scene graph is a representation of the set of objects and relations that exist in the scene, where objects are represented as nodes and relations are represented as edges connecting the different nodes. Hence, the prediction of the scene graph is analogous to inferring the joint set of objects and relations of a visual scene.
Given that objects in scenes are interdependent on each other, joint prediction of the objects and relations is necessary. The field of structured prediction, which involves the general problem of inferring multiple inter-dependent labels, is of interest for this problem.
In structured prediction models, a score function [math]s(x, y)[/math] is defined to evaluate the compatibility between label [math]y[/math] and input [math]x[/math]. For instance, when interpreting the scene of an image, [math]x[/math] refers to the image itself, and [math]y[/math] refers to a complex label, which contains both the objects and the relations between objects. As with most other inference methods, the goal is to find the label [math]y*[/math] such that [math]s(x,y)[/math] is maximized. However, the major concern is that the space for possible label assignments grows exponentially with respect to input size. For example, although an image may seem very simple, the corpus containing possible labels for objects may be very large, rendering it difficult to optimize the scoring function.
The paper presents an alternative approach, for which input [math]x[/math] is mapped to structured output [math]y[/math] using a "black box" neural network, omitting the definition of a score function. The main concern for this approach is the determination of the network architecture.
Structured prediction
This paper further considers structured predictions using score-based methods. For structured prediction that follow a score-based approach, a score function [math]s(x, y)[/math] is used to measure how compatible label [math]y[/math] is for input [math]x[/math]. To optimize the score function, previous works have decomposed [math]s(x,y) = \sum_i f_i(x,y)[/math] in order to facilitate efficient optimization for each local score function, [math]\max_y f_i(x,y)[/math].
In the area of structured predictions, the most commonly-used score functions include the singleton score function [math]f_i(y_i, x)[/math] and pairwise score function [math]f_{ij} (y_i, y_j, x)[/math]. Previous works explored a two-stage architectures (learn local scores independently of the structured prediction goal), and end-to-end architectures (to include the inference algorithm within the computation graph).
Advantages of using score-based methods
- Allow for intuitive specification of local dependencies between labels, and how they map to global dependencies
- Linear score functions offer natural convex surrogates
- Inference in large label space is sometimes possible via exact algorithms, or empirically accurate approximations
The concern for modelling score functions using deep networks is that learning may no longer be convex. Hence, the paper presents properties for how deep networks can be used for structured predictions by considering architectures that do not require explicit maximization of a score function.
Background, Notations, and Definitions
We denote [math]y[/math] as a structured label where [math]y = [y_1, \dots, y_n][/math]
Score functions: for score-based methods, the score is defined as either the sum of a set of singleton scores [math]f_i = f_i(y_i, x)[/math] or the sum of pairwise scores [math]f_{ij} = f_{ij}(y_i, y_j, x)[/math].
Let [math]s(x,y)[/math] be the score of a score-based method. Then:
[math]s(x,y) = \begin{cases} \sum_i f_i ~ \text{if we have a set of singleton scores}\\ \sum_{ij} f_{ij} ~ \text{if we have a set of pairwise scores } \\ \end{cases}[/math]
Inference algorithm: an inference algorithm takes input set of local scores (either [math]f_i[/math] or [math]f_{ij}[/math]) and outputs an assignment of labels [math]y_1, \dots, y_n[/math] that maximizes score function [math]s(x,y)[/math]
Graph labeling function: a graph labeling function [math]\mathcal{F} : (V,E) \rightarrow Y[/math] is a function that takes input of: an ordered set of node features [math]V = [z_1, \dots, z_n][/math] and an ordered set of edge features [math]E = [z_{1,2},\dots,z_{i,j},\dots,z_{n,n-1}][/math] to output set of node labels [math]\mathbf{y} = [y_1, \dots, y_n][/math]. For instance, [math]z_i[/math] can be set equal to [math]f_i[/math] and [math]z_{ij}[/math] can be set equal to [math]f_{ij}[/math].
For convenience, the joint set of nodes and edges will be denoted as [math]\mathbf{z}[/math] to be a size [math]n^2[/math] vector ([math]n[/math] nodes and [math]n(n-1)[/math] edges).
Permutation: Let [math]z[/math] be a set of node and edge features. Given a permutation [math]\sigma[/math] of [math]\{1,\dots,n\}[/math], let [math]\sigma(z)[/math] be a new set of node and edge features given by [[math]\sigma(z)]_i = z_{\sigma(i)}[/math] and [math][\sigma(z)]_{i,j} = z_{\sigma(i), \sigma(j)}[/math]
One-hot representation: [math]\mathbf{1}[j][/math] be a one-hot vector with 1 in the [math]j^{th}[/math] coordinate
Permutation-Invariant Structured prediction
With permutation-invariant structured prediction, we would expect the algorithm to produce the same result given the same score function. For instance, consider the case where we have label space for 3 variables [math]y_1, y_2, y_3[/math] with input [math]\mathbf{z} = (f_1, f_2, f_3, f_{12}, f_{13}, f_{23})[/math] that outputs label [math]\mathbf{y} = (y_1^*, y_2^*, y_3^*)[/math]. Then if the algorithm is run on a permuted version input [math]z' = (f_2, f_1, f_3, f_{21}, f_{23}, f_{13})[/math], we would expect [math]\mathbf{y} = (y_2^*, y_1^*, y_3^*)[/math] given the same score function.
Graph permutation invariance (GPI): a graph labeling function [math]\mathcal{F}[/math] is graph-permutation invariant, if for all permutations [math]\sigma[/math] of [math]\{1, \dots, n\}[/math] and for all nodes [math]z[/math], [math]\mathcal{F}(\sigma(\mathbf{z})) = \sigma(\mathcal{F}(\mathbf{z}))[/math]
The paper presents a theorem on the necessary and sufficient conditions for a function [math]\mathcal{F}[/math] to be graph permutation invariant. Intuitively, because [math]\mathcal{F}[/math] is a function that takes an ordered set [math]z[/math] as input, the output on [math]\mathbf{z}[/math] could very well be different from [math]\sigma(\mathbf{z})[/math], which means [math]\mathcal{F}[/math] needs to have some sort of symmetry in order to sustain [math][\mathcal{F}(\sigma(\mathbf{z}))]]_k = [\mathcal{F}(\mathbf{z})]_{\sigma(k)}[/math].
Theorem 1
Let [math]\mathcal{F}[/math] be a graph labeling function. Then [math]\mathcal{F}[/math] is graph-permutation invariant if and only if there exist functions [math]\alpha, \rho, \phi[/math] such that for all [math]k=1, .., n[/math]: \begin{align} [\mathcal{F}(\mathbf{z})]_k = \rho(\mathbf{z}_k, \sum_{i=1}^n \alpha(\mathbf{z}_i, \sum_{i\neq j} \phi(\mathbf{z}_i, \mathbf{z}_{i,j}, \mathbf{z}_j))) \end{align} where [math]\phi: \mathbb{R}^{2d+e} \rightarrow \mathbb{R}^L, \alpha: \mathbb{R}^{d + L} \rightarrow \mathbb{R}^{W}, p: \mathbb{R}^{W+d} \rightarrow \mathbb{R}[/math].
Notice that for the dimensions of inputs and outputs, [math]d[/math] refers to the number of singleton features in [math]z[/math] and [math]e[/math] refers to the number of edges.
Proof Sketch for Theorem 1
The proof of this theorem can be found in the paper. A proof sketch below is provided:
For the forward direction (function that follows the form set out in equation (1) is GPI):
- Using definition of permutation [math]\sigma[/math], and rewriting [math][F(z)]_{\sigma(k)}[/math] in the form from equation (1)
- Second argument of [math]\rho[/math] is invariant under [math]\sigma[/math], since it takes the sum of all indices [math]i[/math] and all other indices [math]j \neq i [/math].
For the backward direction (any black-box GPI function can be expressed in the form of equation 1):
- Construct [math]\phi, \alpha[/math] such that second argument of [math]\rho[/math] contains all information about graph features of [math]z[/math], including edges that the features originate from
- Assume each [math]z_k[/math] uniquely identifies the node and [math]\mathcal{F}[/math] is a function only of pairwise features [math]z_{i,j}[/math]
- Construct [math]H[/math] be a perfect hash function with [math]L[/math] buckets, and [math]\phi[/math] which maps pairwise features to a vector of size [math]L[/math]
- [math]*[/math]Construct [math]\phi(z_i, z_{i,j}, z_j) = \mathbf{1}[H(z_j)] z_{i,j}[/math], which intuitively means that [math]\phi[/math] stores [math]z_{i,j}[/math] in the unique bucket for node [math]j[/math]
- Construct function [math]\alpha[/math] to output a matrix [math]\mathbb{R}^{L \times L}[/math] that maps each pairwise feature into unique positions ([math]\alpha(z_i, s_i) = \mathbf{1}[H(z_i)]s_i^T[/math])
- Construct matrix [math]M = \sum_i \alpha(z_i,s_i)[/math] by discarding rows/columns in [math]M[/math] that do not correspond to original nodes (which reduces dimension to [math]n\times n[/math]; set [math]\rho[/math] to have same outcome as [math]\mathcal{F}[/math], and set the output of [math]\mathcal{F}[/math] on [math]M[/math] to be the labels [math]\mathbf{y} = y_1, \dots, y_n[/math]
[math]*[/math]The paper presents the proof for the edge features [math]z_{ij}[/math] being scalar ([math]e = 1[/math]) for simplicity, which can be extended easily to vectors with additional indexing.
Although the results discussed previously apply to complete graphs (edges apply to all feature pairs), it can be easily extended to incomplete graphs. However, in place of permutation-invariance, it is now an automorphism-invariance.
Implications and Applications of Theorem 1
Key Implications of Theorem 1
- Architecture "collects" information from the different edges of the graph, and does so in an invariant fashion using [math]\alpha[/math] and [math]\phi[/math]
- Architecture is parallelizable, since all [math]\phi[/math] functions can be applied simultaneously
Some applications of Theorem 1
- Attention: the concept of attention can be implemented in the GPI characterization, with slight alterations to the functions [math]\alpha[/math] and [math]\phi[/math]. The complete details can be found in the supplementary materials of the paper.
- RNN: recurrent architectures can maintain GPI property, since all GPI function [math]\mathcal{F}[/math] are closed under composition. The output of one step after running [math]\mathcal{F}[/math] will act as input for the next step, but maintain the GPI property throughout.
Related Work
- Architectural invariance: suggested recently in a 2017 paper called Deep Sets by Zaheer et al., which considers the case of invariance that is more restrictive.
- Deep structured prediction: previous work applied deep learning to structured prediction, for instance semantic segmentation. Some algorithms include: message passing algorithms, gradient descent for maximizing score functions, greedy decoding (inference of labels based on time of previous labels). Apart from those algorithms, deep learning has been applied to other graph-based problems such as the Travelling Salesman Problem (Bello et al., 2016; Gilmer et al., 2017; Khalil et al., 2017). However, none of the previous work specifically address the notion of invariance in the general architecture, but rather focus on message passing architectures that can be generalized by this paper.
- Scene graph prediction: scene graph extraction allows for reasoning, question answering, and image retrieval (Johnson et al., 2015; Lu et al., 2016; Raposo et al., 2017). Some other works in this area include object detection, action recognition, and even detection of human-object ineractions (Liao et al., 2016; Plummer et al., 2017). Additional work has been done with the use of message passing algorithms (Xu et al., 2017), word embeddings (Lu et al., 2016), and end-to-end prediction directly from pixels (Newell & Deng, 2017). A notable mention is NeuralMotif (Zellers et al., 2017), which the authors describe as the current state-of-the-art model for scene graph predictions on Visual Genome dataset.