# Mapping Images to Scene Graphs with Permutation-Invariant Structured Prediction

# Motivation

In the field of artificial intelligence, a major goal is to enable machines to understand complex images, such as the underlying relationships between objects that exist in each scene. Although there are models today that capture both complex labels and interactions between labels, there is a disconnect for what guidelines should be used when leveraging deep learning. This paper introduces a design principle for such models that stem from the concept of permutation invariance, and proves state of the art performance on models that follow this principle.

The primary contributions that this paper makes include:

- Deriving sufficient and necessary conditions for respecting graph-permutation invariance in deep structured prediction architectures
- Empirically proving the benefit of graph-permutation invariance
- Developing a state-of-the-art model for scene graph predictions over large set of complex visual scenes

# Introduction

In order for a machine to interpret complex visual scenes, it must recognize and understand both objects and relationships between the objects in the scene. A **scene graph** is a representation of the set of objects and relations that exist in the scene, where objects are represented as nodes and relations are represented as edges connecting the different nodes. Hence, the prediction of the scene graph is analogous to inferring the joint set of objects and relations of a visual scene.

Given that objects in scenes are interdependent on each other, joint prediction of the objects and relations is necessary. The field of structured prediction, which involves the general problem of inferring multiple inter-dependent labels, is of interest for this problem.

In structured prediction models, a score function [math]s(x, y)[/math] is defined to evaluate the compatibility between label [math]y[/math] and input [math]x[/math]. For instance, when interpreting the scene of an image, [math]x[/math] refers to the image itself, and [math]y[/math] refers to a complex label, which contains both the objects and the relations between objects. As with most other inference methods, the goal is to find the label [math]y*[/math] such that [math]s(x,y)[/math] is maximized. However, the major concern is that the space for possible label assignments grows exponentially with respect to input size. For example, although an image may seem very simple, the corpus containing possible labels for objects may be very large, rendering it difficult to optimize the scoring function.

The paper presents an alternative approach, for which input [math]x[/math] is mapped to structured output [math]y[/math] using a "black box" neural network, omitting the definition of a score function. The main concern for this approach is the determination of the network architecture.