Label-Free Supervision of Neural Networks with Physics and Domain Knowledge


Introduction

Applications of machine learning are often encumbered by the need for large amounts of labeled training data, and the success of neural networks has made labeled data even more crucial (Krizhevsky, Sutskever, and Hinton 2012; LeCun, Bengio, and Hinton 2015). Nonetheless, humans are often able to learn without direct examples, opting instead for high-level instructions on how a task should be performed, or what the result will look like when completed. This work explores whether a similar principle can be applied to teaching machines: can we supervise networks without individual examples by instead describing only the structure of desired outputs?

Unsupervised learning methods, such as autoencoders, also aim to uncover hidden structure in the data without access to any labels. Such systems succeed in producing highly compressed yet informative representations of the inputs (Kingma and Welling 2013; Le 2013). However, these representations differ from the ones sought here, as they are not explicitly constrained to have a particular meaning or semantics. This paper attempts to explicitly provide the semantics of the hidden variables we hope to discover, while still training without labels by learning from constraints that are known to hold according to prior domain knowledge. Training without direct examples of the values our hidden (output) variables take yields several advantages over traditional supervised learning, including:

  • a reduction in the amount of work spent labeling,
  • an increase in generality, as a single set of constraints can be applied to multiple data sets without relabeling.

Problem Setup

In a traditional supervised learning setting, we are given a training set [math]\displaystyle{ D=\{(x_1, y_1), \cdots, (x_n, y_n)\} }[/math] of [math]\displaystyle{ n }[/math] training examples. Each example is a pair [math]\displaystyle{ (x_i,y_i) }[/math] formed by an instance [math]\displaystyle{ x_i \in X }[/math] and the corresponding output (label) [math]\displaystyle{ y_i \in Y }[/math]. The goal is to learn a function [math]\displaystyle{ f: X \rightarrow Y }[/math] mapping inputs to outputs. To quantify performance, a loss function [math]\displaystyle{ \ell:Y \times Y \rightarrow \mathbb{R} }[/math] is provided, and a mapping is found via

[math]\displaystyle{ f^* = \text{argmin}_{f \in \mathcal{F}} \sum_{i=1}^n \ell(f(x_i),y_i) }[/math]

where the optimization is over a pre-defined class of functions [math]\displaystyle{ \mathcal{F} }[/math] (the hypothesis class). In our case, [math]\displaystyle{ \mathcal{F} }[/math] will be (convolutional) neural networks parameterized by their weights. The loss could be, for example, [math]\displaystyle{ \ell(f(x_i),y_i) = 1[f(x_i) \neq y_i] }[/math]. By restricting the space of possible functions to the hypothesis class [math]\displaystyle{ \mathcal{F} }[/math], we are leveraging prior knowledge about the specific problem we are trying to solve. Informally, the so-called No Free Lunch Theorems state that every machine learning algorithm must make such assumptions in order to work. Another common way in which a modeler incorporates prior knowledge is by specifying an a-priori preference for certain functions in [math]\displaystyle{ \mathcal{F} }[/math], incorporating a regularization term [math]\displaystyle{ R:\mathcal{F} \rightarrow \mathbb{R} }[/math], and solving for [math]\displaystyle{ f^* = \text{argmin}_{f \in \mathcal{F}} \sum_{i=1}^n \ell(f(x_i),y_i) + R(f) }[/math]. Typically, the regularization term specifies a preference for "simpler" functions (Occam's razor).
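
To make this objective concrete, here is a minimal sketch of the regularized supervised loss in PyTorch. The model `f` (assumed to be an `nn.Module`), the use of mean-squared error as [math]\displaystyle{ \ell }[/math], and the L2 weight-decay term standing in for [math]\displaystyle{ R(f) }[/math] are illustrative assumptions, not details from the paper:

```python
import torch
import torch.nn.functional as F

def supervised_loss(f, x_batch, y_batch, weight_decay=1e-4):
    """Empirical risk sum_i l(f(x_i), y_i) plus an L2 regularizer R(f)."""
    data_loss = F.mse_loss(f(x_batch), y_batch, reduction='sum')
    reg = weight_decay * sum((p ** 2).sum() for p in f.parameters())
    return data_loss + reg
```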

In this paper, prior knowledge on the structure of the outputs is modelled by providing a weighted constraint function [math]\displaystyle{ g:X \times Y \rightarrow \mathbb{R} }[/math], used to penalize “structures” that are not consistent with our prior knowledge, and the paper explores whether this weak form of supervision is sufficient to learn interesting functions. While one clearly needs labels [math]\displaystyle{ y }[/math] to evaluate [math]\displaystyle{ f^* }[/math], labels may not be necessary to discover [math]\displaystyle{ f^* }[/math]. If prior knowledge informs us that the outputs of [math]\displaystyle{ f^* }[/math] have other unique properties among functions in [math]\displaystyle{ \mathcal{F} }[/math], we may use these properties for training rather than direct examples [math]\displaystyle{ y }[/math].

Specifically, an unsupervised approach is considered in which the labels [math]\displaystyle{ y_i }[/math] are not provided; instead, a necessary property of the output, encoded by the constraint function [math]\displaystyle{ g }[/math], is optimized:

[math]\displaystyle{ \hat{f}^* = \text{argmin}_{f \in \mathcal{F}} \sum_{i=1}^n g(x_i,f(x_i))+ R(f) }[/math]

If optimizing the above equation is sufficient to find [math]\displaystyle{ \hat{f}^* }[/math], we can use it in place of labels. If it is not sufficient, additional regularization terms are added. The idea is illustrated with three examples, as described in the next section; a schematic sketch of the label-free loss follows below.
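
Schematically, the only change from the supervised sketch above is that the loss queries the constraint [math]\displaystyle{ g }[/math] instead of the labels. Again, the names `f` and `g` and the weight-decay regularizer are placeholders:

```python
def label_free_loss(f, x_batch, g, weight_decay=1e-4):
    """Label-free objective: sum_i g(x_i, f(x_i)) + R(f); no y_batch needed."""
    preds = f(x_batch)                # the hidden variables we hope to discover
    constraint = g(x_batch, preds)    # penalty for violating prior knowledge
    reg = weight_decay * sum((p ** 2).sum() for p in f.parameters())
    return constraint + reg
```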

Experiments

Tracking an object in free fall

In the first experiment, the authors record videos of an object being thrown across the field of view, and aim to learn the object's height in each frame. The goal is to obtain a regression network mapping [math]\displaystyle{ \mathbb{R}^{\text{height} \times \text{width} \times 3} \rightarrow \mathbb{R} }[/math], where [math]\displaystyle{ \text{height} }[/math] and [math]\displaystyle{ \text{width} }[/math] are the number of vertical and horizontal pixels per frame, and each pixel has 3 color channels. This network is trained as a structured prediction problem operating on a sequence of [math]\displaystyle{ N }[/math] images to produce a sequence of [math]\displaystyle{ N }[/math] heights, [math]\displaystyle{ \left(\mathbb{R}^{\text{height} \times \text{width} \times 3} \right)^N \rightarrow \mathbb{R}^N }[/math], so each piece of data [math]\displaystyle{ \mathbf{x} }[/math] is a vector of [math]\displaystyle{ N }[/math] images. Rather than supervising the network with direct labels, [math]\displaystyle{ \mathbf{y} \in \mathbb{R}^N }[/math], the network is instead supervised to find an object obeying the elementary physics of free-falling objects. An object acting under gravity will have a fixed acceleration of [math]\displaystyle{ a = -9.8 \text{ m}/\text{s}^2 }[/math], and the plot of the object's height over time will form a parabola:

[math]\displaystyle{ \mathbf{y}_i = y_0 + v_0(i\Delta t) + \frac{1}{2} a(i\Delta t)^2 }[/math]

The idea is that, given any trajectory of [math]\displaystyle{ N }[/math] height predictions [math]\displaystyle{ f(\mathbf{x}) }[/math], we fit a parabola with fixed curvature to those predictions and minimize the resulting residual. Formally, if we specify [math]\displaystyle{ \mathbf{a} = [\frac{1}{2} a\Delta t^2, \frac{1}{2} a(2 \Delta t)^2, \ldots, \frac{1}{2} a(N \Delta t)^2] }[/math], the prediction produced by the fitted parabola is:

[math]\displaystyle{ \mathbf{\hat{y}} = \mathbf{a} + \mathbf{A} (\mathbf{A}^T\mathbf{A})^{-1} \mathbf{A}^T (f(\mathbf{x}) - \mathbf{a}) }[/math]

where

[math]\displaystyle{ \mathbf{A} = \left[ {\begin{array}{*{20}c} \Delta t & 1 \\ 2\Delta t & 1 \\ 3\Delta t & 1 \\ \vdots & \vdots \\ N\Delta t & 1 \\ \end{array} } \right] }[/math]

The constraint loss is then defined as

[math]\displaystyle{ g(\mathbf{x},f(\mathbf{x})) = g(f(\mathbf{x})) = \sum_{i=1}^{N} |\mathbf{\hat{y}}_i - f(\mathbf{x})_i| }[/math]

Note that [math]\displaystyle{ \mathbf{\hat{y}} }[/math] is not the ground truth; it is the best fixed-curvature parabola fit to the network's own predictions. Because [math]\displaystyle{ g }[/math] is differentiable almost everywhere, it can be optimized with SGD. The authors find that, when combined with existing regularization methods for neural networks, this optimization is sufficient to recover [math]\displaystyle{ f^* }[/math] up to an additive constant [math]\displaystyle{ C }[/math] (specifying what object height corresponds to 0).
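
A minimal PyTorch sketch of this constraint loss, following the equations above. The variable names and the choice to precompute the projection matrix [math]\displaystyle{ \mathbf{A}(\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T }[/math] are my own; the paper's implementation may differ:

```python
import torch

N, dt, grav = 5, 0.1, -9.8   # frames per clip, 10 fps, gravity in m/s^2

t = dt * torch.arange(1, N + 1, dtype=torch.float32)   # times i * dt, i = 1..N
a_vec = 0.5 * grav * t ** 2                            # fixed-curvature term a
A = torch.stack([t, torch.ones(N)], dim=1)             # N x 2 matrix with rows [i*dt, 1]
P = A @ torch.linalg.inv(A.T @ A) @ A.T                # projection onto span{t, 1}

def free_fall_loss(y_pred):
    """g(f(x)) = sum_i |y_hat_i - f(x)_i|, with y_hat the fitted parabola."""
    y_hat = a_vec + P @ (y_pred - a_vec)   # the equation for y_hat above
    return torch.abs(y_hat - y_pred).sum()
```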

The data set is collected with a laptop webcam running at 10 frames per second ([math]\displaystyle{ \Delta t = 0.1s }[/math]). The camera position is fixed, and 65 diverse trajectories of the object in flight are recorded, totalling 602 images. For each trajectory, the network is trained on randomly selected intervals of [math]\displaystyle{ N=5 }[/math] contiguous frames. Images are resized to [math]\displaystyle{ 56 \times 56 }[/math] pixels before being fed into a small, randomly initialized neural network with no pretraining. The network consists of 3 Conv/ReLU/MaxPool blocks followed by 2 Fully Connected/ReLU layers with dropout probability 0.5 and a single regression output.
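
A plausible instantiation of this architecture in PyTorch. The channel counts, kernel sizes, and hidden widths below are guesses; the summary only fixes the block structure, the 56×56×3 input, and the 0.5 dropout:

```python
import torch.nn as nn

height_net = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 56 -> 28
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 28 -> 14
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 14 -> 7
    nn.Flatten(),
    nn.Linear(64 * 7 * 7, 128), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(128, 128), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(128, 1),   # single regression output: the object's height
)
```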

Since changing [math]\displaystyle{ y_0 }[/math] and [math]\displaystyle{ v_0 }[/math] results in the same constraint loss [math]\displaystyle{ g }[/math], the authors evaluate the result by the correlation of predicted heights with ground-truth pixel measurements (which, in my opinion, is not a bulletproof evaluation, as described in the critique section). The table below shows that, under this evaluation criterion, the result is quite satisfying.

Evaluation (correlation with ground-truth pixel measurements):

  Method                    Correlation
  Random Uniform Output     12.1%
  Supervised with Labels    94.5%
  Approach in this Paper    90.1%
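
For reference, the evaluation metric is just the Pearson correlation coefficient, which is insensitive to the affine ambiguity left unresolved by the constraint loss; a one-line numpy sketch (the function name is mine):

```python
import numpy as np

def eval_correlation(pred_heights, true_pixel_heights):
    """Pearson correlation between predicted and ground-truth pixel heights."""
    return np.corrcoef(pred_heights, true_pixel_heights)[0, 1]
```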