Label-Free Supervision of Neural Networks with Physics and Domain Knowledge

(**NOT COMPLETE YET**)

Introduction

Applications of machine learning are often encumbered by the need for large amounts of labeled training data. Neural networks have made large amounts of labeled data even more crucial to success (Krizhevsky, Sutskever, and Hinton 2012; LeCun, Bengio, and Hinton 2015). Nonetheless, Humans are often able to learn without direct examples, opting instead for high level instructions for how a task should be performed, or what it will look like when completed. This work explores whether a similar principle can be applied to teaching machines: can we supervise networks without individual examples by instead describing only the structure of desired outputs.

Unsupervised learning methods such as autoencoders, also aim to uncover hidden structure in the data without having access to any label. Such systems succeed in producing highly compressed, yet informative representations of the inputs (Kingma and Welling 2013; Le 2013). However, these representations differ from ours as they are not explicitly constrained to have a particular meaning or semantics. This paper attempts to explicitly provide the semantics of the hidden variables we hope to discover, but still train without labels by learning from constraints that are known to hold according to prior domain knowledge. By training without direct examples of the values our hidden (output) variables take, several advantages are gained over traditional supervised learning, including:

• a reduction in the amount of work spent labeling,
• an increase in generality, as a single set of constraints can be applied to multiple data sets without relabeling.

Problem Setup

In a traditional supervised learning setting, we are given a training set $D=\{(x_1, y_1), \cdots, (x_n, y_n)\}$ of $n$ training examples. Each example is a pair $(x_i,y_i)$ formed by an instance $x_i \in X$ and the corresponding output (label) $y_i \in Y$. The goal is to learn a function $f: X \rightarrow Y$ mapping inputs to outputs. To quantify performance, a loss function $\ell:Y \times Y \rightarrow \mathbb{R}$ is provided, and a mapping is found via

$f^* = \text{argmin}_{f \in \mathcal{F}} \sum_{i=1}^n \ell(f(x_i),y_i)$

where the optimization is over a pre-defined class of functions $\mathcal{F}$ (hypothesis class). In our case, $\mathcal{F}$ will be (convolutional) neural networks parameterized by their weights. The loss could be for example $\ell(f(x_i),y_i) = 1[f(x_i) \neq y_i]$. By restricting the space of possible functions specifying the hypothesis class $\mathcal{F}$, we are leveraging prior knowledge about the specific problem we are trying to solve. Informally, the so-called No Free Lunch Theorems state that every machine learning algorithm must make such assumptions in order to work. Another common way in which a modeler incorporates prior knowledge is by specifying an a-priori preference for certain functions in $\mathcal{F}$, incorporating a regularization term $R:\mathcal{F} \rightarrow \mathbb{R}$, and solving for $f^* = argmin_{f \in \mathcal{F}} \sum_{i=1}^n \ell(f(x_i),y_i) + R(f)$. Typically, the regularization term $R:\mathcal{F} \rightarrow \mathbb{R}$ specifies a preference for "simpler' functions (Occam's razor).

In this paper, prior knowledge on the structure of the outputs is modelled by providing a weighted constraint function $g:X \times Y \rightarrow \mathbb{R}$, used to penalize “structures” that are not consistent with our prior knowledge. And whether this weak form of supervision is sufficient to learn interesting functions is explored. While one clearly needs labels $y$ to evaluate $f^*$, labels may not be necessary to discover $f^*$. If prior knowledge informs us that outputs of $f^*$ have other unique properties among functions in $\mathcal{F}$, we may use these properties for training rather than direct examples $y$.

Specifically, an unsupervised approach where the labels $y_i$ are not provided to us is considered, where a necessary property of the output $g$ is optimized instead.

$\hat{f}^* = \text{argmin}_{f \in \mathcal{F}} \sum_{i=1}^n g(x_i,f(x_i))+ R(f)$

If the optimizing the above equation is sufficient to find $\hat{f}^*$, we can use it in replace of labels. If it's not sufficient, additional regularization terms are added. The idea is illustrated with three examples, as described in the next section.

Experiments

Tracking an object in free fall

In the first experiment, they record videos of an object being thrown across the field of view, and aim to learn the object's height in each frame. The goal is to obtain a regression network mapping from ${(R^{\text{height} \times \text{width} \times 3})}^N \rightarrow \mathbb{R}$, where $\text{height}$ and $\text{width}$ are the number of vertical and horizontal pixels per frame, and each pixel has 3 color channels. this network is trained as a structured prediction problem operating on a sequence of $N$ images to produce a sequence of $N$ heights, $\left(R^{\text{height} \times \text{width} \times 3} \right)^N \rightarrow \mathbb{R}^N$, and each piece of data $x_i$ will be a vector of images, $\mathbf{x}$. Rather than supervising the network with direct labels, $\mathbf{y} \in \mathbb{R}^N$, the network is instead supervised to find an object obeying the elementary physics of free falling objects. An object acting under gravity will have a fixed acceleration of $a = -9.8 m / s^2$, and the plot of the object's height over time will form a parabola:

$\mathbf{y}_i = y_0 + v_0(i\Delta t) + \frac{1}{2} a(i\Delta t)^2$

The idea is, given any trajectory of $N$ height predictions, $f(\mathbf{x})$, we fit a parabola with fixed curvature to those predictions, and minimize the resulting residual. Formally, if we specify $\mathbf{a} = [\frac{1}{2} a\Delta t^2, \frac{1}{2} a(2 \Delta t)^2, \ldots, \frac{1}{2} a(N \Delta t)^2]$, the prediction produced by the fitted parabola is:

$\mathbf{\hat{y}} = \mathbf{a} + \mathbf{A} (\mathbf{A}^T\mathbf{A})^{-1} \mathbf{A}^T (f(\mathbf{x}) - \mathbf{a})$

where

$\mathbf{A} = \left[ {\begin{array}{*{20}c} \Delta t & 1 \\ 2\Delta t & 1 \\ 3\Delta t & 1 \\ \vdots & \vdots \\ N\Delta t & 1 \\ \end{array} } \right]$

The constraint loss is then defined as

$g(\mathbf{x},f(\mathbf{x})) = g(f(\mathbf{x})) = \sum_{i=1}^{N} |\mathbf{\hat{y}}_i - f(\mathbf{x})_i|$

Note that $\hat{y}$ is not the ground truth labels. Because $g$ is differentiable almost everywhere, it can be optimized with SGD. They find that when combined with existing regularization methods for neural networks, this optimization is sufficient to recover $f^*$ up to an additive constant $C$ (specifying what object height corresponds to 0).

The data set is collected on a laptop webcam running at 10 frames per second ($\Delta t = 0.1s$). The camera position is fixed and 65 diverse trajectories of the object in flight, totalling 602 images are recorded. For each trajectory, the network is trained on randomly selected intervals of $N=5$ contiguous frames. Images are resized to $56 \times 56$ pixels before going into a small, randomly initialized neural network with no pretraining. The network consists of 3 Conv/ReLU/MaxPool blocks followed by 2 Fully Connected/ReLU layers with probability 0.5 dropout and a single regression output.

Since scaling the $y_0$ and $v_0$ results in the same constraint loss $g$, the authors evaluate the result by the correlation of predicted heights with ground truth pixel measurements (which in my opinion is not a bullet proof evaluation, as described in the critique section). We see from the table below that, under their evaluation criteria, the result is pretty satisfying.

evaluation
Method Random Uniform Output Supervised with Labels Approach in this Paper
Correlation 12.1% 94.5% 90.1%

Tracking the position of a walking man

In the second experiment, they aim to detect the horizontal position of a person walking across a frame without providing direct labels $y \in \mathbb{R}$ by exploiting the assumption that the person will be walking at a constant velocity over short periods of time. This is formulated as a structured prediction problem $f: \left(R^{\text{height} \times \text{width} \times 3} \right)^N \rightarrow \mathbb{R}^N$, and each training instances $x_i$ are a vector of images, $\mathbf{x}$, being mapped to a sequence of predictions, $\mathbf{y}$. Given the similarities to the first experiment with free falling objects, we might hope to simply remove the gravity term from equation and retrain. However, in this case, that is not possible, as the constraint provides a necessary, but not sufficient, condition for convergence.

Given any sequence of correct outputs, [/itex](\mathbf{y}_1, \ldots, \mathbf{y}_N)[/itex], the modified sequence, [/itex](\lambda * \mathbf{y}_1 + C, \ldots, \lambda * \mathbf{y}_N + C)[/itex] ([/itex]\lambda, C \in \mathbbm{R}[/itex]) will also satisfy the constant velocity constraint. In the worst case, when [/itex]\lambda = 0[/itex], [/itex]f \equiv C[/itex], and the network can satisfy the constraint while having no dependence on the image. The trivial output is avoided by adding two two additional loss terms.

$h_1(\mathbf{x}) = -\text{std}(f(\mathbf{x}))$

which seeks to maximize the standard deviation of the output, and

$\begin{split} h_2(\mathbf{x}) = \hphantom{'} & \text{max}(\text{ReLU}(f(\mathbf{x}) - 10)) \hphantom{\text{ }}+ \\ & \text{max}(\text{ReLU}(0 - f(\mathbf{x}))) \end{split}$

which limit the output to a fixed ranged $[0, 10]$, the final loss is thus:

$\begin{split} g(\mathbf{x}) = \hphantom{'} & ||(\mathbf{A} (\mathbf{A}^T\mathbf{A})^{-1} \mathbf{A}^T - \mathbf{I}) * f(\mathbf{x})||_1 \hphantom{\text{ }}+ \\ & \gamma_1 * h_1(\mathbf{x}) \hphantom{\text{ }}+ \\ & \gamma_2 * h_2(\mathbf{x}) % h_2(y) & = \text{max}(\text{ReLU}(y - 10)) + \\ % & \hphantom{=}\hphantom{a} \text{max}(\text{ReLU}(0 - y)) \end{split}$

The data set contains 11 trajectories across 6 distinct scenes, totalling 507 images resized to $56 \times 56$. The network is trained to output linearly consistent positions on 5 strided frames from the first half of each trajectory, and is evaluated on the second half. The boundary violation penalty is set to $\gamma_2 = 0.8$ and the standard deviation bonus is set to $\gamma_1 = 0.6$.

As in the previous experiment, the result is evaluated by the correlation with the ground truth. The result is as follows:

evaluation
Method Random Uniform Output Supervised with Labels Approach in this Paper
Correlation 45.9% 80.5% 95.4%

Surprisingly, the approach in this paper beats the same network trained with direct labeled supervision on the test set, which can be attributed to overfitting on the small amount of training data available.

Conclusion and Critique

This paper has introduced a method for using physics and other domain constraints to supervise neural networks. However, the approach described in this paper are not entirely new. Similar ideas are already widely used in Q learning, where the Q value are not available, and the network is supervised by the constraint

$Q(s,a) = R(r,s) + \gamma \sum_{s' ~ P_{sa}}{\text{max}_{a'}Q(s',a')}$

ALso, the paper has a mistake where they quote the free fall equation as

$\mathbf{y}_i = y_0 + v_0(i\Delta t) + a(i\Delta t)^2$

which should be

$\mathbf{y}_i = y_0 + v_0(i\Delta t) + \frac{1}{2} a(i\Delta t)^2$

Although in this case it doesn't affect the result

For the evaluation of the experiments, they used correlation with ground truth as the metric to avoid the fact that the output can be scaled without affecting the constraint loss. This is fine if the network gives output of the same scale. However, there's no such guarantee, and the network may give output of varying scale, in which case, we can't say that the network has learnt the correct thing, although it may have a high correlation with ground truth. In fact, to solve the scaling issue, an obvious way is to combine the constraints introduced in this paper with some labeled training data. It's not clear why the author didn't experiment with a combination of these two loss.

Finally, this paper only picks the examples where the constraints are easy to design, while in some more popular tasks such as image classification, what kind of constraints are needed is not straightforward at all.