# Label-Free Supervision of Neural Networks with Physics and Domain Knowledge

(**NOT COMPLETE YET**)

## Introduction

Applications of machine learning are often encumbered by the need for large amounts of labeled training data. Neural networks have made large amounts of labeled data even more crucial to success (Krizhevsky, Sutskever, and Hinton 2012; LeCun, Bengio, and Hinton 2015). Nonetheless, humans are often able to learn without direct examples, relying instead on high-level instructions for how a task should be performed or what it will look like when completed. This work explores whether a similar principle can be applied to teaching machines: can we supervise networks without individual examples by describing only the structure of the desired outputs?

Unsupervised learning methods such as autoencoders also aim to uncover hidden structure in the data without access to any labels. Such systems succeed in producing highly compressed, yet informative representations of the inputs (Kingma and Welling 2013; Le 2013). However, these representations differ from ours in that they are not explicitly constrained to have a particular meaning or semantics. This paper attempts to explicitly provide the semantics of the hidden variables we hope to discover, while still training without labels by learning from constraints that are known to hold according to prior domain knowledge. Training without direct examples of the values our hidden (output) variables take offers several advantages over traditional supervised learning, including:

- a reduction in the amount of work spent labeling,
- an increase in generality, as a single set of constraints can be applied to multiple data sets without relabeling.

## Problem Setup

In a traditional supervised learning setting, we are given a training set [math]D=\{(x_1, y_1), \cdots, (x_n, y_n)\}[/math] of [math]n[/math] training examples. Each example is a pair [math](x_i,y_i)[/math] formed by an instance [math]x_i \in X[/math] and the corresponding output (label) [math]y_i \in Y[/math]. The goal is to learn a function [math]f: X \rightarrow Y[/math] mapping inputs to outputs. To quantify performance, a loss function [math]\ell:Y \times Y \rightarrow \mathbb{R}[/math] is provided, and a mapping is found via

- [math] f^* = \arg\min_{f \in \mathcal{F}} \sum_{i=1}^n \ell(f(x_i),y_i) [/math]

where the optimization is over a pre-defined class of functions [math]\mathcal{F}[/math] (the hypothesis class). In our case, [math]\mathcal{F}[/math] will be (convolutional) neural networks parameterized by their weights. The loss could be, for example, the 0-1 loss [math]\ell(f(x_i),y_i) = 1[f(x_i) \neq y_i][/math]. By restricting the space of possible functions through the choice of hypothesis class [math]\mathcal{F}[/math], we leverage prior knowledge about the specific problem we are trying to solve. Informally, the so-called No Free Lunch Theorems state that every machine learning algorithm must make such assumptions in order to work. Another common way for a modeler to incorporate prior knowledge is to specify an a priori preference for certain functions in [math]\mathcal{F}[/math] by adding a regularization term [math]R:\mathcal{F} \rightarrow \mathbb{R}[/math] and solving for [math] f^* = \arg\min_{f \in \mathcal{F}} \sum_{i=1}^n \ell(f(x_i),y_i) + R(f)[/math]. Typically, [math]R[/math] specifies a preference for "simpler" functions (Occam's razor).
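The regularized objective can be made concrete with a minimal sketch. It is not the setup used in the paper: it assumes a toy hypothesis class of one-dimensional threshold classifiers in place of a neural network, the 0-1 loss, and a hypothetical regularizer [math]R(f) = \lambda |t|[/math] on the threshold [math]t[/math]. The data set and all names (`erm`, `make_threshold`, `reg_weight`) are illustrative assumptions.

```python
from typing import Callable, List, Optional, Tuple


def zero_one_loss(pred: int, label: int) -> float:
    """0-1 loss: 1 if the prediction differs from the label, else 0."""
    return 1.0 if pred != label else 0.0


def make_threshold(t: float) -> Callable[[float], int]:
    """One member of the toy hypothesis class F: predict 1 iff x >= t."""
    return lambda x: 1 if x >= t else 0


def erm(
    data: List[Tuple[float, int]],
    thresholds: List[float],
    reg_weight: float = 0.0,
) -> Tuple[Optional[float], float]:
    """Exhaustively minimize sum_i loss(f(x_i), y_i) + R(f) over a finite F.

    R(f) = reg_weight * |t| is an assumed, purely illustrative regularizer.
    """
    best_t, best_obj = None, float("inf")
    for t in thresholds:
        f = make_threshold(t)
        empirical_loss = sum(zero_one_loss(f(x), y) for x, y in data)
        objective = empirical_loss + reg_weight * abs(t)
        if objective < best_obj:
            best_t, best_obj = t, objective
    return best_t, best_obj


# Labeled training set D = {(x_i, y_i)} (made-up values).
data = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
thresholds = [0.0, 0.25, 0.5, 0.75, 1.0]

best_t, best_obj = erm(data, thresholds, reg_weight=0.01)
print(best_t, best_obj)
```

With a neural network the exhaustive search would be replaced by gradient-based optimization over the weights, and the 0-1 loss by a differentiable surrogate.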

In this paper, prior knowledge about the structure of the outputs is modeled by a weighted constraint function [math]g:X \times Y \rightarrow \mathbb{R}[/math], which penalizes “structures” that are inconsistent with that knowledge. The paper explores whether this weak form of supervision is sufficient to learn interesting functions. While one clearly needs labels [math]y[/math] to evaluate [math]f^*[/math], labels may not be necessary to discover [math]f^*[/math]. If prior knowledge tells us that the outputs of [math]f^*[/math] have properties that set it apart from the other functions in [math]\mathcal{F}[/math], we may use these properties for training in place of direct examples [math]y[/math].

Specifically, the paper considers an unsupervised setting in which the labels [math]y_i[/math] are not provided. Instead, training minimizes [math]g[/math], which encodes a necessary property of the output:

- [math]\hat{f}^* = \arg\min_{f \in \mathcal{F}} \sum_{i=1}^n g(x_i,f(x_i))+ R(f) [/math]
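The label-free objective can be illustrated with a minimal sketch under strong assumptions. Here each input [math]x_i[/math] is a short sequence of scalar observations of a falling object, recorded in an unknown unit, and [math]f[/math] is a single-parameter linear map rather than a convolutional network. The constraint [math]g[/math] is an assumed one, chosen for illustration: predicted heights should have the constant second difference implied by uniform gravity. [math]R(f)[/math] is taken to be zero, and all names and constants (`GRAVITY`, `DT`, `TRUE_SCALE`, `make_sequence`) are hypothetical.

```python
from typing import List

GRAVITY = 9.8  # assumed gravitational acceleration (m/s^2)
DT = 0.1       # assumed time between consecutive observations (s)


def constraint_loss(w: float, xs: List[List[float]]) -> float:
    """Sum of g(x, f(x)) over sequences, with f(s) = w * s applied per frame.

    g penalizes any deviation of the predicted heights' second difference
    from -GRAVITY * DT**2. No labels y_i appear anywhere.
    """
    total = 0.0
    for seq in xs:
        ys = [w * s for s in seq]
        for k in range(1, len(ys) - 1):
            residual = ys[k + 1] - 2 * ys[k] + ys[k - 1] + GRAVITY * DT ** 2
            total += residual ** 2
    return total


def constraint_grad(w: float, xs: List[List[float]]) -> float:
    """Derivative of constraint_loss with respect to w."""
    grad = 0.0
    for seq in xs:
        for k in range(1, len(seq) - 1):
            d2s = seq[k + 1] - 2 * seq[k] + seq[k - 1]
            residual = w * d2s + GRAVITY * DT ** 2
            grad += 2 * residual * d2s
    return grad


def make_sequence(h0: float, v0: float, scale: float, n: int = 8) -> List[float]:
    """Synthetic unlabeled input: free-fall heights divided by an unknown scale."""
    return [
        (h0 + v0 * k * DT - 0.5 * GRAVITY * (k * DT) ** 2) / scale
        for k in range(n)
    ]


TRUE_SCALE = 2.0  # hidden from the learner; used only to generate inputs
xs = [make_sequence(10.0, 3.0, TRUE_SCALE), make_sequence(5.0, -1.0, TRUE_SCALE)]

w = 0.0
learning_rate = 10.0  # step size tuned to this toy problem's small curvature
for _ in range(200):
    w -= learning_rate * constraint_grad(w, xs)

print(w, constraint_loss(w, xs))
```

In this toy problem the constraint alone determines the scale `w`, because the target second difference is nonzero and so the all-zero output does not satisfy it. A constant offset in the output would remain unidentifiable, since the second difference ignores it; this is the kind of ambiguity that constraint-based training has to handle.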