User:Gtompkin
Presented by
Grace Tompkins, Tatiana Krikella, Swaleh Hussain
Introduction
One of the fundamental challenges in machine learning in data science is dealing with missing and incomplete data. This paper proposes theoretically justified methodology for using incomplete data in neural networks, eliminating the need for direct completion of the data by imputation or other commonly used methods in existing literature. The authors propose identifying missing data points with a parametric density and then training it together with the rest of the network's parameters. The neuron's response at the first hidden layer is generalized by taking its expected value to process this probabilistic representation. This process is essentially calculating the average activation of the neuron over imputations drawn from the missing data's density. The proposed approach is advantageous as it has the ability to train neural networks using incomplete observations from datasets, which are ubiquitous in practice. This approach also requires minimal adjustments and modifications to existing architectures. Theoretical results of this study show that this process does not lead to a loss of information, while experimental results showed the practical uses of this methodology on several different types of networks.
Related Work
Currently, dealing with incomplete inputs in machine learning requires filling absent attributes based on complete, observed data. Two commonly used methods are mean imputation and k-NN imputation. Other methods for dealing with missing data involve training separate neural networks, extreme learning machines, and [math]\displaystyle{ k }[/math]-nearest neighbours. Probabilistic models of incomplete data can also be built depending on the mechanism missingness (i.e. whether the data is Missing At Random (MAR), Missing Completely At Random (MCAR), or Missing Not At Random (MNAR)), which can be fed into a particular learning model. Previous work using neural networks for missing data includes a paper by Bengio and Gringras [1] where the authors used recurrent neural networks with feedback into the input units to fill absent attributes solely to minimize the learning criterion. Goodfellow et. al. [2] also used neural networks by introducing a multi-prediction deep Boltzmann machine which could perform classification on data with missingness in the inputs.
Layer for Processing Missing Data
In this approach, the adaptation of a given neural network to incomplete data relies on two steps: the estimation of the missing data and the generalization of the neuron's activation.
Let [math]\displaystyle{ (x,J) }[/math] represent a missing data point, where [math]\displaystyle{ x \in \mathbb{R}^D }[/math], and [math]\displaystyle{ J \subset {1,...,D} }[/math] is a set of attributes with missing data.
For each missing point [math]\displaystyle{ (x,J) }[/math], define an affine subspace consisting of all points which coincide with [math]\displaystyle{ x }[/math] on known coordinates [math]\displaystyle{ J'=\{1,…,N\}/J }[/math]:
where [math]\displaystyle{ e_J=[e_j]_{j\in J} }[/math] and [math]\displaystyle{ e_j }[/math] is the [math]\displaystyle{ j^{th} }[/math] canonical vector in [math]\displaystyle{ \mathbb{R}^D }[/math].
Assume that the missing data points come from the D-dimensional probability distribution, [math]\displaystyle{ F }[/math]. In their approach, the authors assume that the data points follow a mixture of Gaussians (GMM) with diagonal covariance matrices. By choosing diagonal covariance matrices, the number of model parameters is reduced. To model the missing points [math]\displaystyle{ (x,J) }[/math], the density [math]\displaystyle{ F }[/math] is restricted to the affine subspace [math]\displaystyle{ S }[/math]. Thus, possible values of [math]\displaystyle{ (x,J) }[/math] are modelled using the conditional density [math]\displaystyle{ F_S: S \to \mathbb{R} }[/math],
To process the missing data by a neural network, the authors propose that only the first hidden layer needs modification. Specifically, they generalize the activation functions of all the neurons in the first hidden layer of the network to process the probability density functions representing the missing data points. For the conditional density function [math]\displaystyle{ F_S }[/math], the authors define the generalized activation of a neuron [math]\displaystyle{ n: \mathbb{R}^D \to \mathbb{R} }[/math] on [math]\displaystyle{ F_S }[/math] as:
provided that the expectation exists.
The following two theorems describe how to apply the above generalizations to both the ReLU and the RBF neurons, respectively.
Theorem 3.1 Let [math]\displaystyle{ F = \sum_i{p_iN(m_i, \Sigma_i)} }[/math] be the mixture of (possibly degenerate) Gaussians. Given weights [math]\displaystyle{ w=(w_1, ..., w_D) \in \mathbb{R}^D, }[/math][math]\displaystyle{ b \in \mathbb{R} }[/math], we have
where [math]\displaystyle{ NR(x)=\text{ReLU}[N(x,1)] }[/math] and [math]\displaystyle{ \text{ReLU}_{w,b}(x)=\text{max}(w^{\top}+b, 0) }[/math], [math]\displaystyle{ w \in \mathbb{R}^D }[/math] and [math]\displaystyle{ b \in \mathbb{R} }[/math] is the bias.
Theorem 3.2 Let [math]\displaystyle{ F = \sum_i{p_iN(m_i, \Sigma_i)} }[/math] be the mixture of (possibly degenerate) Gaussians and let the RBF unit be parametrized by [math]\displaystyle{ N(c, \Gamma) }[/math]. We have:
In the case where the data set contains no missing values, the generalized neurons reduce to classical ones, since the distribution [math]\displaystyle{ F }[/math] is only used to estimate possible values at missing attributes. However, if one wishes to use an incomplete data set in the testing stage, then an incomplete data set must be used to train the model.
[math]\displaystyle{ }[/math]
Theoretical Analysis
The main theoretical results, which are summarized below, show that using generalized neuron's activation at the first layer does not lead to the loss of information.
Let the generalized response of a neuron [math]\displaystyle{ n: \mathbb{R}^D \rightarrow \mathbb{R} }[/math] evaluated on a probability measure [math]\displaystyle{ \mu }[/math] which is given by [math]\displaystyle{ n(\mu) := \int n(x)d\mu(x) }[/math].
Theorem 4.1 shows that a neural network with generalized ReLU units is able to identify any two probability measures. The proof presented by the authors uses the Universal Approximation Property (UAP), and is summarized as follows.
Theorem 4.1. Let [math]\displaystyle{ \mu }[/math], [math]\displaystyle{ v }[/math] be probabilistic measures satisfying [math]\displaystyle{ \int ||x|| d \mu(x) \lt \infty }[/math]. If [math]\displaystyle{ ReLU_{w,b}(\mu) = ReLU_{w,b}(\nu) \text{ for } w \in \mathbb{R}^D, b \in \mathbb{R} \text{ then } v = \mu }[/math]
Sketch of Proof Let [math]\displaystyle{ w \in \mathbb{R}^D }[/math] be fixed and define the set [math]\displaystyle{ F_w = \{p: \mathbb{R} \rightarrow \mathbb{R}: \int p(w^Tx)d\mu(x) = \int p(w^Tx)d\nu(x)\} }[/math]. The first step of the proof involves showing that [math]\displaystyle{ F_w }[/math] contains all continuous and bounded functions. The authors show this by showing that a piecewise continuous function that is affine linear on specific intervals, [math]\displaystyle{ Q }[/math], is in the set [math]\displaystyle{ F_w }[/math]. This involves re-writing [math]\displaystyle{ Q }[/math] as a sum of tent-like piecewise linear functions, [math]\displaystyle{ T }[/math] and showing that [math]\displaystyle{ T \in F_w }[/math] (since it is sufficient to only show [math]\displaystyle{ T \in F_w }[/math]).
Next, the authors show that an arbitrary bounded continuous function [math]\displaystyle{ G }[/math] is in [math]\displaystyle{ F_w }[/math] by the Lebesgue dominated convergence theorem.
Then, as [math]\displaystyle{ cos(\cdot), sin(\cdot) \in F_w }[/math], the function [math]\displaystyle{ exp(ir) = cos(r) + sin(r) \in F_w }[/math] and we have the equality [math]\displaystyle{ \int exp(iw^Tx)d\mu(x) = \int exp(iw^Tx)d\nu(x) }[/math]. Since [math]\displaystyle{ w }[/math] was arbitrarily chosen, we can conclude that [math]\displaystyle{ \mu = \nu }[/math] as the characteristic functions of the two measures coincide.
More general results can be obtained making stronger assumptions on the probability measures, for example if a given family of neurons satisfies UAP, then their generalization can identify any probability measure with compact support.
Experimental Results
Conclusion
Critiques
References
[1] Yoshua Bengio and Francois Gingras. Recurrent neural networks for missing or asynchronous data. In Advances in neural information processing systems, pages 395–401, 1996.
[2] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.