Conditional Neural Process
Revision as of 02:56, 19 November 2018
Introduction
Deep neural networks require large datasets to train effectively. One approach to mitigating this data-efficiency problem is to learn in two phases: the first phase learns the statistics of a generic domain without committing to a specific learning task; the second phase learns a function for a specific task, but does so using only a small number of data points by exploiting the domain-wide statistics already learned.
For example, consider a data set [math]\displaystyle{ \{x_i, y_i\} }[/math] with evaluations [math]\displaystyle{ y_i = f(x_i) }[/math] for some unknown function [math]\displaystyle{ f }[/math]. Let [math]\displaystyle{ g }[/math] be a function that approximates [math]\displaystyle{ f }[/math]. The aim is to minimize the loss between [math]\displaystyle{ f }[/math] and [math]\displaystyle{ g }[/math] over the entire space [math]\displaystyle{ X }[/math]. In practice, the loss is evaluated only on a finite set of observations.
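As a minimal sketch of evaluating such a loss on a finite set of observations, the snippet below uses a mean squared error. The choice of squared error, the stand-in functions `f` and `g`, and the name `empirical_loss` are assumptions made for illustration and do not come from the source.

```python
import numpy as np

def empirical_loss(f, g, xs):
    """Mean squared difference between f and g on a finite set of inputs."""
    xs = np.asarray(xs, dtype=float)
    return float(np.mean((f(xs) - g(xs)) ** 2))

f = np.sin                        # stands in for the unknown function f
g = lambda x: x - x**3 / 6        # hypothetical approximator (cubic Taylor polynomial)
xs = np.linspace(-1.0, 1.0, 21)   # finite set of observation points in X

loss = empirical_loss(f, g, xs)
print(loss)
```

The loss is small here only because the observation points lie where the approximation is good; it says nothing about how `g` behaves on the rest of the space.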
In this work, the authors propose a family of models that represent solutions to the supervised problem, together with an end-to-end training approach for learning them, that combine neural networks with features reminiscent of Gaussian processes. They call this family of models Conditional Neural Processes (CNPs).
Model
Let the training set be [math]\displaystyle{ O = \{x_i, y_i\}_{i = 0} ^ {n - 1} }[/math], and the test set be [math]\displaystyle{ T = \{x_i, y_i\}_{i = n} ^ {n + m - 1} }[/math].
Let P be a probability distribution over functions [math]\displaystyle{ F : X \to Y }[/math], formally known as a stochastic process. Then P defines a joint distribution over the random variables [math]\displaystyle{ \{f(x_i)\}_{i = 0} ^{n + m - 1} }[/math], and therefore a conditional distribution [math]\displaystyle{ P(f(x)|O, T) }[/math]. Our task is to predict the output values [math]\displaystyle{ f(x_i) }[/math] for [math]\displaystyle{ x_i \in T }[/math], given [math]\displaystyle{ O }[/math].
Conditional Neural Process
Conditional Neural Process (CNP) models directly parametrize conditional stochastic processes without imposing consistency with respect to some prior process. A CNP parametrizes distributions over [math]\displaystyle{ f(T) }[/math] given a distributed representation of [math]\displaystyle{ O }[/math] of fixed dimensionality. Thus, the mathematical guarantees associated with stochastic processes are traded off for functional flexibility and scalability.
A CNP is a conditional stochastic process [math]\displaystyle{ Q_\theta }[/math] that defines distributions over [math]\displaystyle{ f(x_i) }[/math] for [math]\displaystyle{ x_i \in T }[/math]. As with any stochastic process, we assume [math]\displaystyle{ Q_\theta }[/math] is invariant to permutations of [math]\displaystyle{ O }[/math] and [math]\displaystyle{ T }[/math], and in this work we generally enforce permutation invariance with respect to [math]\displaystyle{ T }[/math] by assuming a factored structure. That is, [math]\displaystyle{ Q_\theta(f(T) | O, T) = \prod _{x \in T} Q_\theta(f(x) | O, x) }[/math]
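Under the factored structure, the log-likelihood of the targets is a sum of per-target terms, so reordering the targets cannot change it. The sketch below illustrates this with Gaussian per-target distributions; the Gaussian form, the numerical values, and the function names are assumptions made for illustration.

```python
import numpy as np

def gaussian_log_pdf(y, mean, std):
    """Log density of y under a Gaussian with the given mean and standard deviation."""
    return -0.5 * np.log(2.0 * np.pi * std**2) - 0.5 * ((y - mean) / std) ** 2

def factored_log_likelihood(y, mean, std):
    """log Q(f(T) | O, T) written as a sum of independent per-target terms."""
    return float(np.sum(gaussian_log_pdf(y, mean, std)))

# Hypothetical per-target predictions and observed values for three targets.
y = np.array([0.1, -0.4, 1.2])
mean = np.array([0.0, -0.5, 1.0])
std = np.array([0.3, 0.2, 0.5])

perm = np.array([2, 0, 1])  # an arbitrary reordering of the targets
ll = factored_log_likelihood(y, mean, std)
ll_perm = factored_log_likelihood(y[perm], mean[perm], std[perm])
print(np.isclose(ll, ll_perm))
```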
In detail, we use the following architecture:
[math]\displaystyle{ r_i = h_\theta(x_i, y_i) }[/math] for any [math]\displaystyle{ (x_i, y_i) \in O }[/math], where [math]\displaystyle{ h_\theta : X \times Y \to \mathbb{R} ^ d }[/math]
[math]\displaystyle{ r = r_1 * r_2 * ... * r_n }[/math], where [math]\displaystyle{ * }[/math] is a commutative operation that takes elements in [math]\displaystyle{ \mathbb{R}^d }[/math] and maps them into a single element of [math]\displaystyle{ \mathbb{R} ^ d }[/math]
[math]\displaystyle{ \Phi_i = g_\theta(x_i, r) }[/math] for any [math]\displaystyle{ x_i \in T }[/math], where [math]\displaystyle{ g_\theta : X \times \mathbb{R} ^ d \to \mathbb{R} ^ e }[/math] and [math]\displaystyle{ \Phi_i }[/math] are parameters for [math]\displaystyle{ Q_\theta }[/math]
Note that this architecture ensures permutation invariance and [math]\displaystyle{ O(n + m) }[/math] scaling for conditional prediction. Also, because [math]\displaystyle{ r = r_1 * r_2 * ... * r_n }[/math] can be computed in [math]\displaystyle{ O(n) }[/math] and updated incrementally as new observations arrive, this architecture supports streaming observations with minimal overhead.
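The three steps above can be sketched as a forward pass in NumPy. This is an untrained illustration under stated assumptions, not the authors' implementation: the single-layer random weights `W_h` and `W_g`, the choice of the mean as the commutative operation, the Gaussian output with a softplus-transformed standard deviation, and all function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # representation size d (illustrative value)

# Randomly initialised single-layer stand-ins for h_theta and g_theta (untrained).
W_h = rng.normal(size=(2, D))       # encoder weights: (x, y) -> R^d
W_g = rng.normal(size=(1 + D, 2))   # decoder weights: (x, r) -> R^e with e = 2

def encode(x, y):
    """h_theta: map each observation (x_i, y_i) to a representation r_i."""
    return np.tanh(np.stack([x, y], axis=-1) @ W_h)   # shape (n, D)

def aggregate(r_i):
    """Commutative aggregation; the mean is one valid choice."""
    return r_i.mean(axis=0)                           # shape (D,)

def decode(x_target, r):
    """g_theta: map (x, r) to Gaussian parameters (mean, std) per target."""
    m = len(x_target)
    inputs = np.concatenate([x_target[:, None], np.tile(r, (m, 1))], axis=1)
    phi = inputs @ W_g                                # shape (m, 2)
    mean = phi[:, 0]
    std = 0.1 + np.logaddexp(0.0, phi[:, 1])          # softplus keeps std positive
    return mean, std

def streaming_update(r, n, r_new):
    """Fold one more representation into the mean of n earlier ones."""
    return r + (r_new - r) / (n + 1)

x_obs = np.array([-1.0, 0.0, 0.5, 2.0])
y_obs = np.sin(x_obs)
x_tgt = np.linspace(-2.0, 2.0, 5)

r = aggregate(encode(x_obs, y_obs))                   # one pass over n observations
mean, std = decode(x_tgt, r)                          # one pass over m targets

# Reordering the observations leaves r unchanged.
perm = rng.permutation(len(x_obs))
r_perm = aggregate(encode(x_obs[perm], y_obs[perm]))

# Adding the fourth observation to a summary of the first three.
r_three = aggregate(encode(x_obs[:3], y_obs[:3]))
r_stream = streaming_update(r_three, 3, encode(x_obs[3:], y_obs[3:])[0])
```

With the mean as the aggregator, a new observation is absorbed without revisiting earlier ones, which is what makes streaming cheap in this sketch. Other commutative operations, such as a sum, would serve the same purpose.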
We train [math]\displaystyle{ Q_\theta }[/math] by asking it to predict [math]\displaystyle{ O }[/math] conditioned on a randomly chosen subset of [math]\displaystyle{ O }[/math]. This gives the model a signal of the uncertainty over the space X inherent in the distribution P given a set of observations. Thus, the targets on which [math]\displaystyle{ Q_\theta }[/math] is scored include both observed and unobserved values. In practice, we take Monte Carlo estimates of the gradient of the training loss by sampling f and N, where N is the number of observations in the conditioning subset. This approach shifts the burden of imposing prior knowledge from an analytic prior to empirical data. This has the advantage of liberating a practitioner from having to specify an analytic form for the prior, which is ultimately intended to summarize their empirical experience. Still, we emphasize that the [math]\displaystyle{ Q_\theta }[/math] are not necessarily a consistent set of conditionals for all observation sets, and the training routine does not guarantee consistency.
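The sampling scheme described above can be sketched as follows. The snippet only estimates the loss value by Monte Carlo; a real training loop would differentiate it with an automatic differentiation library. Everything named here is an assumption for illustration: the `predict(x_ctx, y_ctx, x_tgt)` interface, the trivial `baseline_predict` model, the Gaussian negative log-likelihood, and the family of sinusoids standing in for P.

```python
import numpy as np

def gaussian_nll(y, mean, std):
    """Negative log-likelihood of y under independent Gaussians, summed over points."""
    return float(np.sum(0.5 * np.log(2.0 * np.pi * std**2)
                        + 0.5 * ((y - mean) / std) ** 2))

def sample_loss(predict, x, y, rng):
    """One Monte Carlo sample of the loss for a single sampled function.

    predict(x_ctx, y_ctx, x_tgt) -> (mean, std) is a hypothetical model interface.
    """
    n = len(x)
    N = int(rng.integers(0, n))              # conditioning-set size in {0, ..., n-1}
    idx = rng.permutation(n)[:N]             # randomly chosen subset of O
    mean, std = predict(x[idx], y[idx], x)   # targets are all of O, observed or not
    return gaussian_nll(y, mean, std)

def baseline_predict(x_ctx, y_ctx, x_tgt):
    """Trivial stand-in model: the context mean everywhere, with unit std."""
    level = y_ctx.mean() if len(y_ctx) > 0 else 0.0
    return np.full(len(x_tgt), level), np.ones(len(x_tgt))

rng = np.random.default_rng(0)
losses = []
for _ in range(100):
    # Sample f ~ P; here P is a hypothetical family of sinusoids with random phase.
    phase = rng.uniform(0.0, 2.0 * np.pi)
    x = rng.uniform(-3.0, 3.0, size=10)
    y = np.sin(x + phase)
    losses.append(sample_loss(baseline_predict, x, y, rng))

mc_estimate = float(np.mean(losses))
print(mc_estimate)
```

Because the targets always include points outside the conditioning subset, a model that is overconfident away from its observations is penalised, which is the uncertainty signal the paragraph refers to.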
In summary,
1. A CNP is a conditional distribution over functions trained to model the empirical conditional distributions of functions f ∼ P.
2. A CNP is permutation invariant in O and T.
3. A CNP is scalable, achieving a running time complexity of O(n + m) for making m predictions with n observations.