Conditional Neural Process
Introduction
Deep neural networks require large datasets to be trained effectively. One approach to mitigating this data-efficiency problem is to learn in two phases: the first phase learns the statistics of a generic domain without committing to a specific learning task; the second phase learns a function for a specific task, but does so using only a small number of data points, by exploiting the domain-wide statistics already learned.
For example, consider a data set [math]\displaystyle{ \{x_i, y_i\} }[/math] with evaluations [math]\displaystyle{ y_i = f(x_i) }[/math] for some unknown function [math]\displaystyle{ f }[/math]. Assume [math]\displaystyle{ g }[/math] is an approximating function of [math]\displaystyle{ f }[/math]. The aim is to minimize the loss between [math]\displaystyle{ f }[/math] and [math]\displaystyle{ g }[/math] on the entire space [math]\displaystyle{ X }[/math]; in practice, however, the loss can only be evaluated on a finite set of observations.
In this work, the authors propose a family of models that represent solutions to the supervised learning problem, together with an end-to-end training approach for learning them, that combines neural networks with features reminiscent of Gaussian Processes. They call this family of models Conditional Neural Processes (CNPs).
Model
Let the training set (the observations) be [math]\displaystyle{ O = \{(x_i, y_i)\}_{i = 0}^{n - 1} }[/math], and the test set be the unlabelled target points [math]\displaystyle{ T = \{x_i\}_{i = n}^{n + m - 1} }[/math].
Let [math]\displaystyle{ P }[/math] be a probability distribution over functions [math]\displaystyle{ f : X \to Y }[/math], formally known as a stochastic process. Then [math]\displaystyle{ P }[/math] defines a joint distribution over the random variables [math]\displaystyle{ \{f(x_i)\}_{i = 0}^{n + m - 1} }[/math]. The task is therefore to predict the output values [math]\displaystyle{ f(x_i) }[/math] for [math]\displaystyle{ x_i \in T }[/math] given [math]\displaystyle{ O }[/math], that is, to model the conditional distribution [math]\displaystyle{ P(f(T) \mid O, T) }[/math].
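To make this setup concrete, the following is a minimal Python sketch (not from the paper) of how the evaluations of a single function drawn from [math]\displaystyle{ P }[/math] could be split into an observation set [math]\displaystyle{ O }[/math] and a target set [math]\displaystyle{ T }[/math]; the sine-shaped toy distribution and all names are illustrative assumptions.

<pre>
# Minimal sketch (illustrative, not the authors' code): draw one function from
# a toy distribution P and split its evaluations into an observation set O and
# a target set T, matching the index convention above.
import numpy as np

rng = np.random.default_rng(0)

n, m = 10, 5                              # n observations, m targets
x = rng.uniform(-2.0, 2.0, size=n + m)    # input locations in X

# Toy "f ~ P": a sine wave with random amplitude and phase (an assumption).
amplitude = rng.uniform(0.5, 2.0)
phase = rng.uniform(0.0, np.pi)
y = amplitude * np.sin(x + phase)

O = list(zip(x[:n], y[:n]))   # O = {(x_i, y_i)}_{i=0}^{n-1}
T = list(x[n:])               # T = {x_i}_{i=n}^{n+m-1}, the unlabelled targets
</pre>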
Conditional Neural Process
Conditional Neural Processes directly parametrize conditional stochastic processes without imposing consistency with respect to some prior process. A CNP parametrizes distributions over [math]\displaystyle{ f(T) }[/math] given a distributed representation of [math]\displaystyle{ O }[/math] of fixed dimensionality. Thus, the mathematical guarantees associated with stochastic processes are traded off for functional flexibility and scalability.
A CNP is a conditional stochastic process [math]\displaystyle{ Q_\theta }[/math] that defines distributions over [math]\displaystyle{ f(x_i) }[/math] for [math]\displaystyle{ x_i \in T }[/math]. As for stochastic processes, we assume that [math]\displaystyle{ Q_\theta }[/math] is invariant to permutations of [math]\displaystyle{ O }[/math] and [math]\displaystyle{ T }[/math]. In this work, we generally enforce permutation invariance with respect to [math]\displaystyle{ T }[/math] by assuming a factored structure. That is,

[math]\displaystyle{ Q_\theta(f(T) \mid O, T) = \prod_{x \in T} Q_\theta(f(x) \mid O, x) }[/math]
In detail, we use the following architecture:
[math]\displaystyle{ r_i = h_\theta(x_i, y_i) }[/math] for any [math]\displaystyle{ (x_i, y_i) \in O }[/math]
[math]\displaystyle{ r = r_1 \oplus r_2 \oplus \dots \oplus r_n }[/math]

[math]\displaystyle{ \phi_i = g_\theta(x_i, r) }[/math] for any [math]\displaystyle{ x_i \in T }[/math]

where [math]\displaystyle{ h_\theta : X \times Y \to \mathbb{R}^d }[/math] and [math]\displaystyle{ g_\theta : X \times \mathbb{R}^d \to \mathbb{R}^e }[/math] are neural networks, [math]\displaystyle{ \oplus }[/math] is a commutative operation that takes elements in [math]\displaystyle{ \mathbb{R}^d }[/math] and maps them into a single element of [math]\displaystyle{ \mathbb{R}^d }[/math], and the [math]\displaystyle{ \phi_i }[/math] are the parameters for [math]\displaystyle{ Q_\theta(f(x_i) \mid O, x_i) = Q(f(x_i) \mid \phi_i) }[/math]. Depending on the task, the model learns to parametrize a different output distribution. This architecture ensures permutation invariance and [math]\displaystyle{ O(n + m) }[/math] scaling for conditional prediction. Note that, since [math]\displaystyle{ r_1 \oplus \dots \oplus r_n }[/math] can be computed in [math]\displaystyle{ O(1) }[/math] from [math]\displaystyle{ r_1 \oplus \dots \oplus r_{n-1} }[/math], this architecture supports streaming observations with minimal overhead.

For regression tasks, we use [math]\displaystyle{ \phi_i }[/math] to parametrize the mean and variance [math]\displaystyle{ \phi_i = (\mu_i, \sigma_i^2) }[/math] of a Gaussian distribution [math]\displaystyle{ \mathcal{N}(\mu_i, \sigma_i^2) }[/math] for every [math]\displaystyle{ x_i \in T }[/math]. For classification tasks, [math]\displaystyle{ \phi_i }[/math] parametrizes the logits of the class probabilities [math]\displaystyle{ p_c }[/math] over the [math]\displaystyle{ c }[/math] classes of a categorical distribution. In most of the experiments, [math]\displaystyle{ a_1 \oplus \dots \oplus a_n }[/math] is taken to be the mean operation [math]\displaystyle{ (a_1 + \dots + a_n)/n }[/math].

Training CNPs

We train [math]\displaystyle{ Q_\theta }[/math] by asking it to predict [math]\displaystyle{ O }[/math] conditioned on a randomly chosen subset of [math]\displaystyle{ O }[/math]. This gives the model a signal of the uncertainty over the space [math]\displaystyle{ X }[/math] inherent in the distribution [math]\displaystyle{ P }[/math], given a set of observations. More precisely, let [math]\displaystyle{ f \sim P }[/math], let [math]\displaystyle{ O = \{(x_i, y_i)\}_{i = 0}^{n - 1} }[/math] be a set of observations, and let [math]\displaystyle{ N \sim \mathrm{uniform}[0, \dots, n - 1] }[/math]. We condition on the subset [math]\displaystyle{ O_N = \{(x_i, y_i)\}_{i = 0}^{N} \subset O }[/math], the first [math]\displaystyle{ N }[/math] elements of [math]\displaystyle{ O }[/math], and minimize the negative conditional log probability

[math]\displaystyle{ \mathcal{L}(\theta) = -\mathbb{E}_{f \sim P}\left[\mathbb{E}_N\left[\log Q_\theta\left(\{y_i\}_{i = 0}^{n - 1} \mid O_N, \{x_i\}_{i = 0}^{n - 1}\right)\right]\right] }[/math]

Thus, the targets on which [math]\displaystyle{ Q_\theta }[/math] is scored include both the observed and the unobserved values. In practice, we take Monte Carlo estimates of the gradient of this loss by sampling [math]\displaystyle{ f }[/math] and [math]\displaystyle{ N }[/math].

This approach shifts the burden of imposing prior knowledge from an analytic prior to empirical data. This has the advantage of liberating a practitioner from having to specify an analytic form for the prior, which is ultimately intended to summarize their empirical experience. Still, we emphasize that the [math]\displaystyle{ Q_\theta }[/math] are not necessarily a consistent set of conditionals for all observation sets, and the training routine does not guarantee this. In summary:

1. A CNP is a conditional distribution over functions, trained to model the empirical conditional distributions of functions [math]\displaystyle{ f \sim P }[/math].

2. A CNP is permutation invariant in [math]\displaystyle{ O }[/math] and [math]\displaystyle{ T }[/math].

3. A CNP is scalable, achieving a running-time complexity of [math]\displaystyle{ O(n + m) }[/math] for making [math]\displaystyle{ m }[/math] predictions with [math]\displaystyle{ n }[/math] observations.

Within this specification of the model, there are still some aspects that can be modified to suit specific requirements. The exact implementation of [math]\displaystyle{ h }[/math], for example, can be adapted to the data type: for low-dimensional data the encoder can be implemented as an MLP, whereas for inputs with larger dimensions and spatial correlations it can also include convolutions. Finally, in the setup described, the model is not able to produce any coherent samples, as it learns to model only a factored prediction of the means and the variances, disregarding the covariance between target points. This is a result of this particular implementation of the model. One way to obtain coherent samples is to introduce a latent variable that can be sampled from; the authors carry out proof-of-concept experiments on such a model in Section 4.2.3 of the paper.
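To illustrate the encoder/aggregation/decoder structure described above, here is a minimal PyTorch sketch of a CNP for 1-D regression. It is not the authors' implementation: the layer sizes, the softplus used to keep the predicted standard deviation positive, and the class and argument names are all illustrative choices.

<pre>
# Minimal PyTorch sketch of a CNP for 1-D regression (illustrative, not the
# authors' implementation). h encodes each (x_i, y_i) pair, the mean over
# pairs implements the commutative aggregation, and g maps (x_target, r) to
# the Gaussian parameters phi = (mu, sigma).
import torch
import torch.nn as nn

class CNP(nn.Module):
    def __init__(self, x_dim=1, y_dim=1, r_dim=128, hidden=128):
        super().__init__()
        # Encoder h_theta: (x_i, y_i) -> r_i in R^d
        self.h = nn.Sequential(
            nn.Linear(x_dim + y_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, r_dim),
        )
        # Decoder g_theta: (x_target, r) -> (mu, raw sigma)
        self.g = nn.Sequential(
            nn.Linear(x_dim + r_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * y_dim),
        )

    def forward(self, x_obs, y_obs, x_target):
        # x_obs: (n, x_dim), y_obs: (n, y_dim), x_target: (m, x_dim)
        r_i = self.h(torch.cat([x_obs, y_obs], dim=-1))   # (n, r_dim)
        r = r_i.mean(dim=0, keepdim=True)                 # aggregation r_1 ⊕ ... ⊕ r_n
        r = r.expand(x_target.shape[0], -1)               # one copy of r per target
        out = self.g(torch.cat([x_target, r], dim=-1))    # (m, 2 * y_dim)
        mu, sigma_raw = out.chunk(2, dim=-1)
        sigma = 0.1 + 0.9 * nn.functional.softplus(sigma_raw)  # keep sigma positive
        return mu, sigma
</pre>

Taking the mean over the encoded pairs implements the commutative aggregation [math]\displaystyle{ \oplus }[/math], which is what makes the prediction invariant to permutations of [math]\displaystyle{ O }[/math].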
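The training objective can be sketched in the same spirit, again as an illustration rather than the authors' code: a context size is sampled, the model is conditioned on that prefix of the observations, and the Gaussian log-likelihood of all [math]\displaystyle{ n }[/math] points is scored, mirroring a Monte Carlo estimate of [math]\displaystyle{ \mathcal{L}(\theta) }[/math]. It reuses the CNP class from the sketch above; the learning rate and the restriction to a non-empty context are simplifying assumptions.

<pre>
# Illustrative training step (not the authors' code), reusing the CNP class
# above. x: (n, 1) and y: (n, 1) are the evaluations of one sampled function.
import torch
from torch.distributions import Normal

model = CNP()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def training_step(x, y):
    n = x.shape[0]
    # Context size N; kept >= 1 here so the mean aggregation is well defined
    # (a simplification of the paper's uniform[0, ..., n - 1]).
    N = torch.randint(1, n, (1,)).item()
    mu, sigma = model(x[:N], y[:N], x)            # condition on O_N, score all of O
    loss = -Normal(mu, sigma).log_prob(y).mean()  # negative conditional log probability
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
</pre>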