the Wake-Sleep Algorithm for Unsupervised Neural Networks
Introduction
In considering general learning procedures, supervised methods for neural networks are limited in that they can only be executed in specifically-structured environments. For these systems to learn, the environment must be equipped with an external "teacher" providing the network with explicit feedback on its predictive performance. From there, the system needs a means for circulating this error information across the entire network, so that the weights can be adjusted accordingly. To apply neural networks in contexts not satisfying these specific requirements, the authors purpose the wake-sleep algorithm, a two-phase procedure in which each network layer effectively learns representations of the activity in adjacent hidden layers. Here, the network is composed of feed-forward "recognition" connections used to generate an internal representation of the input, and feed-back generative connections used to produce an estimated reconstruction of the original input based on this learned internal representation. The goal is to learn an efficient representation which accurately characterizes the input to the system.
Model Structure
The wake-sleep algorithm is used to train Helmholtz machines, networks possessing the feed-forward and feed-back connections as described above. In deploying a Helmholtz machine, we hope to encode an abstraction capturing the relevant properties of the data, and using this representation, find a generative reconstruction of the original input. This is analogous to learning a data-driven "bottom-up" representation of the input that is used to inform the higher-order "top-down" reconstruction.
Selecting a Cost Function
To enforce the requirement that the network produces efficient reconstructions of the data, the cost function is selected by viewing the problem as a task of information transmission. In this task, the original input vector is to be indirectly communicated from the sender to the receiver via first sending the internal representation of the datum learned by the system, and then passing along the deviation of the original input from its approximation produced by the generative reconstruction of the internal representation. Naturally, the objective then becomes to minimize the length of the sequence of bits that is needed to express the original input in this indirect manner. This corresponds to adopting the minimum description length (MDL) principle to guide the process of learning the representation. MDL is a general methodological principle stating that among a set of candidate models for the data, the one which can be represented in the fewest number of bits in the process of communication ought to be selected (see http://papers.nips.cc/paper/798-autoencoders-minimum-description-length-and-helmholtz-free-energy.pdf).
In order to see how the MDL criterion is to be implemented in this context, we must first specify a more precise network structure and the procedure which generates the reconstruction. The authors restrict the network to consist of stochastic binary units taking values in {0,1}, where the probability of unit v being active is given by
- [math]\displaystyle{ P(s_v = 1) = (1 + exp(-b_v - \sum_{u}^{} s_u w_{uv}))^{-1} }[/math] (1)
- [math]\displaystyle{ P(s_v = 1) = (1 + exp(-b_v - \sum_{u}^{} s_u w_{uv}))^{-1} }[/math] (1)
Here, [math]\displaystyle{ b_v }[/math] is the additive constant for unit v, and [math]\displaystyle{ w_{uv} }[/math] is the weight associated with the connection to unit u. For the bottom-up recognition connections, the units u which are summed over are from the immediately-preceding hidden layer, whereas these units will be from the immediately-following layer for the top-down generative connections.
Now, to produce a reconstruction of a particular input, we first feed the input into the network, using the recognition weights to activate the units of each successive layer until the highest hidden layer is reached. With multiple hidden layers, this process is analogous to creating a hierarchy of representations; the first hidden layer generates a representation of the original input, the second layer produces a representation of the representation provided to it by the first layer, and each additional layer similarly provides a higher-order meta-representation of the input. At the final layer, the network begins the process of using this set of representations to express the approximation via the generative connections. To do this, for each unit of the top layer, the sigmoid function is simply applied to the additive constant in (1), as there are no subsequent layers providing influence. Then, for each lower layer, the activation probabilities of its units are computed using (1), where the u 's are the set of units in the next-highest layer.
The collection of states [math]\displaystyle{ s_j }[/math] of the binary units with corresponding activation probabilities [math]\displaystyle{ p_j }[/math] for the representation produced by this process now gives us a means to quantify the description length of the representation. We know that in order to represent the state of the stochastic binary unit j, we require [math]\displaystyle{ C(s_j) := -s_jlog p_j - (1 - s_j)log (1 - p_j) }[/math] bits. So, for an input d, the total description length for representing d in terms of its learned representation [math]\displaystyle{ \alpha }[/math] is the sum of the description lengths for the individual hidden units of the network plus the description length of the procedure for decoding the original input given the hidden representation:
- [math]\displaystyle{ C(\alpha , d) = \sum_{j}^{}C(s_j^{\alpha}) + \sum_{i}^{}C(s_i^d | d) }[/math]
- [math]\displaystyle{ C(\alpha , d) = \sum_{j}^{}C(s_j^{\alpha}) + \sum_{i}^{}C(s_i^d | d) }[/math]
Recalling that the hidden units are stochastic, it is clear that a given input vector will have a broad array of possible network representations, making the recognition weights specify a distribution [math]\displaystyle{ Q(\alpha | d) }[/math] over representations of the input. Taking this into account, we realize that the cost function for representing an input d will be a representation cost averaged over the possible representations of d. If we also consider the entropy associated with the variability in representing d, the overall cost for representing d becomes [math]\displaystyle{ C(d) = \sum_{\alpha}^{} Q(\alpha | d)C(\alpha , d) - (- \sum_{\alpha}^{} Q(\alpha | d)log Q(\alpha | d)) }[/math]; namely, the expected description length for representing d subtracted by the entropy in representation.
Minimizing the Cost
It turns out that the representation cost we specified is related to the notion Helmholtz free energy in statistical physics. Analogous to this context, we find that the cost is minimized by setting the probability of a representation to be proportionate to an exponential function of its cost:
- [math]\displaystyle{ P(\alpha | d) = \frac {exp(-C(\alpha, d))}{\sum_{\beta}^{} exp(-C(\beta, d))} }[/math]
This corresponds to the Boltzmann distribution with temperature parameter set to 1. Essentially, this means that rather than attempting to minimize the description cost of any single particular representation of an input, we should seek to find recognition weights which allows [math]\displaystyle{ Q( \alpha |d) }[/math] to be a good approximation of this Boltzmann distribution. Given this cost and general criterion for minimizing it, we can now turn to the two-phase procedure used to determine the network weights.
The Wake Phase
As suggested by the name, in this stage, the network is receptive to external input. The system will receive input vectors which will be converted into internal representations across the hidden layers. To allow the input to propagate forwards through the network, the bottom-up recognition weights must be held fixed. Consequently, after an input vector has been processed in a bottom-up manner, we naturally seek to minimize top-down reconstruction error, the error incurred in using a layer's learned representation to predict the configuration in the layer below it. This is accomplished by updating the generative weights using the derivative of the description cost [math]\displaystyle{ C( \alpha , d) }[/math] with respect to the generative weights, for the learned representation [math]\displaystyle{ \alpha }[/math]. Using the purely local delta rule, this term for the top-down weight on the connection from k to j becomes:
- [math]\displaystyle{ \Delta w_{kj} = \epsilon s_k^{\alpha}(s_j^{\alpha} - p_j^{\alpha}) }[/math].
We see that this term encourages each hidden layer to improve its performance in predicting the states of the layer immediately beneath it.