cardinality Restricted Boltzmann Machines: Difference between revisions

From statwiki
Jump to navigation Jump to search
Line 62: Line 62:
<center>
<center>
<math>
<math>
J(\mathbf{x}) &=& \mathbb{E}_{P(h|v)}[hh^{T}]-\mathbb{E}_{P(h|v)}[h]\mathbb{E}_{P(h|v)}[h^{T}]\\
J(\mathbf{x})=\mathbb{E}_{P(h|v)}[hh^{T}]-\mathbb{E}_{P(h|v)}[h]\mathbb{E}_{P(h|v)}[h^{T}]=\mathbb{E}_{P(h|v)}[hh^{T}]-\mu(\mathbf{x})\mu(\mathbf{x})^{T}
    &=& \mathbb{E}_{P(h|v)}[hh^{T}]-\mu(\mathbf{x})\mu(\mathbf{x})^{T}
</math>
</math>
</center>
</center>
Line 69: Line 68:
<center>
<center>
<math>
<math>
\frac{\partial L}{\partial \mathbf{x}}=\frac{\partial \mu^{T}}{\partial \mathbf{x}}\frac{\partial L}{\partial \mu}=J(\mathbf{x})^{T}\frac{partial L}{\partial \mu}
\frac{\partial L}{\partial \mathbf{x}}=\frac{\partial \mu^{T}}{\partial \mathbf{x}}\frac{\partial L}{\partial \mu}=J(\mathbf{x})^{T}\frac{\partial L}{\partial \mu}
</math>
</math>
</center>
</center>
Line 75: Line 74:
<center>
<center>
<math>
<math>
\lim_{\epsilon \rightarrow 0}\frac{f(x+\epsilon \ell)-f(x)}{\epsilon} &=& \lim_{\epsilon \rightarrow 0}\frac{f(x)+\epsilon J \ell+o(\epsilon)-f(x)}{\epsilon}\\
\lim_{\epsilon \rightarrow 0}\frac{f(x+\epsilon \ell)-f(x)}{\epsilon}=\lim_{\epsilon \rightarrow 0}\frac{f(x)+\epsilon J \ell+o(\epsilon)-f(x)}{\epsilon}=J \ell
&=& J \ell
</math>
</math>
</center>
</center>

Revision as of 17:43, 10 August 2013

Overview

Restricted Boltzmann Machine (RBM) is a probabilistic model which is usually represented by a graphical model. RBM can encode a probability distribution and therefore can be used for density estimation. It can also be viewed as a generative stochastic model of a neural network (with one hidden layer). Although interesting by itself, popularity of RBM started from the time that Hinton and others used it as the building blocks of Deep Belief Networks (DBNs). From the computational point of view, the good characteristic of RBN is that the posterior over hidden variables is factorizable.

At least two notions of sparsity can be investigated for graphical models. The first notion is the sparsity of the graph (i.e., something related to the number of edges in the graph). In this sense, a sparse model is one that has a lot of independencies in its structure. However, this is not the focus of this paper. In fact, the structure of RBM is fixed.

The other notion of sparsity is the usual definition in the other approaches. In the neural network literature, it means that we do not want the entire neurons to be activated in the same time. For our RBM, we want that the values of hidden variables be sparse for (almost) all inputs.

Motivations and Challenges

Sparsity is usually regarded as a good property for modeling complex phenomenons. From theoretical point of view, statisticians sometimes relate it to the idea of shrinkage. Also, experimental analysis shows that it is certainly useful. In particular, it has been suggested that sparse deep architectures can be very efficient. Finally, sparsity increases the interpretability of data.

The problem is for RBM is that if we want to add sparsity constraints, then the hidden variables would be coupled together. Therefore, the posterior would not be factorizable anymore, which makes sparse RBM intractable.

RBM Formulation

Assume [math]\displaystyle{ v \in \{0,1\}^{N_v} }[/math] and [math]\displaystyle{ h \in \{0,1\}^{N_h} }[/math] are the vectors of binary valued variables, corresponding to visible and hidden variables respectively. For RBM, the joint probability over these variables is defined as follows:

[math]\displaystyle{ P(v,h) = \frac{1}{Z} exp(v^{T}Wh+v^{T}b_{v}+h^{T}b_{h}) }[/math]

where [math]\displaystyle{ Z }[/math] is just a normalization factor and [math]\displaystyle{ W,b_{v},b_{h} }[/math] are the parameters.

One way to find the parameters is to optimize the likelihood function. However, finding an exact solution is not tractable. Therefore, an approximation of the gradient, called "contrastive divergence", is used for learning the parameters.

The Sparse RBM (SpRBM)

The idea of SpRBM is simple. We add a penalty term to the log-likelihood objective in order to encourage sparsity. The penalty term used in SpRBM is

[math]\displaystyle{ \frac{}{} \lambda (\rho \log{q_{j}} + (1 - \rho) \log{(1-q_{j})}) }[/math]

where [math]\displaystyle{ q_{j} = \frac{1}{N}\sum{P(h_{j}=1|v_{n})} }[/math] reflects the average unit marginal.

The Cardinality RBM (CaRBM)

The difference between CaRBM and the standard RBM is in the joint probability model. In particular, it has an additional factor

[math]\displaystyle{ P(v,h) = \frac{1}{Z} exp(v^{T}Wh+v^{T}b_{v}+h^{T}b_{h}). \psi_k (\sum^{N_h}_{j=1} h_j ) }[/math]

where [math]\displaystyle{ \psi_k (.) }[/math] is the potential function and [math]\displaystyle{ \psi_k (c)=1 }[/math] if [math]\displaystyle{ c \le 1 }[/math]. Therefore, the potential function ensures that the hidden variables are sparse with sparsity parameter equal to [math]\displaystyle{ k }[/math].

Note that [math]\displaystyle{ P(v|h) }[/math] is still easy to compute. However, computing the posterior [math]\displaystyle{ P(h|v) }[/math] is not easy anymore. Fortunately, there is an efficient inference algorithm for computing the marginals and therefore the posterior. This algorithm can be regarded as a dynamic programming; also, it can be interpreted as the sum-product algorithm. The authors prefer the sum-product explanation and define some auxiliary variables [math]\displaystyle{ z_j }[/math] that encode the sum of the first [math]\displaystyle{ j }[/math] hidden variables. Now the new variables are easier to use for the cardinality potential due to the chain-type dependency between them. Finally, we have a chain-structured graphical model and we can perform exact inference using sum-product algorithm.

Using CaRBM for DBNs

The original DBN builds upon layers of RBMs. These layers are trained layer-by-layer in a greedy and unsupervised manner. Then, the learned parameters are used to initialize a typical gradient-descent algorithm for learning deep neural networks. The challenge for doing this is to select an appropriate non-linearity in the deep neural network. We can not simply use an arbitrary function (e.g., hyperbolic tangent) because in that case the solution of RBM will not be a good initialization for the deep network.

The non-linear function [math]\displaystyle{ \mu }[/math] should be selected in way that

[math]\displaystyle{ \mu (W^{T}v+b_{h}) = \mathbb{E}_{P(h|v)}[h] }[/math]

For RBM, this results in a simple sigmoid function for [math]\displaystyle{ \mu }[/math]. However, for CaRBM the solution is not clear. In fact, there does not seem to be a closed-form solution. Instead, we can still encorporate the above formula (i.e., the expectation of the hidden variables) and use message-passing algorithm as the non-linearity.

Furthermore, to compute the gradient is conputationaly expesive ([math]\displaystyle{ O(N^{2}) }[/math]) because of the back-propagation through [math]\displaystyle{ \mu }[/math]. The authors suggest using a finite difference method for computing the Jacobian and therefore the gradient.

Let [math]\displaystyle{ \mathbf{x}=W^{T}v+b_{h} }[/math] be the total input to the RBM's hidden units. The Jacobian is calculated as:

[math]\displaystyle{ J(\mathbf{x})=\mathbb{E}_{P(h|v)}[hh^{T}]-\mathbb{E}_{P(h|v)}[h]\mathbb{E}_{P(h|v)}[h^{T}]=\mathbb{E}_{P(h|v)}[hh^{T}]-\mu(\mathbf{x})\mu(\mathbf{x})^{T} }[/math]

By chain rule, the gradient of the loss function is

[math]\displaystyle{ \frac{\partial L}{\partial \mathbf{x}}=\frac{\partial \mu^{T}}{\partial \mathbf{x}}\frac{\partial L}{\partial \mu}=J(\mathbf{x})^{T}\frac{\partial L}{\partial \mu} }[/math]

Note that the Jacobian is symmetric, and it is easy to multiply the Jacobian using the numerical differentiation. Thus, for any differentiable function [math]\displaystyle{ f(x) }[/math] and any vector [math]\displaystyle{ \ell }[/math],

[math]\displaystyle{ \lim_{\epsilon \rightarrow 0}\frac{f(x+\epsilon \ell)-f(x)}{\epsilon}=\lim_{\epsilon \rightarrow 0}\frac{f(x)+\epsilon J \ell+o(\epsilon)-f(x)}{\epsilon}=J \ell }[/math]

Since [math]\displaystyle{ \mu }[/math] is a differentiable function, we can backpropagate a vector of derivatives [math]\displaystyle{ \ell=\partial L/\partial \mu }[/math] as:

[math]\displaystyle{ J(\mathbf{x})\ell \approx \frac{\mu(\mathbf{x}+\epsilon \ell)-\mu(\mathbf{x}-\epsilon \ell)}{2 \epsilon} }[/math]

This approach provides the best balance between speed and accuracy.

Experiments

In the fist experiment, the test errors of RBM, SpRBM and CaRBM are compared for different datasets. The results are shown bellow. Note that CaRBM is not superior to the other methods. Also, the statistical significance of the results is not reported. File:carbm.png

The next experiment is on a NIPS dataset for finding latent topics. The three selected topics are reported in the picture below.

File:carbm2.png

In the next experiment, we can see the effect of the population sparsity parameter on the performance of SpRBM. As we increase [math]\displaystyle{ \lambda }[/math], the acquired results tends to be more sparse.

File:carbm3.png

Conclusions

In this paper, the idea of using a cardinality potential in the joint probability of the RBM was proposed. This modification in the model resulted in sparse hidden variables. However, this modification made the inference harder. The authors tackled this problem by introducing auxiliary variables to create a chain-structured model.

Furthermore, they showed how to use CaRBM in a DBN. First of all, they argued that we should use the marginals as the non-linear units of the neural network. Second, they used a finite-difference method to approximate the gradient.

The results were generally similar to SpRBM, even slightly worse. However, in some cases it was argued that SpRBM has difficulty to accommodate the high variability of sparsity and CaRBM is superior in those cases.