=Residual Component Analysis: Generalizing PCA for more flexible inference in linear-Gaussian models=<br />
<br />
==Introduction==<br />
Probabilistic principal component analysis (PPCA) decomposes the covariance of a data vector <math> y</math> in <math>\mathbb{R}^p</math> into a low-rank term and a spherical noise term: <center><math>y \sim \mathcal{N} (0, WW^T+\sigma^2 I )</math></center> Here <math>W \in \mathbb{R}^{p \times q}</math> with <math>q < p-1</math> imposes a reduced-rank structure on the covariance. The log-likelihood of the centered dataset <math>Y</math> in <math>\mathbb{R}^{n \times p}</math>, with n data points and p features, <center><math> ln p(Y) = \sum_{i=1}^n ln \mathcal{N} (y_{i,:}|0, WW^T+\sigma^2 I)</math></center> can be maximized<ref name="tipping1999"><br />
Tipping, M. E. and Bishop, C. M. Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 61(3):611-622, 1999.<br />
</ref> with the result <center><math>W_{ML} = U_qL_qR^T</math></center><br />
<br />
where <math>U_q</math> contains the <math>q</math> principal eigenvectors of the sample covariance <math>\tilde S = n^{-1}Y^TY</math>, and <math>L_q</math> is a diagonal matrix with elements <math>l_{i,i} = (\lambda_i - \sigma^2)^{1/2}</math>, where <math>\lambda_i</math> is the i-th eigenvalue of the sample covariance and <math>\sigma^2</math> is the noise variance. This maximum-likelihood solution is rotation invariant: <math>R</math> is an arbitrary rotation matrix. The matrix <math>W</math> spans the principal subspace of the data, and the model is known as probabilistic PCA.<br />
<br />
The underlying assumption of the model is that the data set can be represented as <math>Y = XW^T+E</math>, where <math>X</math> in <math>\mathbb{R}^{n \times q}</math> is a matrix of <math>q</math>-dimensional latent variables and <math>E</math> is a matrix of noise variables <math> e_{ij} \sim \mathcal{N} (0,\sigma^2)</math>. The marginal log-likelihood above is obtained by placing an independent isotropic prior on the elements of <math>X</math>, <math>x_{ij} \sim \mathcal{N}(0,1)</math>, and marginalizing <math>X</math>.<br />
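<br />
The primal maximum-likelihood solution can be computed directly from an eigendecomposition of the sample covariance. The sketch below is illustrative only: the function name is ours, the rotation is fixed to <math>R = I</math>, and the noise variance <math>\sigma^2</math> is assumed known (in full PPCA it is itself estimated from the discarded eigenvalues).<br />
<pre>
import numpy as np

def ppca_ml_loadings(Y, q, sigma2):
    """Maximum-likelihood PPCA loadings W_ML = U_q L_q R^T (with R = I).

    Y: centered data, shape (n, p); q: latent dimension; sigma2: noise variance.
    """
    n, p = Y.shape
    S = Y.T @ Y / n                          # sample covariance, p x p
    lam, U = np.linalg.eigh(S)               # eigenvalues in ascending order
    idx = np.argsort(lam)[::-1][:q]          # the q principal eigenvectors
    L_q = np.sqrt(np.maximum(lam[idx] - sigma2, 0.0))
    return U[:, idx] * L_q                   # scale each retained eigenvector
</pre>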
<br />
It is shown<ref name="lawerence2005"><br />
Lawrence, N. D. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research, 6:1783-1816, 2005.<br />
</ref> that the PCA solution is also obtained for log-likelihoods of the form <center><math> ln p(Y) = \sum_{j=1}^p ln \mathcal{N} (y_{:,j}|0, XX^T+\sigma^2 I)</math></center> This is recovered when we marginalize the loadings <math>W</math>, instead of the latent variables <math>X</math>, under a Gaussian isotropic prior. This is the dual form of probabilistic PCA, analogous to the dual form of PCA. As in the primal form, the maximum-likelihood solution solves for the latent coordinates, <math>X_{ML} = U'_q L_qR^T</math>, instead of the principal subspace basis. Here, <math>U'_q</math> are the first <math>q</math> principal eigenvectors of the inner-product matrix <math>p^{-1}YY^T</math>, with <math>L_q</math> defined as before. Both the primal and dual scenarios involve maximizing likelihoods with a similar covariance structure, namely a low-rank term plus a spherical term. This paper considers the more general form <center><math>XX^T+\Sigma,</math></center> where <math>\Sigma</math> is a general positive-definite matrix. The log-likelihood of this general problem is given by <center><math> ln p(Y) = \sum_{j=1}^p ln \mathcal{N} (y_{:,j}|0, XX^T+\Sigma) \qquad (*)</math></center> where <math>\Sigma = ZZ^T+ \sigma^2I</math>.<br />
<br />
The underlying model of this log-likelihood function can be considered as a linear mixed-effects model with two factors and noise, <center><math>Y = XW^T+ZV^T+E,</math></center> where <math> Z</math> is a matrix of known covariates with loadings <math>V</math>, and <math>X</math> is a matrix of latent variables with loadings <math>W</math>.<br />
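<br />
To make this generative structure concrete, a minimal sampling sketch is given below; the matrix sizes and the unit-Gaussian draws for the loadings and covariates are illustrative assumptions, not choices made in the paper.<br />
<pre>
import numpy as np

rng = np.random.default_rng(0)
n, p, q, r, sigma = 100, 10, 2, 3, 0.1     # illustrative sizes, r known covariates

X = rng.standard_normal((n, q))            # latent variables (marginalized in (*))
W = rng.standard_normal((p, q))            # latent-variable loadings
Z = rng.standard_normal((n, r))            # known covariates
V = rng.standard_normal((p, r))            # covariate loadings
E = sigma * rng.standard_normal((n, p))    # isotropic noise

Y = X @ W.T + Z @ V.T + E                  # mixed-effects data matrix, n x p
</pre>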
<br />
<br />
The question this paper attempts to answer is: given <math>\Sigma</math>, how can we solve for <math>X</math> (respectively <math>W</math>), and for what choices of <math>\Sigma</math> can we formulate useful new machine learning algorithms? The paper shows that the maximum-likelihood solution for <math>X</math> is obtained from a generalized eigenvalue problem (GEP) involving the sample covariance matrix. Hence the low-rank term <math>XX^T</math> can be optimized for general <math>\Sigma</math>. The authors call this approach residual component analysis (RCA).<br />
<br />
==Maximum likelihood RCA==<br />
<br />
<br />
'''Theorem:''' The maximum likelihood estimate of the parameter <math>X</math> in the likelihood model of equation (*), for positive-definite (hence invertible) <math>\Sigma</math>, is<br />
<math>X_{ML} = \Sigma S(D-I)^{1/2}</math>, where the columns of <math>S</math> are the generalised eigenvectors of the generalized eigenvalue problem <math>\frac{1}{p}YY^TS=\Sigma SD</math> associated with the <math>q</math> largest generalised eigenvalues, and <math>D</math> is diagonal with those eigenvalues.<br />
<br />
The RCA log-likelihood is given by<center> <math>L(X,\Sigma) = -(p/2)ln |K| - (1/2) tr(YY^TK^{-1})-(np/2)ln(2\pi)</math></center> <br />
<br />
where <math>K=XX^T+\Sigma</math>. Since <math>\Sigma</math> is positive-definite, we can consider its eigen-decomposition <math>\Sigma = U\Lambda U^T</math>. Projecting the covariance onto this eigen-basis and scaling by the eigenvalues gives <math>\hat K = \Lambda^{-1/2}U^TXX^TU\Lambda^{-1/2} +I</math>.<br />
<br />
The RCA log-likelihood can then be re-written as <center> <math>L(\hat X) = -(p/2)ln(|\hat K| |\Lambda|) - (1/2) tr(\hat Y \hat Y^T \hat K^{-1})-(np/2)ln(2\pi),</math></center> where <math>\hat Y = \Lambda^{-1/2}U^TY</math> and <math>\hat X = \Lambda^{-1/2}U^TX</math>.<br />
<br />
We then solve for the stationary points of this likelihood with respect to <math>\hat X</math>, relate the stationary point of <math>\hat X</math> back to a solution for <math>X</math>, and express the resulting eigenvalue problem in terms of <math>YY^T</math>. Eventually <math>X</math> is recovered, up to an arbitrary rotation <math>R</math> (which for convenience is normally set to <math>I</math>), via the first <math>q</math> generalised eigenvectors of <math>(1/p)YY^T</math>,<br />
<br />
<center> <math>X = TL = \Sigma SL=\Sigma S(D-I)^{1/2}</math></center><br />
<br />
Aside from the presence of <math>\Sigma</math>, we note a subtle difference from the PPCA solution for <math>W</math>: whereas PPCA explicitly subtracts the noise variance from the <math>q</math> retained principal eigenvalues, RCA implicitly incorporates any noise terms into <math>\Sigma</math> and standardises them when it projects the total covariance onto the eigen-basis of <math>\Sigma</math>. Hence, in the theorem, the retained generalised eigenvalues are reduced by one (the <math>(D-I)^{1/2}</math> term) rather than by <math>\sigma^2</math>. For <math>\Sigma=\sigma^2 I</math> the two solutions are identical.<br />
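<br />
As a concrete illustration, the maximum-likelihood low-rank term can be computed with a standard symmetric-definite generalized eigensolver. The sketch below follows the dual form above; the function name, the use of <code>scipy.linalg.eigh</code> for the GEP, and the clipping of generalised eigenvalues below one are assumptions of this illustration rather than prescriptions from the paper.<br />
<pre>
import numpy as np
from scipy.linalg import eigh

def rca_ml(Y, Sigma, q):
    """ML low-rank term of RCA (dual form): X_ML = Sigma S (D - I)^{1/2}.

    Y: centred data, shape (n, p); Sigma: known n x n positive-definite matrix.
    """
    n, p = Y.shape
    A = Y @ Y.T / p                              # (1/p) Y Y^T
    d, S = eigh(A, Sigma)                        # generalised eigenpairs, ascending
    idx = np.argsort(d)[::-1][:q]                # q largest generalised eigenvalues
    L = np.sqrt(np.maximum(d[idx] - 1.0, 0.0))   # (D - I)^{1/2}, clipped at zero
    return (Sigma @ S[:, idx]) * L               # n x q maximum-likelihood X
</pre>
Setting <code>Sigma = sigma2 * np.eye(n)</code> recovers the dual PPCA solution discussed earlier.<br />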
<br />
The posterior density for the RCA probabilistic model (primal case, with <math>\mu_y=0</math>) is <center><math> x|y \sim \mathcal N (\Sigma_{x|y} W_{ML}^T \Sigma^{-1}y,\ \Sigma_{x|y}),</math></center><br />
<br />
where <math> \Sigma_{x|y} = (W^T_{ML} \Sigma^{-1}W_{ML} + I)^{-1}</math>.<br />
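<br />
A short sketch of this posterior computation (the function name is ours; <math>y</math> is a single centred observation):<br />
<pre>
import numpy as np

def rca_posterior(y, W_ml, Sigma):
    """Posterior mean and covariance of x given y for the primal RCA model (mu_y = 0)."""
    Si_W = np.linalg.solve(Sigma, W_ml)                  # Sigma^{-1} W_ML
    cov_x = np.linalg.inv(W_ml.T @ Si_W + np.eye(W_ml.shape[1]))
    mean_x = cov_x @ Si_W.T @ y                          # Sigma_{x|y} W_ML^T Sigma^{-1} y
    return mean_x, cov_x
</pre>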
<br />
==Low Rank Plus Sparse Inverse ==<br />
[[File:HSSexperiment3.png]] <br />
<br />
The figure above shows the graphical model optimised by the EM/RCA hybrid algorithm:<center><math>y|x,z \sim \mathcal N( Wx+z, \sigma^2I),</math></center><br />
<br />
<center><math>x \sim \mathcal N(0,I), z \sim \mathcal N(0, \Lambda^{-1})</math></center><br />
<br />
where the sparse precision matrix <math>\Lambda</math> is given a Laplace prior density, <center><math>p(\Lambda) \propto exp(-\lambda \| \Lambda \|_1).</math></center><br />
<br />
Marginalizing <math>X</math> yields<br />
<center><math>log p (Y, \Lambda) = \sum^{n}_{i=1} log \mathcal N(y_{i,:}|0, WW^T+\Sigma_{GL}) + log p(\Lambda) \ge \int q(Z)log \frac{p(Y,Z,\Lambda)}{q(Z)}dZ,</math></center><br />
where <math>q(Z)</math> is a variational distribution and <math> \Sigma_{GL} = \Lambda^{-1}+ \sigma^2I</math>. We wish to optimise this likelihood over <math>\Lambda</math> for some known <math>W</math>. Direct optimisation is intractable, so instead the lower bound is optimised in an EM fashion.<br />
<br />
'''E-step''': Replacing <math>q(Z)</math> with the posterior <math>p(Z|Y,\Lambda ')</math> for a current estimate <math>\Lambda '</math>, amounts to the E-step for updating the posterior density of <math>\,z_n|y_n</math> with <center><math>\,cov[z|y] = ((WW^T+ \sigma^2I)^{-1} +\Lambda ')^{-1}</math></center><br />
<br />
<center><math>\langle z_n|y_n \rangle = cov[z_n|y_n](WW^T+ \sigma^2I)^{-1} y_n</math></center><br />
<center><math>\langle z_nz_n^T \rangle = cov[z|y] + \langle z_n \rangle \langle z_n \rangle^T</math></center><br />
<br />
'''M-step''': With the posterior moments from the E-step held fixed, the only free parameter in the expected complete-data log-likelihood <math>Q = E_{Z|Y} (log p(Z, \Lambda))</math> is <math>\Lambda</math>, so we solve <math> argmax_{\Lambda} Q</math>. This amounts to a standard GLASSO optimization in which the empirical covariance matrix is replaced by the expected second moments <math>\langle z_nz_n^T \rangle</math> averaged over the data.<br />
<br />
'''RCA-step''': After one iteration of EM, we update <math>W</math> via RCA, based on the newly estimated <math>\Lambda</math>,<br />
<br />
<center><math>\,W= \Sigma S(D-I)^{1/2}</math></center> for the generalized eigenvalue problem <center><math>\frac{1}{n}Y^TYS = \Sigma SD, </math> where <math>\, \Sigma = \Lambda^{-1}+\sigma^2I.</math></center><br />
<br />
We iterate these three steps until the lower bound converges.<br />
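<br />
The sketch below puts the three steps together. It is only an outline under simplifying assumptions: scikit-learn's <code>graphical_lasso</code> is used as one possible GLASSO solver, the initialisation and iteration count are arbitrary, and monitoring of the lower bound is omitted.<br />
<pre>
import numpy as np
from scipy.linalg import eigh, inv
from sklearn.covariance import graphical_lasso   # one possible GLASSO solver

def em_rca(Y, q, sigma2, alpha, n_iter=50):
    """Outline of the EM/RCA hybrid for the low-rank plus sparse-inverse model."""
    n, p = Y.shape
    rng = np.random.default_rng(0)
    W = 0.1 * rng.standard_normal((p, q))          # initial loadings (arbitrary)
    Lambda = np.eye(p)                             # initial sparse precision
    for _ in range(n_iter):
        # E-step: posterior moments of z_n given y_n under the current Lambda
        C_inv = inv(W @ W.T + sigma2 * np.eye(p))  # (W W^T + sigma^2 I)^{-1}
        cov_z = inv(C_inv + Lambda)                # cov[z | y]
        Ez = Y @ C_inv @ cov_z                     # rows are <z_n | y_n> (symmetry used)
        S_z = cov_z + Ez.T @ Ez / n                # average of <z_n z_n^T>
        # M-step: GLASSO on the expected second moments gives the new precision
        _, Lambda = graphical_lasso(S_z, alpha=alpha)
        # RCA-step: update W from the GEP (1/n) Y^T Y S = Sigma S D
        Sigma = inv(Lambda) + sigma2 * np.eye(p)
        d, S = eigh(Y.T @ Y / n, Sigma)            # generalised eigenpairs, ascending
        idx = np.argsort(d)[::-1][:q]
        W = (Sigma @ S[:, idx]) * np.sqrt(np.maximum(d[idx] - 1.0, 0.0))
    return W, Lambda
</pre>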
<br />
==Experiments ==<br />
The paper describes three experiments with the EM/RCA hybrid and one purely with RCA, which analyses the residual left after a Gaussian process (GP) fit to a time series.<br />
<br />
The four experiments are as follows:<br />
<br />
'''Experiment (1)''', simulation: the authors consider an artificial dataset sampled from the generative model to illustrate the effects of confounders on the estimation of the sparse-inverse covariance.<br />
<br />
[[Image:HSSexperiment1.png]]<br />
<br />
Figure (a) shows the precision-recall curve for GLASSO and EM/RCA. The EM/RCA curve shows significantly better performance than GLASSO on the confounded data, while the dashed line shows the performance of GLASSO on similarly generated data without the confounding effects <math>(W = 0)</math>. We note that EM/RCA performs better on confounded data than GLASSO on non-confounded data, because of the lower signal-to-noise ratio in the non-confounded data.<br />
<br />
'''Experiment (2)''', reconstruction of a biological network: the authors applied EM/RCA to the protein-signaling data of <ref name="Sachs2008"><br />
Sachs, K., Perez, O., Pe’er, D., Lauffenburger, D. A., and Nolan, G. P. Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308:523-529, April 2005.<br />
</ref>. In Figure (b), EM/RCA performs slightly better than all other methods. Figure 4 shows the reconstructed networks at recall 0.4. We note that EM/RCA is more conservative in calling edges.<br />
<br />
'''Experiment (3)''', reconstruction of human form: the objective is to recover the underlying connectivity of a human figure (a "stickman"), given only the 3-dimensional locations of 31 sensors placed on the figure's body. The EM/RCA method also shows promising results here.<br />
<br />
Figure 5 shows the comparison, in the form of recall/precision curves, between GLASSO and the EM/RCA implementation of a sparse-inverse plus low-rank model. As can be seen, the EM/RCA algorithm outperforms GLASSO. The recovered stickmen of EM/RCA and GLASSO are shown in Figure 6.<br />
<br />
[[Image:GarciaF21.jpg]] [[Image:GarciaF22.jpg]]<br />
<br />
<br />
<br />
'''Experiment (4)''', differences in gene-expression profiles: the authors applied the RCA method to a common challenge in data analysis, namely summarizing the differences between treatment and control samples. Assuming that the two time series are generated by the same underlying function implies that the concatenated vector <math>y^T = (y_1^T\ y_2^T)</math> can be modeled by a Gaussian process (GP) with a temporal covariance function, <math>y \sim N(0, K)</math>, where <math>K \in R^{n\times n}</math> with <math>n=n_1+n_2</math> is structured such that both <math>y_1</math> and <math>y_2</math> are generated from the same function. An RBF kernel is used.<br />
<br />
[[Image:HSSexperiment2.png]]<br />
<br />
For the figure above: (a) the RBF kernel computed on the augmented time-input vectors of the gene-expression data. The kernel is computed across the times <math>(0 : 20 : 240,\ 0, 20, 40, 60, 120, 180, 240)</math>, jointly for control and treatment. (b) ROC curves of RCA and of BATS variants with different noise models. RCA outperforms BATS, in terms of the area under the ROC curve, for all of its noise models.<br />
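<br />
A minimal sketch of constructing this joint covariance is given below; the kernel hyperparameters are placeholders, not the values used in the paper.<br />
<pre>
import numpy as np

def rbf_kernel(t1, t2, lengthscale=40.0, variance=1.0):
    """Squared-exponential (RBF) kernel on scalar time inputs."""
    d = t1[:, None] - t2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

# Augmented time inputs: control on the grid 0:20:240, treatment on the coarser
# grid listed in the figure caption.
t_control = np.arange(0, 241, 20)
t_treatment = np.array([0, 20, 40, 60, 120, 180, 240])
t = np.concatenate([t_control, t_treatment])

# Joint (n1 + n2) x (n1 + n2) covariance: since the kernel depends only on time,
# control and treatment points co-vary as if generated by a single function.
K = rbf_kernel(t, t) + 1e-6 * np.eye(len(t))   # jitter for numerical stability
</pre>
This <code>K</code> plays the role of <math>\Sigma</math> in RCA, so that the recovered low-rank term captures structure not explained by the shared GP.<br />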
<br />
==Discussion==<br />
RCA is an algorithm for describing a low-dimensional representation of the residuals of a data set, given partial explanation by a covariance matrix <math>\Sigma</math>. The low-rank component of the model can be determined through a generalized eigenvalue problem. The paper illustrated how a treatment and a control time series could have their differences highlighted through appropriate selection of <math>\Sigma</math> (in this case an RBF kernel). The paper also introduced an algorithm for fitting a variant of CCA in which the private spaces are explained through low-dimensional latent variables.<br />
<br />
Full-covariance models often run into problems because their parameterization scales with <math>D^2</math>. This technique combines the sparse-inverse-covariance approach (as in GLASSO) with the low-rank approach (as in probabilistic PCA), and was demonstrated to good effect in the motion-capture and protein-network experiments.<br />
<br />
==References==<br />
<references /></div>Lxinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=residual_Component_Analysis:_Generalizing_PCA_for_more_flexible_inference_in_linear-Gaussian_models&diff=22939residual Component Analysis: Generalizing PCA for more flexible inference in linear-Gaussian models2013-08-15T21:42:55Z<p>Lxin: /* Low Rank Plus Sparse Inverse */</p>
<hr />
<div>==Introduction==<br />
Probabilistic principle component analysis (PPCA) decomposes the covariance of a data vector <math> y</math> in <math>\mathbb{R}^p</math>, into a low-rank term and a spherical noise term. <center><math>y \sim \mathcal{N} (0, WW^T+\sigma I )</math></center> <math>W \in \mathbb{R}^{p \times q}</math> such that <math>q < p-1</math> imposes a reduced rank structure on the covariance. The log-likelihood of the centered dataset <math>Y</math> in <math>\mathbb{R}^{n \times p}</math> with n data points and p features<center><math> ln p(Y) = \sum_{j=1}^p ln \mathcal{N} (y_{i,:}|0, WW^T+\sigma^2 I)</math></center> can be maximized<ref name="tipping1999"><br />
Tipping, M. E. and Bishop, C.M. Probabilistic principle component analysis. Journal of the Royal Statistical Society. Series B(Statistical Methodology), 61(3):611-622,1999<br />
</ref> with the result <center><math>W_{ML} = U_qL_qR^T</math></center> <br />
<br />
where <math>U_q</math> are <math>q</math> principle eigenvectors of the sample covariance <math>\tilde S</math>, with <math>\tilde S = n^{-1}Y^TY</math> and <math>L^q</math> is a diagonal matrix with elements <math>l_{i,i} = (\lambda_i - \sigma^2)^{1/2}</math>, where <math>\lambda_i</math> is the ith eigenvalue of the sample covariance and <math>\sigma^2</math> is the noise variance. This max-likelihood solution is rotation invariant; <math>R</math> is an arbitrary rotation matrix. The matrix <math>W</math> spans the principle subspace of the data and the model is known as probabilistic PCA.<br />
<br />
The underlying assumption of the model is that the data set can be represented by <math>Y = XW^T+E</math> where <math>X</math> in <math>\mathbb{R}^{n \times p}</math> is a matrix of <math>q</math> dimensional latent variables and <math>E</math> is a matrix of noise variables <math> e_{ij} \sim \mathcal{N} (0,\sigma^2)</math>. The marginal log-likelihood above is obtained by placing an isotropic prior independently on the elements of <math>X</math> with <math>x_{ij} \sim \mathcal{N}(0,1)</math>.<br />
<br />
It is shown<ref name="lawerence2005"><br />
Lawrence N.D. Probabilistic non-linear principle component analysis with Gaussian process latent variable models. Journal of Machine Learning . MIT Press, Cambridge, MA, 2006. <br />
</ref> that the PCA solution is also obtained for log-likelihoods of the form <center><math> ln p(Y) = \sum_{j=1}^p ln \mathcal{N} (y_{:,j}|0, XX^T+\sigma^2 I)</math></center> This is recovered when we marginalize the loadings <math>W</math>, instead of latent variable <math>X</math>, with a Gaussian isotropic prior. This is the dual form of probabilistic PCA. This is analogous to the Dual form of PCA and similarly to the primal form, the max likelihood solution solves for the latent coordinates <math>X_{ML} = U^'_q L_qR^T</math>, instead of the principle subspace basis. Here, <math>U^'_q</math> are the first <math>q</math> principle eigenvectors of the inner product matrix <math>p^{-1}YY^T</math> with <math>Lq</math> define as before. Both primal and dual scenarios involve maximizing likelihoods of a similar covariance structure, namely when the covariance of the Gaussians is given by a low-rank term plus a spherical term. This paper considers a more general form <center><math>XX^T+\Sigma</math></center> Where <math>\Sigma</math> is a general positive definite matrix. The log-likelihood of this general problem is given by <center><math> ln p(Y) = \sum_{j=1}^p ln \mathcal{N} (y_{:,j}|0, XX^T+\Sigma)</math>......(*)</center> where <math>\Sigma = ZZ^T+ \sigma^2I</math>. <br />
<br />
The underlying model of this log-liklihood function can be considered as a linear mixed effect model with two factors and noise, <center><math>Y = XW^T+ZV^T+E</math></center> where <math> Z</math> is a matrix of known covariates and <math>X</math> is a matrix of latent variables.<br />
<br />
<br />
The question this papers attempts to answer is that given <math>\Sigma</math> how can we solve for <math>X</math>( respectively <math>W</math>), and for what values of <math>\Sigma</math> we can formulate useful new algorithm for machine learning? This paper shows that the maximum likelihood solution for <math>X</math> is simply based on generalized eigenvalue problem (GEP) on the sample-covariance matrix. Hence the low-rank term <math>XX^T</math> can be optimized for general <math>\Sigma</math>. The authors call this approach residual component analysis (RCA).<br />
<br />
==Maximum likelihood RCA==<br />
<br />
<br />
'''Theorem:''' The maximum likelihood estimate of the parameter <math>X</math> in the likelihood model in equation (*), for positive-definite and invertible <math>\Sigma</math>, is<br />
<math>X_{ML} = \Sigma S(D-I)^{1/2}</math> where <math>S</math> is the solution to the generalized eigenvalue problem <math>\frac{1}{p}YY^TS=\Sigma SD</math>, with its columns as the generalised eigenvectors and <math>D</math> is diagonal with the corresponding generalized eigenvalues.<br />
<br />
The RCA log-likelihood is given by<center> <math>L(X,\Sigma) = -(p/2)ln |K| - (1/2) tr(YY^TK^{-1})-(np/2)ln(2\pi)</math></center> <br />
<br />
Where <math>K=XX^T+\Sigma</math>. Since <math>\Sigma</math> is positive-deifinite we can consider the eigen-decompostion on <math>\Sigma</math> the calculate the projection of the covariance on to this eigen-basis, scaled by the eigenvalues gives <math>\hat K = \Lambda^{-1/2}U^TXX^TU\Lambda^{-1/2} +I</math>. <br />
<br />
The maximum likelihood of the RCA can be re-written as <center> <math>L(\hat X) = -(p/2)ln(|K| |\Lambda|) - (1/2) tr(\hat Y \hat Y^T \hat K^{-1})-(np/2)ln(2\pi)</math></center><br />
<br />
Then solve the maximum likelihood solution of for <math>\hat X</math>. Relating the stationary point of<math>\hat X</math> to the solution for <math>X</math>and then we proceed by expressing this eigenvalue problem in terms of <math>YY^T</math>. Eventually we can recover X up to an arbitrary rotation (R, which for convenience is normally set to I), via the first q generalised eigenvectors of<math>(1/q)YY^T</math>,<br />
<br />
<center> <math>X = TL = \Sigma SL=\Sigma S(D-I)^{1/2}</math></center><br />
<br />
Aside from <math>\Sigma</math>, we note a subtle difference from the PPCA solution for <math>W</math>: Whereas PPCA explicitly subtracts the noise variance from the <math>q</math> retained principal eigenvalues, RCA implicitly incorporates any noise terms into <math>\Sigma</math> and standardises them when it projects the total covariance onto the eigen-basis of <math>\Sigma</math>. Thus we get a reduction of unity from the retained generalised eigenvalues from the theorem. For <math>\Sigma=I</math> the two solutions are identical.<br />
<br />
The posterior density for the RCA probabilistic model (primal case) and <math>\mu_y=0</math>. <center><math> x|y \sim \mathcal N (\Sigma_{ML} W_{ML}^T \Sigma^{-1}y, \Sigma_{x|y}),</math></center><br />
<br />
where <math> \Sigma_{x|y} = (W^T_{ML} \Sigma^{-1}W_{ML} + I)^{-1}</math>.<br />
<br />
==Low Rank Plus Sparse Inverse ==<br />
[[File:HSSexperiment3.png]] <br />
<br />
The graphical model optimised by the EM/RCA hybrid algorithm.<center><math>y|x,z \sim \mathcal N( Wx+z, \sigma^2I),</math></center><br />
<br />
<center><math>x \sim \mathcal N(0,I), z \sim \mathcal N(0, \Lambda^{-1})</math></center><br />
<br />
where <math>\Lambda</math> is sampled from a Laplace prior density, <center><math>p(\Lambda) \sim exp(-\lambda \| \Lambda \|_1).</math></center><br />
<br />
Marginalizing <math>X</math>, yields <br />
<center><math>log p (Y, \Lambda) = \Sigma^{n}_{i=1} log{\mathcal N(y_{i,:}|0, WW^T+\Sigma_{GL})p(\Lambda)} \ge \int q(Z)log \frac{p(Y,Z,\Lambda)}{q(Z)}dZ</math></center> <br />
Where <math>q(Z)</math> is the variational distribution and <math> \Sigma = \Lambda^{-1}+ \sigma^2I</math>, which we wish to optimise for some known <math>W</math>. This is an intractable problem, so instead we optimized the lower bound in an EM fashion.<br />
<br />
'''E-step''': Replacing <math>q(Z)</math> with the posterior <math>p(Z|Y,\Lambda ')</math> for a current estimate <math>\Lambda '</math>, amounts to teh E-step for updating the posterior density of <math>\,z_n|y_n</math> with <center><math>\,cov[z|y] = ((WW^T+ \sigma^2I)^{-1} +\Sigma ')^{-1}</math></center><br />
<br />
<center><math>\langle z|y \rangle = cov[z|y]((WW^T+ \sigma^2I)^{-1} y_n</math></center><br />
<center><math>\langle z_nz_n^T \rangle = cov[z|y] + \langle z_n \rangle \langle z_n \rangle^T</math></center><br />
<br />
'''M-step''': Then for fixed <math>Z'</math>, the only free parameter in the expected complete data log likelihood <math>Q = E_{Z|Y} (log p(Z', \Lambda))</math> is <math>\Lambda</math>. Therefore, <math> argmax_{\Lambda} Q</math>. This amounts to standard GLASSO optimization with covariance matrix. <br />
<br />
'''RCA-step''': After one interation of EM, we update <math>W</math> via RCA based on the newly estimated <math>\Lambda</math>, <br />
<br />
<center><math>\,W= \Sigma S(D-I)^{1/2}</math></center> for the generalized eigen-value problem. <center><math>\frac{1}{n}Y^TYS = \Sigma SD </math> and <math>\, \Lambda = \Lambda^{-1}+\sigma^2I</math></center><br />
<br />
Iterate until the lower-bound converge.<br />
<br />
==Experiments ==<br />
We describe three experiments with EM/RCA and one purely with RCA analysing the residual left from a Gaussian process (GP) in a time-series. <br />
<br />
The four experiments with EM/RCA are as following:<br />
<br />
'''Experiment (1)''', simulation: the authors consider an artificial dataset sampled from the generative model to illustrate the effects of confounders on the estimation of the sparse-inverse covariance.<br />
<br />
[[Image:HSSexperiment1.png]]<br />
<br />
Figure (a) shows the precision-recall curve for GLASSO and EM/RCA. The EM/RCA curve shows significantly better performance than GLASSO on the confounded data, while the dashed line shows the performance of GLASSO on similarly generated data without the confounding effects <math>(W = 0)</math>. We note that EM/RCA performs better on confounded data than GLASSO on non-confounded data, because of the lower signal-to-noise ratio in the non-confounded data.<br />
<br />
'''Experiment (2)''', reconstruction of a biological network, we applied EM/RCA on the protein-signaling data of <ref name="Sachs2008"><br />
Sachs, K., Perez, O., Pe’er, D., Lauffenburger, D. A., and Nolan, G. P. Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308:523– 529, April 2008.<br />
</ref>. Figure (b), EM/RCA performs slightly better than all other methods. Figure 4 shows the reconstructed networks for recall 0.4. We note that EM/RCA is more conservative in calling edges.<br />
<br />
'''Experiment (3)''', reconstruction of human form, the objective here is to reconstruct the underlying connectivity of a human being, given only the 3 dimensional locations of 31 sensors placed about the figures body. The aim is to construct a model which recovers connectivity between these points. EM/RCA method also showed promising result. <br />
<br />
Figure 5 shows the comparison in the form of recall/precision curves between GLASSO and the EM/RCA implementation of a sparse-inverse plus lowrank model. As can be seen, the EM/RCA algorithm outperforms the GLASSO. The recovered stickmen of EM/RCA and GLASSO are shown in figure 6.<br />
<br />
[[Image:GarciaF21.jpg]] [[Image:GarciaF22.jpg]]<br />
<br />
<br />
<br />
'''Experiment (4)''', differences in gene-expression profiles, the authors applied the RCA method to address the common challenge in data analysis is to summarize the difference between treatment and control samples. Assuming that both time-series are identical, implies <math>y^T = (y_1^T y_2^T)</math> can be modeled by a Gaussian process (GP) with a temporal covariance function, y is distributed as N(0, K), where <math>K \in R^{n\times n}</math> for <math>n=n_1+n_2</math>is structured such that both y1 and y2 are generated from the same function. A RBF kernel is used.<br />
<br />
[[Image:HSSexperiment2.png]]<br />
<br />
For the figure above, (a)RBF kernel computed on augmented time- input vectors of gene-expression. The kernel is computed across times <math>(0 : 20 : 240, 0, 20, 40, 60, 120, 180, 240)</math>, jointly for control and treatment. (b) shows the ROC curves of RCA and BATS variants with different noise models. We note that RCA outperforms BATS in terms the area under the ROC curve for all of its noise models.<br />
<br />
==Discussion==<br />
RCA is an algorithm for describing a low-dimensional representation of the residuals of a data set, given partial explanation by a covariance matrix <math>\Sigma</math>.The low-rank component of the model can be determined through a generalized eigenvalue problem. The paper illustrated how a treatment and a control time series could have their differences highlighted through appropriate selection of <math>\Sigma</math>(in this case we used an RBF kernel). The paper also introduced an algorithm for fitting a variant of CCA where the private spaces are explained through low dimensional latent variables.<br />
<br />
Full covariance matrix model is often run into problem as their parameterization scales with <math>D^2</math>. This technique combined sparse-inverse covariance (as in GLASSO) with low rank (as in probabilistic PCA) approaches, and have good effect in the experiment. It was demonstrated to good effect in a motion capture and protein network example.<br />
<br />
==References==<br />
<references /></div>Lxinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=residual_Component_Analysis:_Generalizing_PCA_for_more_flexible_inference_in_linear-Gaussian_models&diff=22938residual Component Analysis: Generalizing PCA for more flexible inference in linear-Gaussian models2013-08-15T21:42:30Z<p>Lxin: /* Low Rank Plus Sparse Inverse */</p>
<hr />
<div>==Introduction==<br />
Probabilistic principle component analysis (PPCA) decomposes the covariance of a data vector <math> y</math> in <math>\mathbb{R}^p</math>, into a low-rank term and a spherical noise term. <center><math>y \sim \mathcal{N} (0, WW^T+\sigma I )</math></center> <math>W \in \mathbb{R}^{p \times q}</math> such that <math>q < p-1</math> imposes a reduced rank structure on the covariance. The log-likelihood of the centered dataset <math>Y</math> in <math>\mathbb{R}^{n \times p}</math> with n data points and p features<center><math> ln p(Y) = \sum_{j=1}^p ln \mathcal{N} (y_{i,:}|0, WW^T+\sigma^2 I)</math></center> can be maximized<ref name="tipping1999"><br />
Tipping, M. E. and Bishop, C.M. Probabilistic principle component analysis. Journal of the Royal Statistical Society. Series B(Statistical Methodology), 61(3):611-622,1999<br />
</ref> with the result <center><math>W_{ML} = U_qL_qR^T</math></center> <br />
<br />
where <math>U_q</math> are <math>q</math> principle eigenvectors of the sample covariance <math>\tilde S</math>, with <math>\tilde S = n^{-1}Y^TY</math> and <math>L^q</math> is a diagonal matrix with elements <math>l_{i,i} = (\lambda_i - \sigma^2)^{1/2}</math>, where <math>\lambda_i</math> is the ith eigenvalue of the sample covariance and <math>\sigma^2</math> is the noise variance. This max-likelihood solution is rotation invariant; <math>R</math> is an arbitrary rotation matrix. The matrix <math>W</math> spans the principle subspace of the data and the model is known as probabilistic PCA.<br />
<br />
The underlying assumption of the model is that the data set can be represented by <math>Y = XW^T+E</math> where <math>X</math> in <math>\mathbb{R}^{n \times p}</math> is a matrix of <math>q</math> dimensional latent variables and <math>E</math> is a matrix of noise variables <math> e_{ij} \sim \mathcal{N} (0,\sigma^2)</math>. The marginal log-likelihood above is obtained by placing an isotropic prior independently on the elements of <math>X</math> with <math>x_{ij} \sim \mathcal{N}(0,1)</math>.<br />
<br />
It is shown<ref name="lawerence2005"><br />
Lawrence N.D. Probabilistic non-linear principle component analysis with Gaussian process latent variable models. Journal of Machine Learning . MIT Press, Cambridge, MA, 2006. <br />
</ref> that the PCA solution is also obtained for log-likelihoods of the form <center><math> ln p(Y) = \sum_{j=1}^p ln \mathcal{N} (y_{:,j}|0, XX^T+\sigma^2 I)</math></center> This is recovered when we marginalize the loadings <math>W</math>, instead of latent variable <math>X</math>, with a Gaussian isotropic prior. This is the dual form of probabilistic PCA. This is analogous to the Dual form of PCA and similarly to the primal form, the max likelihood solution solves for the latent coordinates <math>X_{ML} = U^'_q L_qR^T</math>, instead of the principle subspace basis. Here, <math>U^'_q</math> are the first <math>q</math> principle eigenvectors of the inner product matrix <math>p^{-1}YY^T</math> with <math>Lq</math> define as before. Both primal and dual scenarios involve maximizing likelihoods of a similar covariance structure, namely when the covariance of the Gaussians is given by a low-rank term plus a spherical term. This paper considers a more general form <center><math>XX^T+\Sigma</math></center> Where <math>\Sigma</math> is a general positive definite matrix. The log-likelihood of this general problem is given by <center><math> ln p(Y) = \sum_{j=1}^p ln \mathcal{N} (y_{:,j}|0, XX^T+\Sigma)</math>......(*)</center> where <math>\Sigma = ZZ^T+ \sigma^2I</math>. <br />
<br />
The underlying model of this log-liklihood function can be considered as a linear mixed effect model with two factors and noise, <center><math>Y = XW^T+ZV^T+E</math></center> where <math> Z</math> is a matrix of known covariates and <math>X</math> is a matrix of latent variables.<br />
<br />
<br />
The question this papers attempts to answer is that given <math>\Sigma</math> how can we solve for <math>X</math>( respectively <math>W</math>), and for what values of <math>\Sigma</math> we can formulate useful new algorithm for machine learning? This paper shows that the maximum likelihood solution for <math>X</math> is simply based on generalized eigenvalue problem (GEP) on the sample-covariance matrix. Hence the low-rank term <math>XX^T</math> can be optimized for general <math>\Sigma</math>. The authors call this approach residual component analysis (RCA).<br />
<br />
==Maximum likelihood RCA==<br />
<br />
<br />
'''Theorem:''' The maximum likelihood estimate of the parameter <math>X</math> in the likelihood model in equation (*), for positive-definite and invertible <math>\Sigma</math>, is<br />
<math>X_{ML} = \Sigma S(D-I)^{1/2}</math> where <math>S</math> is the solution to the generalized eigenvalue problem <math>\frac{1}{p}YY^TS=\Sigma SD</math>, with its columns as the generalised eigenvectors and <math>D</math> is diagonal with the corresponding generalized eigenvalues.<br />
<br />
The RCA log-likelihood is given by<center> <math>L(X,\Sigma) = -(p/2)ln |K| - (1/2) tr(YY^TK^{-1})-(np/2)ln(2\pi)</math></center> <br />
<br />
Where <math>K=XX^T+\Sigma</math>. Since <math>\Sigma</math> is positive-deifinite we can consider the eigen-decompostion on <math>\Sigma</math> the calculate the projection of the covariance on to this eigen-basis, scaled by the eigenvalues gives <math>\hat K = \Lambda^{-1/2}U^TXX^TU\Lambda^{-1/2} +I</math>. <br />
<br />
The maximum likelihood of the RCA can be re-written as <center> <math>L(\hat X) = -(p/2)ln(|K| |\Lambda|) - (1/2) tr(\hat Y \hat Y^T \hat K^{-1})-(np/2)ln(2\pi)</math></center><br />
<br />
Then solve the maximum likelihood solution of for <math>\hat X</math>. Relating the stationary point of<math>\hat X</math> to the solution for <math>X</math>and then we proceed by expressing this eigenvalue problem in terms of <math>YY^T</math>. Eventually we can recover X up to an arbitrary rotation (R, which for convenience is normally set to I), via the first q generalised eigenvectors of<math>(1/q)YY^T</math>,<br />
<br />
<center> <math>X = TL = \Sigma SL=\Sigma S(D-I)^{1/2}</math></center><br />
<br />
Aside from <math>\Sigma</math>, we note a subtle difference from the PPCA solution for <math>W</math>: Whereas PPCA explicitly subtracts the noise variance from the <math>q</math> retained principal eigenvalues, RCA implicitly incorporates any noise terms into <math>\Sigma</math> and standardises them when it projects the total covariance onto the eigen-basis of <math>\Sigma</math>. Thus we get a reduction of unity from the retained generalised eigenvalues from the theorem. For <math>\Sigma=I</math> the two solutions are identical.<br />
<br />
The posterior density for the RCA probabilistic model (primal case) and <math>\mu_y=0</math>. <center><math> x|y \sim \mathcal N (\Sigma_{ML} W_{ML}^T \Sigma^{-1}y, \Sigma_{x|y}),</math></center><br />
<br />
where <math> \Sigma_{x|y} = (W^T_{ML} \Sigma^{-1}W_{ML} + I)^{-1}</math>.<br />
<br />
==Low Rank Plus Sparse Inverse ==<br />
[[File:HSSexperiment3.png]] <br />
<br />
The graphical model optimised by the EM/RCA hybrid algorithm.<center><math>y|x,z \sim \mathcal N( Wx+z, \sigma^2I),</math></center><br />
<br />
<center><math>x \sim \mathcal N(0,I), z \sim \mathcal N(0, \Lambda^{-1})</math></center><br />
<br />
where <math>\Sigma</math> is sampled from a Laplace prior density, <center><math>p(\Lambda) \sim exp(-\lambda \| \Lambda \|_1).</math></center><br />
<br />
Marginalizing <math>X</math>, yields <br />
<center><math>log p (Y, \Lambda) = \Sigma^{n}_{i=1} log{\mathcal N(y_{i,:}|0, WW^T+\Sigma_{GL})p(\Lambda)} \ge \int q(Z)log \frac{p(Y,Z,\Lambda)}{q(Z)}dZ</math></center> <br />
Where <math>q(Z)</math> is the variational distribution and <math> \Sigma = \Lambda^{-1}+ \sigma^2I</math>, which we wish to optimise for some known <math>W</math>. This is an intractable problem, so instead we optimized the lower bound in an EM fashion.<br />
<br />
'''E-step''': Replacing <math>q(Z)</math> with the posterior <math>p(Z|Y,\Lambda ')</math> for a current estimate <math>\Lambda '</math>, amounts to teh E-step for updating the posterior density of <math>\,z_n|y_n</math> with <center><math>\,cov[z|y] = ((WW^T+ \sigma^2I)^{-1} +\Sigma ')^{-1}</math></center><br />
<br />
<center><math>\langle z|y \rangle = cov[z|y]((WW^T+ \sigma^2I)^{-1} y_n</math></center><br />
<center><math>\langle z_nz_n^T \rangle = cov[z|y] + \langle z_n \rangle \langle z_n \rangle^T</math></center><br />
<br />
'''M-step''': Then for fixed <math>Z'</math>, the only free parameter in the expected complete data log likelihood <math>Q = E_{Z|Y} (log p(Z', \Lambda))</math> is <math>\Lambda</math>. Therefore, <math> argmax_{\Lambda} Q</math>. This amounts to standard GLASSO optimization with covariance matrix. <br />
<br />
'''RCA-step''': After one interation of EM, we update <math>W</math> via RCA based on the newly estimated <math>\Lambda</math>, <br />
<br />
<center><math>\,W= \Sigma S(D-I)^{1/2}</math></center> for the generalized eigen-value problem. <center><math>\frac{1}{n}Y^TYS = \Sigma SD </math> and <math>\, \Lambda = \Lambda^{-1}+\sigma^2I</math></center><br />
<br />
Iterate until the lower-bound converge.<br />
<br />
==Experiments ==<br />
We describe three experiments with EM/RCA and one purely with RCA analysing the residual left from a Gaussian process (GP) in a time-series. <br />
<br />
The four experiments with EM/RCA are as following:<br />
<br />
'''Experiment (1)''', simulation: the authors consider an artificial dataset sampled from the generative model to illustrate the effects of confounders on the estimation of the sparse-inverse covariance.<br />
<br />
[[Image:HSSexperiment1.png]]<br />
<br />
Figure (a) shows the precision-recall curve for GLASSO and EM/RCA. The EM/RCA curve shows significantly better performance than GLASSO on the confounded data, while the dashed line shows the performance of GLASSO on similarly generated data without the confounding effects <math>(W = 0)</math>. We note that EM/RCA performs better on confounded data than GLASSO on non-confounded data, because of the lower signal-to-noise ratio in the non-confounded data.<br />
<br />
'''Experiment (2)''', reconstruction of a biological network, we applied EM/RCA on the protein-signaling data of <ref name="Sachs2008"><br />
Sachs, K., Perez, O., Pe’er, D., Lauffenburger, D. A., and Nolan, G. P. Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308:523– 529, April 2008.<br />
</ref>. Figure (b), EM/RCA performs slightly better than all other methods. Figure 4 shows the reconstructed networks for recall 0.4. We note that EM/RCA is more conservative in calling edges.<br />
<br />
'''Experiment (3)''', reconstruction of human form, the objective here is to reconstruct the underlying connectivity of a human being, given only the 3 dimensional locations of 31 sensors placed about the figures body. The aim is to construct a model which recovers connectivity between these points. EM/RCA method also showed promising result. <br />
<br />
Figure 5 shows the comparison in the form of recall/precision curves between GLASSO and the EM/RCA implementation of a sparse-inverse plus lowrank model. As can be seen, the EM/RCA algorithm outperforms the GLASSO. The recovered stickmen of EM/RCA and GLASSO are shown in figure 6.<br />
<br />
[[Image:GarciaF21.jpg]] [[Image:GarciaF22.jpg]]<br />
<br />
<br />
<br />
'''Experiment (4)''', differences in gene-expression profiles, the authors applied the RCA method to address the common challenge in data analysis is to summarize the difference between treatment and control samples. Assuming that both time-series are identical, implies <math>y^T = (y_1^T y_2^T)</math> can be modeled by a Gaussian process (GP) with a temporal covariance function, y is distributed as N(0, K), where <math>K \in R^{n\times n}</math> for <math>n=n_1+n_2</math>is structured such that both y1 and y2 are generated from the same function. A RBF kernel is used.<br />
<br />
[[Image:HSSexperiment2.png]]<br />
<br />
For the figure above, (a)RBF kernel computed on augmented time- input vectors of gene-expression. The kernel is computed across times <math>(0 : 20 : 240, 0, 20, 40, 60, 120, 180, 240)</math>, jointly for control and treatment. (b) shows the ROC curves of RCA and BATS variants with different noise models. We note that RCA outperforms BATS in terms the area under the ROC curve for all of its noise models.<br />
<br />
==Discussion==<br />
RCA is an algorithm for describing a low-dimensional representation of the residuals of a data set, given partial explanation by a covariance matrix <math>\Sigma</math>.The low-rank component of the model can be determined through a generalized eigenvalue problem. The paper illustrated how a treatment and a control time series could have their differences highlighted through appropriate selection of <math>\Sigma</math>(in this case we used an RBF kernel). The paper also introduced an algorithm for fitting a variant of CCA where the private spaces are explained through low dimensional latent variables.<br />
<br />
Full covariance matrix model is often run into problem as their parameterization scales with <math>D^2</math>. This technique combined sparse-inverse covariance (as in GLASSO) with low rank (as in probabilistic PCA) approaches, and have good effect in the experiment. It was demonstrated to good effect in a motion capture and protein network example.<br />
<br />
==References==<br />
<references /></div>Lxinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=residual_Component_Analysis:_Generalizing_PCA_for_more_flexible_inference_in_linear-Gaussian_models&diff=22937residual Component Analysis: Generalizing PCA for more flexible inference in linear-Gaussian models2013-08-15T21:38:38Z<p>Lxin: /* Maximum likelihood RCA */</p>
<hr />
<div>==Introduction==<br />
Probabilistic principle component analysis (PPCA) decomposes the covariance of a data vector <math> y</math> in <math>\mathbb{R}^p</math>, into a low-rank term and a spherical noise term. <center><math>y \sim \mathcal{N} (0, WW^T+\sigma I )</math></center> <math>W \in \mathbb{R}^{p \times q}</math> such that <math>q < p-1</math> imposes a reduced rank structure on the covariance. The log-likelihood of the centered dataset <math>Y</math> in <math>\mathbb{R}^{n \times p}</math> with n data points and p features<center><math> ln p(Y) = \sum_{j=1}^p ln \mathcal{N} (y_{i,:}|0, WW^T+\sigma^2 I)</math></center> can be maximized<ref name="tipping1999"><br />
Tipping, M. E. and Bishop, C.M. Probabilistic principle component analysis. Journal of the Royal Statistical Society. Series B(Statistical Methodology), 61(3):611-622,1999<br />
</ref> with the result <center><math>W_{ML} = U_qL_qR^T</math></center> <br />
<br />
where <math>U_q</math> are <math>q</math> principle eigenvectors of the sample covariance <math>\tilde S</math>, with <math>\tilde S = n^{-1}Y^TY</math> and <math>L^q</math> is a diagonal matrix with elements <math>l_{i,i} = (\lambda_i - \sigma^2)^{1/2}</math>, where <math>\lambda_i</math> is the ith eigenvalue of the sample covariance and <math>\sigma^2</math> is the noise variance. This max-likelihood solution is rotation invariant; <math>R</math> is an arbitrary rotation matrix. The matrix <math>W</math> spans the principle subspace of the data and the model is known as probabilistic PCA.<br />
<br />
The underlying assumption of the model is that the data set can be represented by <math>Y = XW^T+E</math> where <math>X</math> in <math>\mathbb{R}^{n \times p}</math> is a matrix of <math>q</math> dimensional latent variables and <math>E</math> is a matrix of noise variables <math> e_{ij} \sim \mathcal{N} (0,\sigma^2)</math>. The marginal log-likelihood above is obtained by placing an isotropic prior independently on the elements of <math>X</math> with <math>x_{ij} \sim \mathcal{N}(0,1)</math>.<br />
<br />
It is shown<ref name="lawerence2005"><br />
Lawrence N.D. Probabilistic non-linear principle component analysis with Gaussian process latent variable models. Journal of Machine Learning . MIT Press, Cambridge, MA, 2006. <br />
</ref> that the PCA solution is also obtained for log-likelihoods of the form <center><math> ln p(Y) = \sum_{j=1}^p ln \mathcal{N} (y_{:,j}|0, XX^T+\sigma^2 I)</math></center> This is recovered when we marginalize the loadings <math>W</math>, instead of latent variable <math>X</math>, with a Gaussian isotropic prior. This is the dual form of probabilistic PCA. This is analogous to the Dual form of PCA and similarly to the primal form, the max likelihood solution solves for the latent coordinates <math>X_{ML} = U^'_q L_qR^T</math>, instead of the principle subspace basis. Here, <math>U^'_q</math> are the first <math>q</math> principle eigenvectors of the inner product matrix <math>p^{-1}YY^T</math> with <math>Lq</math> define as before. Both primal and dual scenarios involve maximizing likelihoods of a similar covariance structure, namely when the covariance of the Gaussians is given by a low-rank term plus a spherical term. This paper considers a more general form <center><math>XX^T+\Sigma</math></center> Where <math>\Sigma</math> is a general positive definite matrix. The log-likelihood of this general problem is given by <center><math> ln p(Y) = \sum_{j=1}^p ln \mathcal{N} (y_{:,j}|0, XX^T+\Sigma)</math>......(*)</center> where <math>\Sigma = ZZ^T+ \sigma^2I</math>. <br />
<br />
The underlying model of this log-liklihood function can be considered as a linear mixed effect model with two factors and noise, <center><math>Y = XW^T+ZV^T+E</math></center> where <math> Z</math> is a matrix of known covariates and <math>X</math> is a matrix of latent variables.<br />
<br />
<br />
The question this papers attempts to answer is that given <math>\Sigma</math> how can we solve for <math>X</math>( respectively <math>W</math>), and for what values of <math>\Sigma</math> we can formulate useful new algorithm for machine learning? This paper shows that the maximum likelihood solution for <math>X</math> is simply based on generalized eigenvalue problem (GEP) on the sample-covariance matrix. Hence the low-rank term <math>XX^T</math> can be optimized for general <math>\Sigma</math>. The authors call this approach residual component analysis (RCA).<br />
<br />
==Maximum likelihood RCA==<br />
<br />
<br />
'''Theorem:''' The maximum likelihood estimate of the parameter <math>X</math> in the likelihood model in equation (*), for positive-definite and invertible <math>\Sigma</math>, is<br />
<math>X_{ML} = \Sigma S(D-I)^{1/2}</math> where <math>S</math> is the solution to the generalized eigenvalue problem <math>\frac{1}{p}YY^TS=\Sigma SD</math>, with its columns as the generalised eigenvectors and <math>D</math> is diagonal with the corresponding generalized eigenvalues.<br />
<br />
The RCA log-likelihood is given by<center> <math>L(X,\Sigma) = -(p/2)ln |K| - (1/2) tr(YY^TK^{-1})-(np/2)ln(2\pi)</math></center> <br />
<br />
Where <math>K=XX^T+\Sigma</math>. Since <math>\Sigma</math> is positive-deifinite we can consider the eigen-decompostion on <math>\Sigma</math> the calculate the projection of the covariance on to this eigen-basis, scaled by the eigenvalues gives <math>\hat K = \Lambda^{-1/2}U^TXX^TU\Lambda^{-1/2} +I</math>. <br />
<br />
The maximum likelihood of the RCA can be re-written as <center> <math>L(\hat X) = -(p/2)ln(|K| |\Lambda|) - (1/2) tr(\hat Y \hat Y^T \hat K^{-1})-(np/2)ln(2\pi)</math></center><br />
<br />
Then solve the maximum likelihood solution of for <math>\hat X</math>. Relating the stationary point of<math>\hat X</math> to the solution for <math>X</math>and then we proceed by expressing this eigenvalue problem in terms of <math>YY^T</math>. Eventually we can recover X up to an arbitrary rotation (R, which for convenience is normally set to I), via the first q generalised eigenvectors of<math>(1/q)YY^T</math>,<br />
<br />
<center> <math>X = TL = \Sigma SL=\Sigma S(D-I)^{1/2}</math></center><br />
<br />
Aside from <math>\Sigma</math>, we note a subtle difference from the PPCA solution for <math>W</math>: Whereas PPCA explicitly subtracts the noise variance from the <math>q</math> retained principal eigenvalues, RCA implicitly incorporates any noise terms into <math>\Sigma</math> and standardises them when it projects the total covariance onto the eigen-basis of <math>\Sigma</math>. Thus we get a reduction of unity from the retained generalised eigenvalues from the theorem. For <math>\Sigma=I</math> the two solutions are identical.<br />
<br />
The posterior density for the RCA probabilistic model (primal case) and <math>\mu_y=0</math>. <center><math> x|y \sim \mathcal N (\Sigma_{ML} W_{ML}^T \Sigma^{-1}y, \Sigma_{x|y}),</math></center><br />
<br />
where <math> \Sigma_{x|y} = (W^T_{ML} \Sigma^{-1}W_{ML} + I)^{-1}</math>.<br />
<br />
==Low Rank Plus Sparse Inverse ==<br />
[[File:HSSexperiment3.png]] <br />
<br />
The graphical model optimised by the EM/RCA hybrid algorithm.<center><math>y|x,z \sim \mathcal N( Wx+z, \sigma^2I),</math></center><br />
<br />
<center><math>x \sim \mathcal N(0,I), z \sim \mathcal N(0, \Sigma^{-1})</math></center><br />
<br />
where <math>\Sigma</math> is sampled from a Laplace prior density, <center><math>p(\Lambda) \sim exp(-\lambda \| \Lambda \|_1).</math></center><br />
<br />
Marginalizing <math>X</math>, yields <br />
<center><math>log p (Y, \Lambda) = \Sigma^{n}_{i=1} log{\mathcal N(y_{i,:}|0, WW^T+\Sigma_{GL})p(\Lambda)} \ge \int q(Z)log \frac{p(Y,Z,\Lambda)}{q(Z)}dZ</math></center> <br />
Where <math>q(Z)</math> is the variational distribution and <math> \Sigma = \Lambda^{-1}+ \sigma^2I</math>, which we wish to optimise for some known <math>W</math>. This is an intractable problem, so instead we optimized the lower bound in an EM fashion.<br />
<br />
'''E-step''': Replacing <math>q(Z)</math> with the posterior <math>p(Z|Y,\Lambda ')</math> for a current estimate <math>\Lambda '</math>, amounts to teh E-step for updating the posterior density of <math>\,z_n|y_n</math> with <center><math>\,cov[z|y] = ((WW^T+ \sigma^2I)^{-1} +\Sigma ')^{-1}</math></center><br />
<br />
<center><math>\langle z|y \rangle = cov[z|y]((WW^T+ \sigma^2I)^{-1} y_n</math></center><br />
<center><math>\langle z_nz_n^T \rangle = cov[z|y] + \langle z_n \rangle \langle z_n \rangle^T</math></center><br />
<br />
'''M-step''': Then for fixed <math>Z'</math>, the only free parameter in the expected complete data log likelihood <math>Q = E_{Z|Y} (log p(Z', \Lambda))</math> is <math>\Lambda</math>. Therefore, <math> argmax_{\Lambda} Q</math>. This amounts to standard GLASSO optimization with covariance matrix. <br />
<br />
'''RCA-step''': After one interation of EM, we update <math>W</math> via RCA based on the newly estimated <math>\Lambda</math>, <br />
<br />
<center><math>\,W= \Sigma S(D-I)^{1/2}</math></center> for the generalized eigen-value problem. <center><math>\frac{1}{n}Y^TYS = \Sigma SD </math> and <math>\, \Lambda = \Lambda^{-1}+\sigma^2I</math></center><br />
<br />
Iterate until the lower-bound converge.<br />
<br />
==Experiments ==<br />
We describe three experiments with EM/RCA and one purely with RCA analysing the residual left from a Gaussian process (GP) in a time-series. <br />
<br />
The four experiments with EM/RCA are as following:<br />
<br />
'''Experiment (1)''', simulation: the authors consider an artificial dataset sampled from the generative model to illustrate the effects of confounders on the estimation of the sparse-inverse covariance.<br />
<br />
[[Image:HSSexperiment1.png]]<br />
<br />
Figure (a) shows the precision-recall curve for GLASSO and EM/RCA. The EM/RCA curve shows significantly better performance than GLASSO on the confounded data, while the dashed line shows the performance of GLASSO on similarly generated data without the confounding effects <math>(W = 0)</math>. We note that EM/RCA performs better on confounded data than GLASSO on non-confounded data, because of the lower signal-to-noise ratio in the non-confounded data.<br />
<br />
'''Experiment (2)''', reconstruction of a biological network, we applied EM/RCA on the protein-signaling data of <ref name="Sachs2008"><br />
Sachs, K., Perez, O., Pe’er, D., Lauffenburger, D. A., and Nolan, G. P. Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308:523– 529, April 2008.<br />
</ref>. Figure (b), EM/RCA performs slightly better than all other methods. Figure 4 shows the reconstructed networks for recall 0.4. We note that EM/RCA is more conservative in calling edges.<br />
<br />
'''Experiment (3)''', reconstruction of human form, the objective here is to reconstruct the underlying connectivity of a human being, given only the 3 dimensional locations of 31 sensors placed about the figures body. The aim is to construct a model which recovers connectivity between these points. EM/RCA method also showed promising result. <br />
<br />
Figure 5 shows the comparison in the form of recall/precision curves between GLASSO and the EM/RCA implementation of a sparse-inverse plus lowrank model. As can be seen, the EM/RCA algorithm outperforms the GLASSO. The recovered stickmen of EM/RCA and GLASSO are shown in figure 6.<br />
<br />
[[Image:GarciaF21.jpg]] [[Image:GarciaF22.jpg]]<br />
<br />
<br />
<br />
'''Experiment (4)''', differences in gene-expression profiles, the authors applied the RCA method to address the common challenge in data analysis is to summarize the difference between treatment and control samples. Assuming that both time-series are identical, implies <math>y^T = (y_1^T y_2^T)</math> can be modeled by a Gaussian process (GP) with a temporal covariance function, y is distributed as N(0, K), where <math>K \in R^{n\times n}</math> for <math>n=n_1+n_2</math>is structured such that both y1 and y2 are generated from the same function. A RBF kernel is used.<br />
<br />
[[Image:HSSexperiment2.png]]<br />
<br />
For the figure above, (a)RBF kernel computed on augmented time- input vectors of gene-expression. The kernel is computed across times <math>(0 : 20 : 240, 0, 20, 40, 60, 120, 180, 240)</math>, jointly for control and treatment. (b) shows the ROC curves of RCA and BATS variants with different noise models. We note that RCA outperforms BATS in terms the area under the ROC curve for all of its noise models.<br />
<br />
==Discussion==<br />
RCA is an algorithm for describing a low-dimensional representation of the residuals of a data set, given partial explanation by a covariance matrix <math>\Sigma</math>.The low-rank component of the model can be determined through a generalized eigenvalue problem. The paper illustrated how a treatment and a control time series could have their differences highlighted through appropriate selection of <math>\Sigma</math>(in this case we used an RBF kernel). The paper also introduced an algorithm for fitting a variant of CCA where the private spaces are explained through low dimensional latent variables.<br />
<br />
Full covariance matrix model is often run into problem as their parameterization scales with <math>D^2</math>. This technique combined sparse-inverse covariance (as in GLASSO) with low rank (as in probabilistic PCA) approaches, and have good effect in the experiment. It was demonstrated to good effect in a motion capture and protein network example.<br />
<br />
==References==<br />
<references /></div>Lxinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=measuring_statistical_dependence_with_Hilbert-Schmidt_norms&diff=22933measuring statistical dependence with Hilbert-Schmidt norms2013-08-15T18:11:12Z<p>Lxin: /* Estimator of HSIC */</p>
<hr />
<div>This is another very popular kernel-based approach for detecting dependence, called HSIC (the Hilbert-Schmidt Independence Criterion). It is based on the eigenspectrum of covariance operators in reproducing kernel Hilbert spaces (RKHSs). The approach is simple and requires no user-defined regularisation. Exponential convergence of the empirical estimate is guaranteed, so convergence is fast. <br />
<br />
== Background ==<br />
Before the proposal of HSIC, there were already a few kernel-based methods for detecting independence. Bach and Jordan [3] proposed a regularised correlation operator derived from the covariance and cross-covariance operators, and its largest singular value was used as a statistic to test independence. Gretton et al. used the largest singular value of the cross-covariance operator, which resulted in the constrained covariance (COCO). HSIC is an extension of COCO: it uses the entire spectrum of the cross-covariance operator to determine when all of its singular values are zero, rather than looking only at the largest singular value. HSIC resolves the question regarding the link between quadratic dependence measures and kernel dependence measures based on RKHSs, and generalizes the measure to metric spaces. <br />
<br />
== Cross-Covariance Operators ==<br />
'''Hilbert-Schmidt Norm'''. Denote by <math>\mathit{C}:\mathcal{G}\to\mathcal{F}</math> a linear operator. Provided the sum converges, the HS norm of <math>\mathit{C}</math> is defined as<br />
<br />
<math>||\mathit{C}||^2_{HS}:=\sum_{i,j}<\mathit{C}v_i,u_j>_\mathcal{F}^2</math><br />
<br />
Where <math>v_i,u_j</math> are orthonormal bases of <math>\mathcal{G}</math> and <math>\mathcal{F}</math> respectively.<br />
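<br />
As a quick sanity check of this definition (an illustration added here, not part of the original paper): in finite dimensions a linear operator is just a matrix, and its HS norm reduces to the Frobenius norm.<br />
<pre>
import numpy as np

# Summing (u_j' C v_i)^2 over orthonormal bases v_i of G and u_j of F
# gives the sum of the squared entries of C, i.e. the squared Frobenius norm.
rng = np.random.default_rng(0)
C = rng.normal(size=(4, 3))        # operator from G = R^3 into F = R^4
V, U = np.eye(3), np.eye(4)        # standard orthonormal bases of G and F
hs2 = sum((U[:, j] @ C @ V[:, i]) ** 2 for i in range(3) for j in range(4))
assert np.isclose(hs2, np.linalg.norm(C, 'fro') ** 2)
</pre>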
<br />
'''Hilbert-Schmidt Operator'''. A Hilbert-Schmidt operator is a linear operator for which the HS norm above exists. The Hilbert-Schmidt inner product of two such operators <math>\mathit{C}</math> and <math>\mathit{D}</math> is defined as<br />
<br />
<math>\langle\mathit{C},\mathit{D}\rangle_{HS}:=\sum_{i,j}\langle\mathit{C}v_i,u_j\rangle_\mathcal{F}\langle\mathit{D}v_i,u_j\rangle_\mathcal{F}</math><br />
<br />
'''Tensor Product'''. Let <math>f\in \mathcal{F}</math> and <math>g\in \mathcal{G}</math>. The tensor product operator <math>f\otimes g:\mathcal{G}\to \mathcal{F}</math> is defined as<br />
<br />
<math>(f\otimes g)h:=f<g,h>_\mathcal{G}</math> for all <math>h\in \mathcal{G}</math><br />
<br />
'''Cross-Covariance Operator''' associated with the joint measure <math>p_{x,y}</math> on <math>(\mathcal{X}\times\mathcal{Y},\Gamma\times\Lambda)</math> is a linear operator <math>C_{xy}:\mathcal{G}\to \mathcal{F}</math> defined as <br />
<br />
<math> C_{xy}:=E_{x,y}[(\theta (x)-\mu_x)\otimes (\psi (y)-\mu_y)]=E_{x,y}[\theta (x)\otimes \psi (y)]-\mu_x\otimes\mu_y</math><br />
<br />
<br />
== Hilbert-Schmidt Independence Criterion ==<br />
Given separable RKHSs <math>\mathcal{F},\mathcal{G}</math> and a joint measure <math>p_{xy}</math> over <math>(\mathcal{X}\times\mathcal{Y},\Gamma\times\Lambda)</math>, HSIC is defined as the squared HS-norm of the associated cross-covariance operator <math>C_{xy}</math>:<br />
<br />
<math>HSIC(p_{xy},\mathcal{F},\mathcal{G}):=||C_{xy}||_{HS}^2</math><br />
<br />
According to Gretton et al., the largest singular value of the cross-covariance operator, <math>||C_{xy}||_S</math>, is zero if and only if x and y are independent. Since <math>||C_{xy}||_S=0</math> if and only if <math>||C_{xy}||_{HS}=0</math>, it follows that <math>||C_{xy}||_{HS}=0</math> if and only if x and y are independent. Therefore, HSIC can be used as an independence criterion. <br />
<br />
<br />
== Estimator of HSIC ==<br />
Let <math>Z:=\{(x_1,y_1),\dots,(x_m,y_m)\}\subseteq \mathcal{X}\times\mathcal{Y}</math> be a series of m independent observations drawn from <math>p_{xy}</math>. An empirical estimator of HSIC, written <math>HSIC(Z,\mathcal{F},\mathcal{G})</math>, is given by<br />
<br />
<math>HSIC(Z,\mathcal{F},\mathcal{G}):=(m-1)^{-2}\,\mathrm{tr}(KHLH)</math><br />
<br />
where <math>H,K,L\in \mathbb{R}^{m\times m}</math>, with <math>K_{ij}:=k(x_i,x_j)</math>, <math>L_{ij}:=l(y_i,y_j)</math> and <math>H_{ij}:=\delta_{ij}-m^{-1}</math>.<br />
It can be proved that the bias of this empirical estimator is <math>\mathit{O}(m^{-1})</math>. It can further be shown that, with high probability, the deviation between <math>HSIC(Z,\mathcal{F},\mathcal{G})</math> and its expectation is small.<br />
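<br />
A minimal sketch of this biased empirical estimator, assuming Gaussian kernels with user-chosen bandwidths (the kernel choice and bandwidths are illustrative assumptions, not prescribed by the paper):<br />
<pre>
import numpy as np

def hsic(X, Y, sigma_x=1.0, sigma_y=1.0):
    """Biased empirical HSIC, (m-1)^{-2} tr(K H L H), with Gaussian kernels.
    X and Y are m x d arrays of paired observations."""
    m = X.shape[0]

    def gram(Z, sigma):
        sq = np.sum(Z ** 2, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T   # pairwise squared distances
        return np.exp(-d2 / (2.0 * sigma ** 2))

    K = gram(X, sigma_x)
    L = gram(Y, sigma_y)
    H = np.eye(m) - np.ones((m, m)) / m                  # centering matrix
    return np.trace(K @ H @ L @ H) / (m - 1) ** 2
</pre>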
<br />
== Independence Tests ==<br />
The table below shows demixing of n randomly chosen i.i.d. samples of length m, where n varies from 2 to 16. The Gaussian kernel results are denoted g, and the Laplace kernel results l. The column Rep. gives the number of runs over which the average performance was measured. Note that some algorithm names are truncated: Fica is Fast ICA, IMAX is Infomax, RAD is RADICAL, CFIC is CFICA, CO is COCO, and HS is HSIC. Performance is measured using the Amari divergence (smaller is better).<br />
<br />
[[File:hsic_table1.png]]<br />
<br />
In this experiment, HSIC with a Gaussian kernel performs on par with the best alternatives in the final four experiments, and HSIC with a Laplace kernel gives joint best performance in six of the seven experiments. On the other hand, RADICAL and the KGV perform better than HSIC in the m = 250 case. <br />
<br />
[[File:Hsic_graph1.png]]<br />
<br />
Left: Effect of outliers on the performance of the ICA algorithms. Each point represents an average Amari divergence over 100 independent experiments (smaller is better). The number of corrupted observations in both signals is given on the horizontal axis. Right: Performance of the KCC and KGV as a function of κ for two sources of size m = 1000, where 25 outliers were added to each source following the mixing procedure.<br />
<br />
HSIC is much more robust to outliers than the other methods. It requires no regularisation, whereas the KGV and KCC are sensitive to the regulariser scale.<br />
<br />
== Summary ==<br />
By using the HS norm of the cross-covariance operator, HSIC is simple to define and requires no regularisation or tuning beyond kernel selection. It is robust to noise and efficient compared to other methods. Its performance meets or exceeds that of the best alternative on all data sets except the m=250 case.<br />
<br />
== References ==<br />
[1] Gretton, Arthur, et al. "Measuring statistical dependence with Hilbert-Schmidt norms." Algorithmic learning theory. Springer Berlin Heidelberg, 2005.<br />
<br />
[2] Fukumizu, Kenji, Francis R. Bach, and Michael I. Jordan. "Kernel dimension reduction in regression." The Annals of Statistics 37.4 (2009): 1871-1905.<br />
<br />
[3] Bach, Francis R., and Michael I. Jordan. "Kernel independent component analysis." The Journal of Machine Learning Research 3 (2003): 1-48.<br />
<br />
[4] Baker, Charles R. "Joint measures and cross-covariance operators." Transactions of the American Mathematical Society 186 (1973): 273-289.</div>Lxinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=rOBPCA:_A_New_Approach_to_Robust_Principal_Component_Analysis&diff=22931rOBPCA: A New Approach to Robust Principal Component Analysis2013-08-15T16:27:41Z<p>Lxin: /* Examples */</p>
<hr />
<div>=Introduction=<br />
<br />
Principal component analysis (PCA) is a useful tool in statistical learning which tries to preserve the variability of the data using a small number of principal components. In the classical method, the principal components are chosen as the eigenvectors corresponding to the largest eigenvalues of the covariance matrix. Since the classical estimate of the covariance matrix is very sensitive to the presence of outliers, it is not surprising that the principal components are easily attracted toward outlying points and no longer correctly reflect the variation of the regular data points.<br />
<br />
To overcome this drawback, two types of modification have been proposed. The first is to simply replace the covariance matrix estimator in classical PCA by a robust estimator. Related work includes Maronna <ref>Maronna, R. A. Robust M-Estimators of Multivariate Location and Scatter. The Annals of Statistics, 4:51-67, 1976. </ref>, Campbell <ref>Campbell, N. A. Robust Procedures in Multivariate Analysis I: Robust Covariance Estimation. Applied Statistics, 29:231-237, 1980. </ref> and Croux and Haesbroeck <ref>Croux, C. and Haesbroeck, G. Principal Components Analysis based on Robust Estimators of the Covariance or Correlation matrix: Influence Functions and Efficiencies. Biometrika, 87:603-618, 2000. </ref>. However, these methods only work nicely when the data are not high-dimensional, and the computational cost of these robust estimators becomes a serious issue as the dimension increases (they can only handle up to about 100 dimensions).<br />
<br />
The second approach is to use projection pursuit (PP) techniques (see Li and Chen <ref>Li, G., and Chen, Z. Projection-Pursuit Approach to Robust Dispersion Matrices and Principal Components: Primary Theory and Monte Carlo. Journal of the American Statistical Association, 80:759-766, 1985. </ref>, Croux and Ruiz-Gazen <ref>Croux, C., and Ruiz-Gazen, A. A Fast Algorithm for Robust Principal Components Based on Projection Pursuit. COMPSTAT 1996, Proceedings in Computational Statistics, ed. A. Prat, Heidelberg: Physica-Verlag, 211-217, 1996. </ref>). PP obtains the robust principal components by maximizing a robust measure of spread.<br />
<br />
The authors propose a new approach called '''ROBPCA''', which combines the ideas of PP and robust scatter matrix estimation. ROBPCA can be computed efficiently and is able to detect exact-fit situations. It also provides a diagnostic plot that flags the outliers. <br />
<br />
=ROBPCA=<br />
<br />
ROBPCA roughly consists of three steps. First, the data are transformed into a subspace whose dimension is at most <math>n-1</math>. Second, a preliminary covariance matrix is constructed and used to select a <math>k_{0}</math>-dimensional subspace that fits the data well. Finally, the data are projected onto the selected subspace, where their location and scatter matrix are robustly estimated, yielding the final scores in the <math>k</math>-dimensional subspace.<br />
<br />
<br />
'''Notations''':<br />
<br />
<math>\mathbf{X}_{n,p}</math>: The observed data, <math>n</math> objects and <math>p</math> variables.<br />
<br />
<math>\widehat{\mu}_{0}^{\prime}</math>: mean vector of <math>\mathbf{X}_{n,p}</math>.<br />
<br />
<math>k</math>: the dimension of low-dimensional subspace into which the data are projected.<br />
<br />
<math>r_{0}</math>: Rank of <math>\mathbf{X}_{n,p}-1_{n}\widehat{\mu}_{0}^{\prime}</math>.<br />
<br />
<math>\alpha</math>: tuning parameter that represents the robustness of the procedure.<br />
<br />
<math>t_{MCD}</math> and <math>s_{MCD}</math>: MCD location and scale estimator<ref>Rousseeuw, P. J. Least Median of Squares Regression. Journal of the American Statistical Association, 79:871-880, 1984</ref><br />
<br />
==Detailed ROBPCA algorithm==<br />
<br />
'''Step 1'''<br />
<br />
ROBPCA starts by finding the affine subspace spanned by the <math>n</math> data points (as proposed by Hubert et al. <ref name="HR">Hubert, M., Rousseeuw, P. J., and Verboven, S. A Fast Method for Robust Principal Components With Applications to Chemometrics. Chemometrics and Intelligent Laboratory Systems, 60:101-111, 2002. </ref>). This is done by performing the SVD:<br />
<br />
<center><math>\,\mathbf{X}_{n,p}-1_{n}\widehat{\mu}_{0}^{\prime}=U_{n,r_{0}}D_{r_{0},r_{0}}V_{r_{0},p}^{\prime}</math></center><br />
<br />
Without losing any information, we can now work in the subspace spanned by the <math>r_{0}</math> columns of <math>V</math>. Thus, <math>\,\mathbf{Z}_{n,r_{0}}=UD</math> becomes the new data matrix.<br />
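<br />
A minimal NumPy sketch of this reduction step is given below; the function and variable names are illustrative only and do not come from a reference implementation of ROBPCA.<br />
<br />
<pre>
import numpy as np

def reduce_to_data_subspace(X):
    """Step 1 sketch: center X (n x p) and re-express it in the at most
    (n-1)-dimensional subspace spanned by the centered data points."""
    mu0 = X.mean(axis=0)                          # \hat{mu}_0
    Xc = X - mu0                                  # X - 1_n \hat{mu}_0'
    U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
    r0 = int(np.sum(d > 1e-12 * max(d[0], 1.0)))  # numerical rank r_0
    Z = U[:, :r0] * d[:r0]                        # new data matrix Z = U D
    return Z, Vt[:r0].T, mu0                      # Z (n x r0), V (p x r0), center
</pre>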
<br />
----<br />
<br />
'''Step 2'''<br />
<br />
The second step is to find a subset of <math>h<n</math> least outlying data points, and use their covariance matrix to obtain a subspace of dimension <math>k_{0}</math>. The value of <math>h</math> is chosen as<br />
<br />
<center><math>h=\max \left\{ \alpha n, (n+k_{max}+1)/2 \right\}</math></center><br />
where <math>k_{max}</math> represents the maximal number of components that will be computed.<br />
<br />
Then the subset of least outlying data points is found as the following:<br />
<br />
1. For each data point <math>\mathbf{x}_{i}</math>, the '''orthogonally invariant outlyingness''' is computed by maximizing over directions <math>\mathbf{v}</math>:<br />
<center><math>outl_{O}(\mathbf{x}_{i})=\max_{\mathbf{v}} \frac{\left| \mathbf{x}_{i}^{\prime}\mathbf{v}-t_{MCD}(\mathbf{x}_{j}^{\prime}\mathbf{v}) \right|}{s_{MCD}(\mathbf{x}_{j}^{\prime}\mathbf{v})}</math></center><br />
For a direction <math>\mathbf{v}</math> such that <math>s_{MCD}(\mathbf{x}_{j}^{\prime}\mathbf{v})=0</math>, we have found a hyperplane orthogonal to <math>\mathbf{v}</math> that contains <math>h</math> observations, so the dimension can be reduced by one.<br />
<br />
This search is repeated until we end up with a data set in some lower-dimensional space and a set <math>H_{0}</math> indexing the <math>h</math> data points with smallest outlyingness; a rough sketch of this computation is given below, after point 3.<br />
<br />
2. Compute the empirical mean <math>\widehat{\mu}_{1}</math> and covariance matrix <math>S_{0}</math> of <math>h</math> points in <math>H_{0}</math>. Perform the spectral decomposition of <math>S_{0}</math>.<br />
<br />
3. Project the data points on the subspace spanned by the first <math>k_{0}</math> eigenvectors of <math>S_{0}</math>, and get the new dataset <math>\mathbf{X}_{n,k_{0}}^{\star}</math><br />
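<br />
The following Python sketch illustrates the outlyingness computation from point 1 above. It is a simplification: directions are sampled from pairs of data points rather than chosen exhaustively, the univariate MCD is implemented naively, and the usual consistency factor for the MCD scale is omitted.<br />
<br />
<pre>
import numpy as np

def unimcd(y, h):
    """Univariate MCD sketch: mean/std of the h-subset of y with smallest
    variance (contiguous in the sorted sample), without consistency factors."""
    ys = np.sort(y)
    n = len(ys)
    best = None
    for i in range(n - h + 1):
        sub = ys[i:i + h]
        v = sub.var()
        if best is None or v < best[0]:
            best = (v, sub.mean())
    return best[1], np.sqrt(best[0])

def outlyingness(Z, h, n_dirs=250, rng=np.random.default_rng(0)):
    """Approximate outl_O by maximising over random directions through
    pairs of data points (the exact maximum over all v is impractical)."""
    n, _ = Z.shape
    out = np.zeros(n)
    for _ in range(n_dirs):
        i, j = rng.choice(n, size=2, replace=False)
        v = Z[i] - Z[j]
        norm = np.linalg.norm(v)
        if norm < 1e-12:
            continue
        v /= norm
        proj = Z @ v
        t, s = unimcd(proj, h)
        if s > 1e-12:                    # s = 0 would signal an exact fit
            out = np.maximum(out, np.abs(proj - t) / s)
    return out                           # the h smallest values index H_0
</pre>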
<br />
----<br />
<br />
'''Step 3'''<br />
<br />
The mean and covariance matrix of <math>\mathbf{X}_{n,k_{0}}^{\star}</math> are robustly estimated by the FAST-MCD algorithm<ref name="HR" />; during the iterations, the dimensionality can be reduced further whenever the covariance matrix is found to be singular.<br />
<br />
FAST-MCD is repeated until we obtain the final dataset <math>\mathbf{X}_{n,k}</math>, whose rows lie in <math>\mathbb{R}^{k}</math>, and the scores <math>\mathbf{T}_{n,k}</math>:<br />
<center><math>\mathbf{T}_{n,k}=(\mathbf{X}_{n,k}-1_{n}\widehat{\mu}_{k}^{\prime})\mathbf{P}</math></center><br />
<br />
Finally, <math>\mathbf{P}</math> is transformed back into <math>\mathbb{R}^{p}</math> to obtain the robust principal components <math>\mathbf{P}_{p,k}</math> such that <br />
<center><math>\mathbf{T}_{n,k}=(\mathbf{X}_{n,p}-1_{n}\widehat{\mu}^{\prime})\mathbf{P}_{p,k}</math></center><br />
<br />
Moreover, a robust scatter matrix <math>\mathbf{S}</math> of rank k is also generated by<br />
<center><math>\mathbf{S}=\mathbf{P}_{p,k}\mathbf{L}_{k,k}\mathbf{P}_{p,k}^{\prime}</math></center><br />
where <math>\mathbf{L}_{k,k}</math> is the diagonal matrix of eigenvalues <math>l_{1},\cdots,l_{k}</math>.<br />
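<br />
A sketch of how the final scores, loadings and scatter matrix could be assembled is shown below. It assumes that <code>mcd_mean</code> and <code>mcd_cov</code> come from some FAST-MCD routine (not implemented here) and that <code>V</code> is the accumulated orthonormal basis from steps 1 and 2; both names are illustrative assumptions, not notation from the paper.<br />
<br />
<pre>
import numpy as np

def finalize_robust_pca(X_red, V, mu0, mcd_mean, mcd_cov, k):
    """Step 3 sketch: turn a robust mean/covariance of the reduced data into
    scores T_{n,k}, loadings P_{p,k}, eigenvalues l_1..l_k and the rank-k
    robust scatter matrix S."""
    evals, evecs = np.linalg.eigh(mcd_cov)
    order = np.argsort(evals)[::-1][:k]
    L = evals[order]                     # l_1 >= ... >= l_k
    P_red = evecs[:, order]              # eigenvectors in the reduced space
    T = (X_red - mcd_mean) @ P_red       # scores T_{n,k}
    P = V @ P_red                        # loadings transformed back to R^p
    S = P @ np.diag(L) @ P.T             # robust scatter matrix of rank k
    center = mu0 + V @ mcd_mean          # robust center in R^p
    return T, P, S, L, center
</pre>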
<br />
----<br />
<br />
'''Remarks'''<br />
<br />
1. Step 1 is especially useful when the number of variables is larger than the sample size (<math>p>n</math>).<br />
<br />
2. In step 2, the choice of <math>\alpha</math> reflects the trade-off between efficiency and robustness, i.e. the higher<br />
the <math>\alpha</math>, the more efficient the estimates will be for uncontaminated data, and the lower the <math>\alpha</math>, the more robust the estimator will be for contaminated samples.<br />
<br />
3. Unlike some other robust PCA methods, ROBPCA shares a very nice property with classical PCA: it is equivariant under location shifts and orthogonal transformations.<br />
<br />
==Diagnostic==<br />
<br />
ROBPCA can also be used to flag the outliers in the sample. Roughly, the points in a dataset can be classified into four types:<br />
<br />
1. ''regular''<br />
<br />
2. ''good leverage'': far from the regular points, but lying close to the true subspace, such as points 1 and 4 in figure 1.<br />
<br />
3. ''bad leverage'': far from the regular points, and also having a large orthogonal distance to the true subspace, such as points 2 and 3 in figure 1.<br />
<br />
4. ''orthogonal outliers'': having a large orthogonal distance to the true subspace, but close to the regular points when projected onto the true subspace, such as point 5 in figure 1.<br />
<br />
[[File:GarciaF11.jpg]]<br />
<br />
<br />
A ''diagnostic plot'' can be constructed to identify the type of each point as follows:<br />
<br />
1. On the horizontal axis, the ''robust score distance'' <math>SD_{i}</math> of each observation is plotted:<br />
<center><math>SD_{i}=\sqrt{\sum_{j=1}^{k}\frac{t_{ij}^{2}}{l_{j}}}</math></center><br />
where the scores <math>t_{ij}</math> and <math>l_{j}</math> are obtained from step 3.<br />
<br />
2. On the vertical axis, the ''orthogonal distance'' <math>OD_{i}</math> of each observation to the PCA subspace is plotted:<br />
<center><math>OD_{i}=\left\| \mathbf{x}_{i}-\widehat{\mu}-\mathbf{P}_{p,k}\mathbf{t}_{i}^{\prime} \right\|</math></center><br />
<br />
One can then draw two cutoff lines (one for the horizontal and one for the vertical axis) that divide the diagnostic plot into four zones. Points falling into the lower-left zone are ''regular'' points, those in the lower-right are ''good leverage'' points, those in the upper-left are ''orthogonal outliers'', and those in the upper-right are ''bad leverage'' points.<br />
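<br />
A short sketch of the two distances (and a possible chi-squared cutoff for the score distance) is given below; the orthogonal-distance cutoff used in the paper is more involved and is not reproduced here.<br />
<br />
<pre>
import numpy as np
from scipy.stats import chi2

def diagnostic_distances(X, center, P, L, T, alpha=0.975):
    """Robust score distances SD_i and orthogonal distances OD_i for the
    diagnostic plot, plus a chi-squared cutoff for SD."""
    SD = np.sqrt(np.sum(T**2 / L, axis=1))        # sqrt(sum_j t_ij^2 / l_j)
    resid = (X - center) - T @ P.T                # x_i - mu - P t_i'
    OD = np.linalg.norm(resid, axis=1)
    sd_cutoff = np.sqrt(chi2.ppf(alpha, df=P.shape[1]))
    return SD, OD, sd_cutoff
</pre>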
<br />
=Example and Simulations=<br />
<br />
The performance of ROBPCA and the diagnostic plot is illustrated on real data examples and in simulation studies. The comparison is carried out between ROBPCA and four other types of PCA: classical PCA (CPCA), RAPCA<ref name="HR" />, spherical PCA (SPHER) and ellipsoidal PCA (ELL)<ref>Locantore, N., Marron, J. S., Simpson, D. G., Tripoli, N., Zhang, J. T., and Cohen, K. L. Robust Principal Component Analysis for Functional Data. Test, 8:1-73, 1999</ref>, where the last three methods are also designed to be robust for high-dimensional data.<br />
<br />
==Examples==<br />
<br />
'''Glass data'''<br />
<br />
The ''Glass dataset'' consists of 180 glass samples (<math>n=180</math>), with 750 variables (<math>p=750</math>). <br />
<br />
ROBPCA is performed on this dataset with <math>h=126=0.7n</math> and subspace dimension <math>k=3</math>. The diagnostic plot is shown below. Clearly, ROBPCA distinguishes a small group of bad leverage points which the three other PCA methods fail to recognize. Moreover, the next figure shows that ROBPCA identifies the bad leverage points correctly.<br />
<br />
[[File:GarciaF12.jpg]]<br />
<br />
[[File:GarciaF13.jpg]]<br />
<br />
<br />
'''Car data'''<br />
<br />
The car data contains 111 observations with <math>p=11</math> characteristics measured for each car. The first 2 principal components are chosen since they account for 94% of the total variance (under ROBPCA). The following figures show the diagnostic plots of ROBPCA and CPCA. Although the same set of outliers is detected, the group of bad leverage points identified by ROBPCA appears as good leverage points under CPCA.<br />
<br />
[[File:ROBPCA1.png|800px]]<br />
<br />
The scores are plotted in the figures below together with the 97.5% tolerance ellipse. Data points falling outside the ellipse are the good and bad leverage points. The ellipse of CPCA is highly inflated toward the outliers 25, 30, 32, 34 and 36, so the resulting eigenvectors do not lie in the directions of highest variability of the remaining points, and the second eigenvalue of CPCA is also inflated by the outliers. In contrast, the tolerance ellipse of ROBPCA remains robust to these outliers.<br />
<br />
[[File:ROBPCA2.png|850px]]<br />
<br />
==Simulations==<br />
<br />
In the simulation study, we generate 1000 samples of size n from the contamination model<br />
<center><math>(1-\epsilon)F_{p}(0,\Sigma)+\epsilon F_{p}(\tilde{\mu},\tilde{\Sigma})</math></center><br />
where <math>F_{p}</math> is a <math>p</math>-variate normal or elliptical distribution (a small sampling sketch for the normal case is given after the settings below).<br />
<br />
Different choices of <math>n,p,\epsilon,\Sigma,\tilde{\mu},\tilde{\Sigma}</math> are tried. The following tables and figures report some typical situations:<br />
<br />
1. <math>\,n=100,p=4,\Sigma=diag(8,4,2,1),k=3</math><br />
<br />
(1a) <math>\,\epsilon=0</math>: no contamination<br />
<br />
(1b) <math>\,\epsilon</math>: 0.1 or 0.2<br />
<br />
<math>\tilde{\mu}=(0,0,0,f_{1})^{\prime},\tilde{\Sigma}=\Sigma/f_{2}</math><br />
<br />
<math>f_{1}=6,8,10,\cdots,20</math><br />
<br />
<math>\,f_{2}=1,15</math><br />
<br />
2. <math>n=50,p=100,\Sigma=diag(17,13.5,8,3,1,0.095,\cdots,0.001),k=5</math><br />
<br />
(2a) <math>\,\epsilon=0</math>: no contamination<br />
<br />
(2b) <math>\,\epsilon</math>: 0.1 or 0.2<br />
<br />
<math>\tilde{\mu}=(0,0,0,0,0,f_{1})^{\prime},\tilde{\Sigma}=\Sigma/f_{2}</math><br />
<br />
<math>f_{1}=6,8,10,\cdots,20</math><br />
<br />
<math>\,f_{2}=1,15</math><br />
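<br />
As a small illustration of the sampling scheme for the normal case of the contamination model above, the snippet below draws one contaminated sample for setting (1b); the replication counts and random seeds of the actual study are of course not reproduced.<br />
<br />
<pre>
import numpy as np

def contaminated_sample(n, Sigma, eps, mu_out, Sigma_out, rng):
    """One draw from (1 - eps) N_p(0, Sigma) + eps N_p(mu_out, Sigma_out)."""
    p = Sigma.shape[0]
    n_out = int(round(eps * n))
    clean = rng.multivariate_normal(np.zeros(p), Sigma, size=n - n_out)
    outliers = rng.multivariate_normal(mu_out, Sigma_out, size=n_out)
    return np.vstack([clean, outliers])

# Setting (1b) with f1 = 10 and f2 = 15:
rng = np.random.default_rng(0)
Sigma = np.diag([8.0, 4.0, 2.0, 1.0])
X = contaminated_sample(100, Sigma, 0.2,
                        np.array([0.0, 0.0, 0.0, 10.0]), Sigma / 15, rng)
</pre>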
<br />
For each simulation setting, the results of four methods are summarized as following:<br />
<br />
1. For each method, consider the maximal angle between the true subspace and the estimated subspace<ref>Krzanowski, W. J. Between-Groups Comparison of Principal Components. Journal of the American Statistical Association, 74:703-707, 1979</ref>:<br />
<center><math>maxsub=\frac{2}{\pi}\arccos(\sqrt{\lambda_{k}})</math></center><br />
This is reported in '''table 1'''<br />
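<br />
This measure is essentially the largest principal angle between the two <math>k</math>-dimensional subspaces, rescaled to [0, 1]; a small sketch, assuming both bases have orthonormal columns, is:<br />
<br />
<pre>
import numpy as np

def maxsub(P_true, P_hat):
    """Largest principal angle between two k-dimensional subspaces,
    rescaled by 2/pi; P_true, P_hat are p x k with orthonormal columns."""
    M = P_true.T @ P_hat
    lam_k = np.linalg.eigvalsh(M @ M.T).min()     # cos^2 of the largest angle
    return (2.0 / np.pi) * np.arccos(np.sqrt(np.clip(lam_k, 0.0, 1.0)))
</pre>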
<br />
[[File:GarciaF14.jpg]]<br />
<br />
2. Consider the proportion of variability that is preserved by the estimated subspace.<br />
This is reported in '''table 2'''<br />
<br />
[[File:GarciaF15.jpg]]<br />
<br />
3. Consider the mean squared error (MSE) for the k largest eigenvalues<br />
<center><math>MSE(\widehat{\lambda}_{j})=\frac{1}{1000}\sum^{1000}_{l=1}(\widehat{\lambda}_{j}-\lambda_{j})^{2}</math></center><br />
The results for different settings are shown in the following figures.<br />
<br />
[[File:GarciaF16.jpg]]<br />
<br />
[[File:GarciaF17.jpg]]<br />
<br />
[[File:GarciaF18.jpg]]<br />
<br />
[[File:GarciaF19.jpg]]<br />
<br />
<br />
The last issue worth mentioning is the computational cost. ROBPCA is slightly more computationally expensive than the three other methods compared above, but its cost is still acceptable. The following figure shows the mean CPU time in seconds over 100 runs for varying low-dimensional normal data.<br />
<br />
[[File:GarciaF20.jpg]]<br />
<br />
=Reference=<br />
<references /></div>Lxinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=large-Scale_Supervised_Sparse_Principal_Component_Analysis&diff=22898large-Scale Supervised Sparse Principal Component Analysis2013-08-14T15:58:49Z<p>Lxin: /* Numerical examples */</p>
<hr />
<div>= Introduction =<br />
<br />
Sparse PCA is a variant of classical PCA that assumes sparsity in the feature space. It has several advantages: the resulting components are easier to interpret, and it remains usable for very high-dimensional data. The main issue with sparse PCA is that it is computationally expensive. Many algorithms have been proposed to solve the sparse PCA problem; the authors introduce a fast block coordinate ascent algorithm with much better computational complexity.<br />
<br />
'''1 Drawbacks of Existing techniques'''<br />
<br />
Existing techniques include ad-hoc methods (e.g. factor rotation techniques, simple thresholding), greedy algorithms, SCoTLASS, the regularized SVD method, SPCA, and the generalized power method. These methods are based on non-convex optimization and do not guarantee a global optimum.<br />
<br />
A semi-definite relaxation method called DSPCA guarantees convergence to the global optimum and performs better than the above algorithms; however, it is computationally expensive. <br />
<br />
'''2 Contribution of this paper'''<br />
<br />
This paper solves DSPCA in a computationally cheaper way, which makes it practical for large-scale data sets. It applies a block coordinate ascent algorithm with computational complexity <math>O(\hat{n}^3)</math>, where <math>\hat{n}</math> is the intrinsic dimension of the data. Since <math>\hat{n}</math> can be much smaller than the dimension <math>n</math> of the data, the algorithm is computationally cheap.<br />
<br />
=Primal problem =<br />
<br />
The sparse PCA problem can be formulated as <math>max_x \ x^T \Sigma x - \lambda \| x \|_0 : \| x \|_2=1</math>.<br />
<br />
This is equivalent to <math>max_z \ Tr(\Sigma Z) - \lambda \sqrt{\| Z \|_0} : Z \succeq 0, Tr Z=1, Rank(Z)=1</math>.<br />
<br />
Replacing the <math>\sqrt{\| Z \|_0}</math> with <math>\| Z \|_1</math> and dropping the rank constraint gives a relaxation of the original non-convex problem:<br />
<br />
<math>\phi = max_z Tr (\Sigma Z) - \lambda \| Z \|_1 : Z \succeq 0</math>, <math>Tr(Z)=1 \qquad (1)</math> .<br />
<br />
Fortunately, this relaxation replaces the original non-convex problem with a convex approximation.<br />
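<br />
For small instances, relaxation (1) can be handed to a generic SDP solver; the sketch below uses the cvxpy package (an assumption of mine, not the paper's implementation) and is only meant to make the problem concrete. The whole point of the paper is to avoid generic SDP solvers on large instances.<br />
<br />
<pre>
import numpy as np
import cvxpy as cp

def sparse_pca_relaxation(Sigma, lam):
    """Solve max Tr(Sigma Z) - lam * ||Z||_1  s.t.  Z PSD, Tr(Z) = 1."""
    n = Sigma.shape[0]
    Z = cp.Variable((n, n), PSD=True)
    objective = cp.Maximize(cp.trace(Sigma @ Z) - lam * cp.sum(cp.abs(Z)))
    cp.Problem(objective, [cp.trace(Z) == 1]).solve()
    return Z.value   # the leading sparse direction is the top eigenvector of Z
</pre>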
<br />
Here is an important theorem used by this paper:<br />
<br />
Theorem (2.1): Let <math>\Sigma=A^T A</math>, where <math>A=(a_1,a_2,\ldots,a_n) \in {\mathbb R}^{m \times n}</math>. Then <math>\psi = max_{\| \xi \|_2=1} \sum_{i=1}^{n} (({a_i}^T \xi)^2 - \lambda)_+</math>. An optimal non-zero pattern corresponds to the indices <math>i</math> with <math>({a_i}^T \xi)^2 > \lambda</math> at optimum.<br />
<br />
An important observation is that the ''i''-th feature is absent at optimum if <math>(a_i^T\xi)^2\leq \lambda</math> for every <math>\xi,\Vert \xi \Vert_2=1</math>. Hence, the feature ''i'' with <math>\Sigma_{ii}=a_i^Ta_i<\lambda</math> can be safely removed.<br />
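<br />
This observation gives a simple screening rule; a small sketch (with illustrative names of my own) is:<br />
<br />
<pre>
import numpy as np

def safe_feature_elimination(A, lam):
    """Drop every column a_i of A with a_i' a_i <= lam; by the observation
    above, such features cannot appear in the optimal sparse component."""
    col_norms_sq = np.einsum('mi,mi->i', A, A)    # diagonal of Sigma = A^T A
    keep = np.flatnonzero(col_norms_sq > lam)
    return A[:, keep], keep
</pre>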
<br />
=Block Coordinate Ascent Algorithm =<br />
A row-by-row algorithm has been developed for problems of the form <math>min_X \ f(X)-\beta \ log(det X): \ L \leq X \leq U, X \succ 0</math>.<br />
<br />
Problem (1) can be written as <math>{\frac 1 2} {\phi}^2 = max_X \ Tr \Sigma X - \lambda \| X \|_1 - \frac 1 2 (Tr X)^2: X \succeq 0 \qquad (2)</math> .<br />
<br />
In order to apply the row-by-row algorithm, we need to add one more term <math>\beta \ log(det X)</math> to (2), where <math>\beta>0</math> is a penalty parameter.<br />
<br />
That is to say, we address the problem <math>\ max_X \ Tr \Sigma X - \lambda \| X \|_1 - \frac 1 2 (Tr X)^2 + \beta \ log(det X): X \succeq 0 \qquad (3)</math><br />
<br />
By matrix partitioning, we obtain the sub-problem:<br />
<br />
<math>\phi = max_{x,y} \ 2(y^T s- \lambda \| y \|_1) +(\sigma - \lambda)x - {\frac 1 2}(t+x)^2 + \beta \ log(x-y^T Y^{\dagger} y ):y \in R(Y) \qquad (4)</math>. <br />
<br />
Taking the dual of (4), the sub-problem simplifies to<br />
<br />
<math> \phi^{\prime} = min_{u,z} {\frac 1 {\beta z}} u^T Yu - \beta (log z) + {\frac 1 2} (c+ \beta z)^2 : z>0, \| u-s \|_\infty \leq \lambda </math><br />
<br />
Since <math> \beta </math> is very small and we want to avoid large values of <math> z </math>, we change variables to <math>r=\beta z</math>; the optimization problem then becomes<br />
<br />
<math> \phi^{\prime} - \beta (log \beta) = min_{u,r} {\frac 1 r} u^T Yu - \beta (log r) + {\frac 1 2} (c+r)^2 : r>0, \| u-s \|_\infty \leq \lambda \qquad (5)</math><br />
<br />
We can solve the sub-problem (5) by first solving the box-constrained QP <br />
<br />
<math>R^2 := min_u u^T Yu : \| u - s \|_\infty \leq \lambda</math> <br />
<br />
and then set <math>r</math> by solving <br />
<br />
<math> min_{r>0} {\frac {R^2} r} - \beta (log r) + {\frac 1 2} (c+r)^2 </math><br />
<br />
Once this sub-problem is solved, we can recover the primal variables <math>y,x</math> by setting <math> y= {\frac 1 r} Y u</math>, and for the diagonal element <math>x</math> we have <math> x=c+r=\sigma - \lambda -t+r </math>.<br />
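<br />
A rough sketch of the two pieces of this recipe, a projected-gradient solver for the box-constrained QP and the one-dimensional problem in <math>r</math>, is given below; the paper's actual implementation may well solve the QP differently.<br />
<br />
<pre>
import numpy as np
from scipy.optimize import minimize_scalar

def solve_box_qp(Y, s, lam, n_iter=500):
    """R^2 = min_u u' Y u  subject to  ||u - s||_inf <= lam  (Y PSD),
    solved here by simple projected gradient descent."""
    u = s.copy()
    step = 1.0 / max(np.linalg.eigvalsh(Y).max(), 1e-12)
    for _ in range(n_iter):
        u = u - step * (Y @ u)                   # gradient step (gradient is 2 Y u)
        u = np.clip(u, s - lam, s + lam)         # project onto the box
    return u, float(u @ Y @ u)

def solve_r(R2, c, beta, r_max=1e6):
    """min_{r > 0}  R^2 / r - beta * log(r) + 0.5 * (c + r)^2."""
    f = lambda r: R2 / r - beta * np.log(r) + 0.5 * (c + r) ** 2
    return minimize_scalar(f, bounds=(1e-12, r_max), method='bounded').x
</pre>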
<br />
Here is the algorithm:<br />
<br />
<br />
[[File:algorithm.jpg]]<br />
<br />
<br />
'''Convergence and complexity'''<br />
<br />
1. The algorithm is guaranteed to converge to the global optimizer.<br />
<br />
2. The complexity of the algorithm is <math>O(K \hat{n}^3)</math>, where <math>K</math> is the number of sweeps through the columns (fixed, typically <math>K=5</math>) and <math>\hat{n}</math> is the intrinsic dimension of the data points.<br />
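<br />
Purely to illustrate where the <math>K</math> and <math>\hat{n}^3</math> factors come from, here is a schematic of the outer loop; <code>solve_subproblem</code> is a hypothetical placeholder for the QP-plus-scalar step sketched earlier, not a function from the paper.<br />
<br />
<pre>
import numpy as np

def block_coordinate_ascent(Sigma, lam, beta=1e-4, n_sweeps=5):
    """Schematic only: K sweeps, each updating one row/column of X via the
    sub-problem; the cost is O(K * n^3) on the (reduced) n x n matrix."""
    n = Sigma.shape[0]
    X = np.eye(n) / n                            # feasible start: Tr(X) = 1, X > 0
    for _ in range(n_sweeps):
        for j in range(n):
            idx = np.r_[0:j, j + 1:n]
            Y = X[np.ix_(idx, idx)]              # block held fixed in this update
            y, x_jj = solve_subproblem(Y, Sigma, lam, beta, j)   # hypothetical helper
            X[idx, j] = y
            X[j, idx] = y
            X[j, j] = x_jj
    return X
</pre>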
<br />
=Numerical examples=<br />
The algorithm is applied to two large text data sets. The NYTimes news articles data set contains 300,000 articles and a dictionary of 102,660 unique words (1 GB file size), and the PubMed data set has 8,200,000 abstracts with 141,043 unique words (7.8 GB file size). Both are too large for classical PCA to handle. <br />
<br />
The authors set the target cardinality for each principal component to 5. They claim that it takes the algorithm only about 20 seconds to search for a suitable range of <math>\lambda</math> for this target cardinality. In the end, the block coordinate ascent algorithm works on a covariance matrix of order at most n=500, instead of 102,660, for the NYTimes data and of order at most n=1000, instead of 141,043, for the PubMed data. <br />
<br />
The top 5 sparse components for the two data sets are shown in the following tables. The authors claim that the sparse principal components still unambiguously identify and perfectly correspond to the topics used by ''The New York Times'' on its website.<br />
<br />
[[File:SPCA1.png|770px|center]]<br />
<br />
[[File:SPCA2.png|800px|center]]</div>Lxinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=large-Scale_Supervised_Sparse_Principal_Component_Analysis&diff=22896large-Scale Supervised Sparse Principal Component Analysis2013-08-14T15:57:39Z<p>Lxin: /* Numerical examples */</p>
<hr />
<div>= Introduction =<br />
<br />
The sparse PCA is a variant of the classical PCA, which assumes sparsity in the feature space. It has several advantages such as easy to interpret, and works for really high-dimensional data. The main issue about sparse PCA is that it is computationally expensive. Many algorithms have been proposed to solve the sparse PCA problem, and the authors introduced a fast block coordinate ascent algorithm with much better computational complexity.<br />
<br />
'''1 Drawbacks of Existing techniques'''<br />
<br />
Existing techniques include ad-hoc methods(e.g. factor rotation techniques, simple thresholding), greedy algorithms, SCoTLASS, the regularized SVD method, SPCA, the generalized power method. These methods are based on non-convex optimization and they don't guarantee global optimum.<br />
<br />
A semi-definite relaxation method called DSPCA can guarantee global convergence and has better performance than above algorithms, however, it is computationally expensive. <br />
<br />
'''2 Contribution of this paper'''<br />
<br />
This paper solves DSPCA in a computationally easier way, and hence it is a good solution for large scale data sets. This paper applies a block coordinate ascent algorithm with computational complexity <math>O(\hat{n^3})</math>, where <math>\hat{n}</math> is the intrinsic dimension of the data. Since <math>\hat{n}</math> could be very small compared to the dimension <math>n</math> of the data, this algorithm is computationally easy.<br />
<br />
=Primal problem =<br />
<br />
The sparse PCA problem can be formulated as <math>max_x \ x^T \Sigma x - \lambda \| x \|_0 : \| x \|_2=1</math>.<br />
<br />
This is equivalent to <math>max_z \ Tr(\Sigma Z) - \lambda \sqrt{\| Z \|_0} : Z \succeq 0, Tr Z=1, Rank(Z)=1</math>.<br />
<br />
Replacing the <math>\sqrt{\| Z \|_0}</math> with <math>\| Z \|_1</math> and dropping the rank constraint gives a relaxation of the original non-convex problem:<br />
<br />
<math>\phi = max_z Tr (\Sigma Z) - \lambda \| Z \|_1 : Z \succeq 0</math>, <math>Tr(Z)=1 \qquad (1)</math> .<br />
<br />
Fortunately, this relaxation approximates the original non-convex problem to a convex problem.<br />
<br />
Here is an important theorem used by this paper:<br />
<br />
Theorem(2.1) Let <math>\Sigma=A^T A</math> where <math>A=(a_1,a_2,......,a_n) \in {\mathbb R}^{m \times n}</math>, we have <math>\psi = max_{\| \xi \|_2=1}</math> <math>\sum_{i=1}^{n} (({a_i}^T \xi)^2 - \lambda)_+</math>. An optimal non-zero pattern corresponds to the indices <math>i</math> with <math>\lambda < (({a_i}^T \xi)^2-\lambda)_+</math> at optimum.<br />
<br />
An important observation is that the ''i''-th feature is absent at optimum if <math>(a_i^T\xi)^2\leq \lambda</math> for every <math>\xi,\Vert \xi \Vert_2=1</math>. Hence, the feature ''i'' with <math>\Sigma_{ii}=a_i^Ta_i<\lambda</math> can be safely removed.<br />
<br />
=Block Coordinate Ascent Algorithm =<br />
There is a row-by-row algorithm applied to the problems of the form <math>min_X \ f(X)-\beta \ log(det X): \ L \leq X \leq U, X \succ 0</math>.<br />
<br />
Problem (1) can be written as <math>{\frac 1 2} {\phi}^2 = max_X \ Tr \Sigma X - \lambda \| X \|_1 - \frac 1 2 (Tr X)^2: X \succeq 0 \qquad (2)</math> .<br />
<br />
In order to apply the row by row algorithm, we need to add one more term <math>\beta \ log(det X)</math> to (2) where <math>\beta>0</math> is a penalty parameter.<br />
<br />
That is to say, we address the problem <math>\ max_X \ Tr \Sigma X - \lambda \| X \|_1 - \frac 1 2 (Tr X)^2 + \beta \ log(det X): X \succeq 0 \qquad (3)</math><br />
<br />
By matrix partitioning, we could obtain the sub-problem:<br />
<br />
<math>\phi = max_{x,y} \ 2(y^T s- \lambda \| y \|_1) +(\sigma - \lambda)x - {\frac 1 2}(t+x)^2 + \beta \ log(x-y^T Y^{\dagger} y ):y \in R(Y) \qquad (4)</math>. <br />
<br />
By taking the dual of (4), the sub-problem can be simplified to be<br />
<br />
<math> {\phi}^' = min_{u,z} {\frac 1 {\beta z}} u^T Yu - \beta (log z) + {\frac 1 2} (c+ \beta z)^2 : z>0, \| u-s \|_\infty \leq \lambda </math><br />
<br />
Since <math> \beta </math> is very small, and we want to avoid large value of <math> z </math>, we could change variable <math>r=\beta z</math>, then the optimization problem become<br />
<br />
<math> {\phi}^' - \beta (log \beta) = min_{u,r} {\frac 1 r} u^T Yu - \beta (log r) + {\frac 1 2} (c+r)^2 : r>0, \| u-s \|_\infty \leq \lambda \qquad (5)</math><br />
<br />
We can solve the sub-problem (5) by first the box constraint QP <br />
<br />
<math>R^2 := min_u u^T Yu : \| u - s \|_\infty \leq \lambda</math> <br />
<br />
and then set <math>r</math> by solving <br />
<br />
<math> min_{r>0} {\frac {R^2} r} - \beta (log r) + {\frac 1 2} (c+r)^2 </math><br />
<br />
Once the above sub-problem is solved, we can obtain the primal variables <math>y,x</math> by setting <math> y= {\frac 1 r} Y u</math> and for the diagonal element <math>x</math> we have <math> x=c+r=\sigma - \lambda -t+r </math><br />
<br />
Here is the algorithm:<br />
<br />
<br />
[[File:algorithm.jpg]]<br />
<br />
<br />
'''Convergence and complexity'''<br />
<br />
1. The algorithm is guaranteed to converge to the global optimizer.<br />
<br />
2. The complexity for the algorithm is <math>O(K \hat{n^3})</math>, where <math>K</math> is the number of sweeps through columns (fixed, typically <math>K=5</math>), and <math>(\hat{n^3})</math> is the intrinsic dimension of the data points.<br />
<br />
==Numerical examples==<br />
The algorithm is applied to two large text data sets. The NYTimes new articles data contains 300,000 articles and a dictionary of 102,660 unique words (1GB file size). And the PubMed data set has 8,200,000 abstracts with 141,043 unique words (7.8GB file size). They are too large for classical PCA to work. <br />
<br />
The authors set the target cardinality for each principal component to be 5. They claim that it only takes the algorithm about 20 seconds to search for a proper range of <math>lambda</math> for such target cardinality. In the end, the block coordinate ascent algorithm works on a covariance matrix of order at most n=500, instead of 102,660 for NYTimes data and n=1000, instead of 141,043 for the PubMed data. <br />
<br />
The top 5 sparse components for the two data sets are shown in the following tables. They claim that the sparse principle components still unambiguously identify and perfectly correspond to the topics used by ''The New York Times'' on its website.<br />
<br />
[[File:SPCA1.png|700px|center]]<br />
<br />
[[File:SPCA2.png|700px|center]]</div>Lxinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:SPCA2.png&diff=22895File:SPCA2.png2013-08-14T15:56:34Z<p>Lxin: </p>
<hr />
<div></div>Lxinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:SPCA1.png&diff=22894File:SPCA1.png2013-08-14T15:56:17Z<p>Lxin: </p>
<hr />
<div></div>Lxinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=large-Scale_Supervised_Sparse_Principal_Component_Analysis&diff=22893large-Scale Supervised Sparse Principal Component Analysis2013-08-14T15:55:56Z<p>Lxin: /* Block Coordinate Ascent Algorithm */</p>
<hr />
<div>= Introduction =<br />
<br />
The sparse PCA is a variant of the classical PCA, which assumes sparsity in the feature space. It has several advantages such as easy to interpret, and works for really high-dimensional data. The main issue about sparse PCA is that it is computationally expensive. Many algorithms have been proposed to solve the sparse PCA problem, and the authors introduced a fast block coordinate ascent algorithm with much better computational complexity.<br />
<br />
'''1 Drawbacks of Existing techniques'''<br />
<br />
Existing techniques include ad-hoc methods(e.g. factor rotation techniques, simple thresholding), greedy algorithms, SCoTLASS, the regularized SVD method, SPCA, the generalized power method. These methods are based on non-convex optimization and they don't guarantee global optimum.<br />
<br />
A semi-definite relaxation method called DSPCA can guarantee global convergence and has better performance than above algorithms, however, it is computationally expensive. <br />
<br />
'''2 Contribution of this paper'''<br />
<br />
This paper solves DSPCA in a computationally easier way, and hence it is a good solution for large scale data sets. This paper applies a block coordinate ascent algorithm with computational complexity <math>O(\hat{n^3})</math>, where <math>\hat{n}</math> is the intrinsic dimension of the data. Since <math>\hat{n}</math> could be very small compared to the dimension <math>n</math> of the data, this algorithm is computationally easy.<br />
<br />
=Primal problem =<br />
<br />
The sparse PCA problem can be formulated as <math>max_x \ x^T \Sigma x - \lambda \| x \|_0 : \| x \|_2=1</math>.<br />
<br />
This is equivalent to <math>max_z \ Tr(\Sigma Z) - \lambda \sqrt{\| Z \|_0} : Z \succeq 0, Tr Z=1, Rank(Z)=1</math>.<br />
<br />
Replacing the <math>\sqrt{\| Z \|_0}</math> with <math>\| Z \|_1</math> and dropping the rank constraint gives a relaxation of the original non-convex problem:<br />
<br />
<math>\phi = max_z Tr (\Sigma Z) - \lambda \| Z \|_1 : Z \succeq 0</math>, <math>Tr(Z)=1 \qquad (1)</math> .<br />
<br />
Fortunately, this relaxation approximates the original non-convex problem to a convex problem.<br />
<br />
Here is an important theorem used by this paper:<br />
<br />
Theorem(2.1) Let <math>\Sigma=A^T A</math> where <math>A=(a_1,a_2,......,a_n) \in {\mathbb R}^{m \times n}</math>, we have <math>\psi = max_{\| \xi \|_2=1}</math> <math>\sum_{i=1}^{n} (({a_i}^T \xi)^2 - \lambda)_+</math>. An optimal non-zero pattern corresponds to the indices <math>i</math> with <math>\lambda < (({a_i}^T \xi)^2-\lambda)_+</math> at optimum.<br />
<br />
An important observation is that the ''i''-th feature is absent at optimum if <math>(a_i^T\xi)^2\leq \lambda</math> for every <math>\xi,\Vert \xi \Vert_2=1</math>. Hence, the feature ''i'' with <math>\Sigma_{ii}=a_i^Ta_i<\lambda</math> can be safely removed.<br />
<br />
=Block Coordinate Ascent Algorithm =<br />
There is a row-by-row algorithm applied to the problems of the form <math>min_X \ f(X)-\beta \ log(det X): \ L \leq X \leq U, X \succ 0</math>.<br />
<br />
Problem (1) can be written as <math>{\frac 1 2} {\phi}^2 = max_X \ Tr \Sigma X - \lambda \| X \|_1 - \frac 1 2 (Tr X)^2: X \succeq 0 \qquad (2)</math> .<br />
<br />
In order to apply the row by row algorithm, we need to add one more term <math>\beta \ log(det X)</math> to (2) where <math>\beta>0</math> is a penalty parameter.<br />
<br />
That is to say, we address the problem <math>\ max_X \ Tr \Sigma X - \lambda \| X \|_1 - \frac 1 2 (Tr X)^2 + \beta \ log(det X): X \succeq 0 \qquad (3)</math><br />
<br />
By matrix partitioning, we could obtain the sub-problem:<br />
<br />
<math>\phi = max_{x,y} \ 2(y^T s- \lambda \| y \|_1) +(\sigma - \lambda)x - {\frac 1 2}(t+x)^2 + \beta \ log(x-y^T Y^{\dagger} y ):y \in R(Y) \qquad (4)</math>. <br />
<br />
By taking the dual of (4), the sub-problem can be simplified to be<br />
<br />
<math> {\phi}^' = min_{u,z} {\frac 1 {\beta z}} u^T Yu - \beta (log z) + {\frac 1 2} (c+ \beta z)^2 : z>0, \| u-s \|_\infty \leq \lambda </math><br />
<br />
Since <math> \beta </math> is very small, and we want to avoid large value of <math> z </math>, we could change variable <math>r=\beta z</math>, then the optimization problem become<br />
<br />
<math> {\phi}^' - \beta (log \beta) = min_{u,r} {\frac 1 r} u^T Yu - \beta (log r) + {\frac 1 2} (c+r)^2 : r>0, \| u-s \|_\infty \leq \lambda \qquad (5)</math><br />
<br />
We can solve the sub-problem (5) by first the box constraint QP <br />
<br />
<math>R^2 := min_u u^T Yu : \| u - s \|_\infty \leq \lambda</math> <br />
<br />
and then set <math>r</math> by solving <br />
<br />
<math> min_{r>0} {\frac {R^2} r} - \beta (log r) + {\frac 1 2} (c+r)^2 </math><br />
<br />
Once the above sub-problem is solved, we can obtain the primal variables <math>y,x</math> by setting <math> y= {\frac 1 r} Y u</math> and for the diagonal element <math>x</math> we have <math> x=c+r=\sigma - \lambda -t+r </math><br />
<br />
Here is the algorithm:<br />
<br />
<br />
[[File:algorithm.jpg]]<br />
<br />
<br />
'''Convergence and complexity'''<br />
<br />
1. The algorithm is guaranteed to converge to the global optimizer.<br />
<br />
2. The complexity for the algorithm is <math>O(K \hat{n^3})</math>, where <math>K</math> is the number of sweeps through columns (fixed, typically <math>K=5</math>), and <math>(\hat{n^3})</math> is the intrinsic dimension of the data points.<br />
<br />
==Numerical examples==<br />
The algorithm is applied to two large text data sets. The NYTimes new articles data contains 300,000 articles and a dictionary of 102,660 unique words (1GB file size). And the PubMed data set has 8,200,000 abstracts with 141,043 unique words (7.8GB file size). They are too large for classical PCA to work. <br />
<br />
The authors set the target cardinality for each principal component to be 5. They claim that it only takes the algorithm about 20 seconds to search for a proper range of <math>lambda</math> for such target cardinality. In the end, the block coordinate ascent algorithm works on a covariance matrix of order at most n=500, instead of 102,660 for NYTimes data and n=1000, instead of 141,043 for the PubMed data. <br />
<br />
The top 5 sparse components for the two data sets are shown in the following tables. They claim that the sparse principle components still unambiguously identify and perfectly correspond to the topics used by ''The New York Times'' on its website.</div>Lxinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=large-Scale_Supervised_Sparse_Principal_Component_Analysis&diff=22892large-Scale Supervised Sparse Principal Component Analysis2013-08-14T15:39:33Z<p>Lxin: /* Primal problem */</p>
<hr />
<div>= Introduction =<br />
<br />
The sparse PCA is a variant of the classical PCA, which assumes sparsity in the feature space. It has several advantages such as easy to interpret, and works for really high-dimensional data. The main issue about sparse PCA is that it is computationally expensive. Many algorithms have been proposed to solve the sparse PCA problem, and the authors introduced a fast block coordinate ascent algorithm with much better computational complexity.<br />
<br />
'''1 Drawbacks of Existing techniques'''<br />
<br />
Existing techniques include ad-hoc methods(e.g. factor rotation techniques, simple thresholding), greedy algorithms, SCoTLASS, the regularized SVD method, SPCA, the generalized power method. These methods are based on non-convex optimization and they don't guarantee global optimum.<br />
<br />
A semi-definite relaxation method called DSPCA can guarantee global convergence and has better performance than above algorithms, however, it is computationally expensive. <br />
<br />
'''2 Contribution of this paper'''<br />
<br />
This paper solves DSPCA in a computationally easier way, and hence it is a good solution for large scale data sets. This paper applies a block coordinate ascent algorithm with computational complexity <math>O(\hat{n^3})</math>, where <math>\hat{n}</math> is the intrinsic dimension of the data. Since <math>\hat{n}</math> could be very small compared to the dimension <math>n</math> of the data, this algorithm is computationally easy.<br />
<br />
=Primal problem =<br />
<br />
The sparse PCA problem can be formulated as <math>max_x \ x^T \Sigma x - \lambda \| x \|_0 : \| x \|_2=1</math>.<br />
<br />
This is equivalent to <math>max_z \ Tr(\Sigma Z) - \lambda \sqrt{\| Z \|_0} : Z \succeq 0, Tr Z=1, Rank(Z)=1</math>.<br />
<br />
Replacing the <math>\sqrt{\| Z \|_0}</math> with <math>\| Z \|_1</math> and dropping the rank constraint gives a relaxation of the original non-convex problem:<br />
<br />
<math>\phi = max_z Tr (\Sigma Z) - \lambda \| Z \|_1 : Z \succeq 0</math>, <math>Tr(Z)=1 \qquad (1)</math> .<br />
<br />
Fortunately, this relaxation approximates the original non-convex problem to a convex problem.<br />
<br />
Here is an important theorem used by this paper:<br />
<br />
Theorem(2.1) Let <math>\Sigma=A^T A</math> where <math>A=(a_1,a_2,......,a_n) \in {\mathbb R}^{m \times n}</math>, we have <math>\psi = max_{\| \xi \|_2=1}</math> <math>\sum_{i=1}^{n} (({a_i}^T \xi)^2 - \lambda)_+</math>. An optimal non-zero pattern corresponds to the indices <math>i</math> with <math>\lambda < (({a_i}^T \xi)^2-\lambda)_+</math> at optimum.<br />
<br />
An important observation is that the ''i''-th feature is absent at optimum if <math>(a_i^T\xi)^2\leq \lambda</math> for every <math>\xi,\Vert \xi \Vert_2=1</math>. Hence, the feature ''i'' with <math>\Sigma_{ii}=a_i^Ta_i<\lambda</math> can be safely removed.<br />
<br />
=Block Coordinate Ascent Algorithm =<br />
There is a row-by-row algorithm applied to the problems of the form <math>min_X \ f(X)-\beta \ log(det X): \ L \leq X \leq U, X \succ 0</math>.<br />
<br />
Problem (1) can be written as <math>{\frac 1 2} {\phi}^2 = max_X \ Tr \Sigma X - \lambda \| X \|_1 - \frac 1 2 (Tr X)^2: X \succeq 0 \qquad (2)</math> .<br />
<br />
In order to apply the row by row algorithm, we need to add one more term <math>\beta \ log(det X)</math> to (2) where <math>\beta>0</math> is a penalty parameter.<br />
<br />
That is to say, we address the problem <math>\ max_X \ Tr \Sigma X - \lambda \| X \|_1 - \frac 1 2 (Tr X)^2 + \beta \ log(det X): X \succeq 0 \qquad (3)</math><br />
<br />
By matrix partitioning, we could obtain the sub-problem:<br />
<br />
<math>\phi = max_{x,y} \ 2(y^T s- \lambda \| y \|_1) +(\sigma - \lambda)x - {\frac 1 2}(t+x)^2 + \beta \ log(x-y^T Y^{\dagger} y ):y \in R(Y) \qquad (4)</math>. <br />
<br />
By taking the dual of (4), the sub-problem can be simplified to be<br />
<br />
<math> {\phi}^' = min_{u,z} {\frac 1 {\beta z}} u^T Yu - \beta (log z) + {\frac 1 2} (c+ \beta z)^2 : z>0, \| u-s \|_\infty \leq \lambda </math><br />
<br />
Since <math> \beta </math> is very small, and we want to avoid large value of <math> z </math>, we could change variable <math>r=\beta z</math>, then the optimization problem become<br />
<br />
<math> {\phi}^' - \beta (log \beta) = min_{u,r} {\frac 1 r} u^T Yu - \beta (log r) + {\frac 1 2} (c+r)^2 : r>0, \| u-s \|_\infty \leq \lambda \qquad (5)</math><br />
<br />
We can solve the sub-problem (5) by first the box constraint QP <br />
<br />
<math>R^2 := min_u u^T Yu : \| u - s \|_\infty \leq \lambda</math> <br />
<br />
and then set <math>r</math> by solving <br />
<br />
<math> min_{r>0} {\frac {R^2} r} - \beta (log r) + {\frac 1 2} (c+r)^2 </math><br />
<br />
Once the above sub-problem is solved, we can obtain the primal variables <math>y,x</math> by setting <math> y= {\frac 1 r} Y u</math> and for the diagonal element <math>x</math> we have <math> x=c+r=\sigma - \lambda -t+r </math><br />
<br />
Here is the algorithm:<br />
<br />
<br />
[[File:algorithm.jpg]]<br />
<br />
<br />
'''Convergence and complexity'''<br />
<br />
1. The algorithm is guaranteed to converge to the global optimizer.<br />
<br />
2. The complexity for the algorithm is <math>O(K \hat{n^3})</math>, where <math>K</math> is the number of sweeps through columns (fixed, typically <math>K=5</math>), and <math>(\hat{n^3})</math> is the intrinsic dimension of the data points.</div>Lxinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=large-Scale_Supervised_Sparse_Principal_Component_Analysis&diff=22891large-Scale Supervised Sparse Principal Component Analysis2013-08-14T15:34:24Z<p>Lxin: /* Primal problem */</p>
<hr />
<div>= Introduction =<br />
<br />
The sparse PCA is a variant of the classical PCA, which assumes sparsity in the feature space. It has several advantages such as easy to interpret, and works for really high-dimensional data. The main issue about sparse PCA is that it is computationally expensive. Many algorithms have been proposed to solve the sparse PCA problem, and the authors introduced a fast block coordinate ascent algorithm with much better computational complexity.<br />
<br />
'''1 Drawbacks of Existing techniques'''<br />
<br />
Existing techniques include ad-hoc methods(e.g. factor rotation techniques, simple thresholding), greedy algorithms, SCoTLASS, the regularized SVD method, SPCA, the generalized power method. These methods are based on non-convex optimization and they don't guarantee global optimum.<br />
<br />
A semi-definite relaxation method called DSPCA can guarantee global convergence and has better performance than above algorithms, however, it is computationally expensive. <br />
<br />
'''2 Contribution of this paper'''<br />
<br />
This paper solves DSPCA in a computationally easier way, and hence it is a good solution for large scale data sets. This paper applies a block coordinate ascent algorithm with computational complexity <math>O(\hat{n^3})</math>, where <math>\hat{n}</math> is the intrinsic dimension of the data. Since <math>\hat{n}</math> could be very small compared to the dimension <math>n</math> of the data, this algorithm is computationally easy.<br />
<br />
=Primal problem =<br />
<br />
The sparse PCA problem can be formulated as <math>max_x \ x^T \Sigma x - \lambda \| x \|_0 : \| x \|_2=1</math>.<br />
<br />
This is equivalent to <math>max_z \ Tr(\Sigma Z) - \lambda \sqrt{\| Z \|_0} : Z \succeq 0, Tr Z=1, Rank(Z)=1</math>.<br />
<br />
Replacing the <math>\sqrt{\| Z \|_0}</math> with <math>\| Z \|_1</math> and dropping the rank constraint gives a relaxation of the original non-convex problem:<br />
<br />
<math>\phi = max_z Tr (\Sigma Z) - \lambda \| Z \|_1 : Z \succeq 0</math>, <math>Tr(Z)=1 \qquad (1)</math> .<br />
<br />
Fortunately, this relaxation approximates the original non-convex problem to a convex problem.<br />
<br />
Here is an important theorem used by this paper:<br />
<br />
Theorem(2.1) Let <math>\Sigma=A^T A</math> where <math>A=(a_1,a_2,......,a_n) \in {\mathbb R}^{m \times n}</math>, we have <math>\psi = max_{\| \xi \|_2=1}</math> <math>\sum_{i=1}^{n} (({a_i}^T \xi)^2 - \lambda)_+</math>. An optimal non-zero pattern corresponds to the indices <math>i</math> with <math>\lambda < (({a_i}^T \xi)^2-\lambda)_+</math> at optimum.<br />
<br />
=Block Coordinate Ascent Algorithm =<br />
There is a row-by-row algorithm applied to the problems of the form <math>min_X \ f(X)-\beta \ log(det X): \ L \leq X \leq U, X \succ 0</math>.<br />
<br />
Problem (1) can be written as <math>{\frac 1 2} {\phi}^2 = max_X \ Tr \Sigma X - \lambda \| X \|_1 - \frac 1 2 (Tr X)^2: X \succeq 0 \qquad (2)</math> .<br />
<br />
In order to apply the row by row algorithm, we need to add one more term <math>\beta \ log(det X)</math> to (2) where <math>\beta>0</math> is a penalty parameter.<br />
<br />
That is to say, we address the problem <math>\ max_X \ Tr \Sigma X - \lambda \| X \|_1 - \frac 1 2 (Tr X)^2 + \beta \ log(det X): X \succeq 0 \qquad (3)</math><br />
<br />
By matrix partitioning, we could obtain the sub-problem:<br />
<br />
<math>\phi = max_{x,y} \ 2(y^T s- \lambda \| y \|_1) +(\sigma - \lambda)x - {\frac 1 2}(t+x)^2 + \beta \ log(x-y^T Y^{\dagger} y ):y \in R(Y) \qquad (4)</math>. <br />
<br />
By taking the dual of (4), the sub-problem can be simplified to be<br />
<br />
<math> {\phi}^' = min_{u,z} {\frac 1 {\beta z}} u^T Yu - \beta (log z) + {\frac 1 2} (c+ \beta z)^2 : z>0, \| u-s \|_\infty \leq \lambda </math><br />
<br />
Since <math> \beta </math> is very small, and we want to avoid large value of <math> z </math>, we could change variable <math>r=\beta z</math>, then the optimization problem become<br />
<br />
<math> {\phi}^' - \beta (log \beta) = min_{u,r} {\frac 1 r} u^T Yu - \beta (log r) + {\frac 1 2} (c+r)^2 : r>0, \| u-s \|_\infty \leq \lambda \qquad (5)</math><br />
<br />
We can solve the sub-problem (5) by first the box constraint QP <br />
<br />
<math>R^2 := min_u u^T Yu : \| u - s \|_\infty \leq \lambda</math> <br />
<br />
and then set <math>r</math> by solving <br />
<br />
<math> min_{r>0} {\frac {R^2} r} - \beta (log r) + {\frac 1 2} (c+r)^2 </math><br />
<br />
Once the above sub-problem is solved, we can obtain the primal variables <math>y,x</math> by setting <math> y= {\frac 1 r} Y u</math> and for the diagonal element <math>x</math> we have <math> x=c+r=\sigma - \lambda -t+r </math><br />
<br />
Here is the algorithm:<br />
<br />
<br />
[[File:algorithm.jpg]]<br />
<br />
<br />
'''Convergence and complexity'''<br />
<br />
1. The algorithm is guaranteed to converge to the global optimizer.<br />
<br />
2. The complexity for the algorithm is <math>O(K \hat{n^3})</math>, where <math>K</math> is the number of sweeps through columns (fixed, typically <math>K=5</math>), and <math>(\hat{n^3})</math> is the intrinsic dimension of the data points.</div>Lxinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=the_Indian_Buffet_Process:_An_Introduction_and_Review&diff=22890the Indian Buffet Process: An Introduction and Review2013-08-14T14:21:09Z<p>Lxin: /* Properties of the distribution */</p>
<hr />
<div>The Indian Buffet Process (IBP) is a Bayesian nonparametric model that defines a prior measure on infinite binary matrices.<br />
Unlike the Dirichlet process (DP), where the atom weights are negatively correlated, the IBP treats each atom as independent.<br />
<br />
<br />
==Introduction==<br />
IBP is often used in factor analysis as a prior over infinitely many factors.<br />
<br />
The IBP can be viewed as an extension of the DP in which the constraint <math> \sum_{i=1}^{\infty}{\pi_i}=1 </math> is dropped.<br />
<br />
Because this constraint is dropped, the IBP cannot be used directly as a prior for mixture models. <br />
<br />
==Representations==<br />
Like DP, IBP has several representations.<br />
<br />
===the limit of a finite distribution on sparse binary feature matrices===<br />
We have N data points and K features, and the possession of feature k by data point i is indicated by a binary variable <math> z_{ik} </math>.<br />
The generative process for the binary feature matrix is defined as follows:<br />
<br />
* for each feature k<br />
** <math>\pi_k</math> ~ <math> Beta(\frac{\alpha}{K},1) </math><br />
** for each data point i<br />
*** <math>z_{ik}</math> ~ <math> Bernoulli(\pi_k) </math><br />
<br />
where <math> \alpha </math> is a hyper-parameter similar to the concentration parameter of the DP. <br />
As K goes to infinity, this generative process becomes the IBP.<br />
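As an illustration, a minimal Python/NumPy sketch of sampling <math>Z</math> from this finite model (the parameter values below are arbitrary):<br />
<pre>
import numpy as np

def sample_finite_feature_matrix(N, K, alpha, seed=None):
    """Finite beta-Bernoulli model that converges to the IBP as K grows."""
    rng = np.random.default_rng(seed)
    pi = rng.beta(alpha / K, 1.0, size=K)       # one pi_k per feature
    return rng.binomial(1, pi, size=(N, K))     # z_ik ~ Bernoulli(pi_k)

Z = sample_finite_feature_matrix(N=10, K=1000, alpha=2.0, seed=0)
print(Z.sum(axis=1))  # features per data point, roughly Poisson(alpha) for large K
</pre>
<br />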
<br />
===stick-breaking construction===<br />
<br />
*For each feature k<br />
**<math>\mu_k </math> ~ <math> Beta(\alpha,1) </math> <br />
**<math>\pi_{k}=\prod_{l=1}^{k}(\mu_l) </math><br />
**For each data point i<br />
***<math>z_{ik}</math> ~ <math> Bernoulli(\pi_k) </math><br />
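<br />
A minimal sketch of this construction in Python/NumPy, truncated at an assumed level K so that only a finite slice of the infinite matrix is simulated:<br />
<pre>
import numpy as np

def stick_breaking_ibp(N, K, alpha, seed=None):
    """First K columns of an IBP matrix via the stick-breaking construction."""
    rng = np.random.default_rng(seed)
    mu = rng.beta(alpha, 1.0, size=K)   # mu_k ~ Beta(alpha, 1)
    pi = np.cumprod(mu)                 # pi_k = prod_{l<=k} mu_l, decreasing in k
    return rng.binomial(1, pi, size=(N, K))
</pre>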
<br />
===the Indian buffet metaphor===<br />
N customers enter a restaurant one after another. Each customer encounters a buffet consisting of infinitely many dishes arranged in a line.<br />
The first customer starts at the left of the buffet and takes a serving from each dish, stopping after a <math>Poisson(\alpha)</math> number of dishes as his plate becomes overburdened.<br />
<br />
The ith customer moves along the buffet, sampling dishes in proportion to their popularity, serving himself with probability <math> \frac{m_k}{i} </math>, where <math>m_k</math> is the number of previous customers who have sampled dish k. Having reached the end of all previously sampled dishes, the ith customer then tries a <math>Poisson(\frac{\alpha}{i})</math> number of new dishes.<br />
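<br />
The metaphor translates directly into a generative sampler. Below is a minimal Python/NumPy sketch that simulates N customers and returns the resulting binary matrix (illustrative only):<br />
<pre>
import numpy as np

def indian_buffet_process(N, alpha, seed=None):
    """Simulate a draw from IBP(alpha): rows are customers, columns are dishes."""
    rng = np.random.default_rng(seed)
    rows, counts = [], []                 # counts[k] = customers who took dish k so far
    for i in range(1, N + 1):
        # existing dishes: customer i takes dish k with probability m_k / i
        row = [int(rng.random() < m / i) for m in counts]
        for k, z in enumerate(row):
            counts[k] += z
        # new dishes: a Poisson(alpha / i) number of them
        n_new = rng.poisson(alpha / i)
        row += [1] * n_new
        counts += [1] * n_new
        rows.append(row)
    Z = np.zeros((N, len(counts)), dtype=int)
    for i, row in enumerate(rows):
        Z[i, :len(row)] = row
    return Z

Z = indian_buffet_process(N=20, alpha=3.0, seed=1)
print(Z.shape)  # the number of columns is roughly Poisson(alpha * H_N)
</pre>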
<br />
==Properties of the distribution==<br />
*The effective dimension of the distribution, which is the number of columns with at least one non-zero component, follows a <math>Poisson(\alpha H_N)</math> distribution, where H<sub>N</sub> is the ''N''th harmonic number, i.e. <math>H_N=\sum_{j=1}^N \frac{1}{j}</math>.<br />
*The number of features possessed by each object follows a <math>Poisson(\alpha)</math> distribution. This follows from the exchangeability of the IBP. <br />
*The binary matrix generated from the IBP remains sparse as <math>K\rightarrow \infty</math>. In fact, <math>\lim_{K\rightarrow \infty}E[1^T Z1]=N\alpha</math>.<br />
<br />
==Inference==<br />
<br />
As with the DP, we can use an MCMC sampling framework based on the stick-breaking construction or the Indian buffet metaphor to simulate random samples from the IBP.<br />
<br />
We can use Gibbs sampling to generate samples from the IBP.<br />
<br />
Choosing an ordering on the data points such that the ith data point corresponds to the last customer to visit the buffet, we obtain:<br />
<math><br />
p(z_{ik}=1|z_{-ik})=\frac{m_{-i,k}}{N}<br />
</math><br />
for any feature k such that <math>m_{-i,k}>0</math>, where <math> z_{-ik} </math> denotes the set of assignments of the other data points (excluding data point i) for feature k and <math> m_{-i,k} </math> is the number of data points possessing feature k, excluding i.<br />
Similarly, the number of new features associated with data point i should be drawn from a <math> Poisson(\frac{\alpha}{N}) </math> distribution.<br />
<br />
==Application==<br />
<br />
Because each atom in the IBP is independent, the IBP can be used as a building block to construct hierarchical mixture models.<br />
<br />
In the Indian Buffet Process compound Dirichlet process model (IBPCDP), the authors proposed using the IBP to select independent weights and then using a DP to normalize these weights. The model is closely related to the hierarchical Dirichlet process (HDP), in which both the top-level and bottom-level distributions are drawn from a DP.<br />
<br />
==References==<br />
<br />
Williamson, Sinead, et al. "The IBP compound Dirichlet process and its application to focused topic modeling." (2010).<br />
<br />
Griffiths, Thomas L., and Zoubin Ghahramani. "The Indian Buffet Process: An Introduction and Review." Journal of Machine Learning Research 12 (2011): 1185-1224.</div>Lxinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=nonparametric_Latent_Feature_Models_for_Link_Prediction&diff=22875nonparametric Latent Feature Models for Link Prediction2013-08-12T16:20:16Z<p>Lxin: /* Discussion */</p>
<hr />
<div>==Introduction==<br />
The goal of this paper <ref>Kurt T. Miller, Thomas L. Griffiths, and Michael I. Jordan. Nonparametric latent feature models for link prediction. NIPS, 2009</ref> is link prediction for a partially observed network: we observe the links (1 or 0) between some pairs of the nodes in a network and we try to predict the unobserved links. Basically, it builds the model by extracting a latent structure that represents the properties of the individual entities. Unlike the latent space model <ref>Peter D. Hoff, Adrian E. Raftery, and Mark S. Handcock. Latent space approaches to social network analysis. JASA, 97(460):1090-1098.</ref>, which tries to find the location (arbitrary real-valued features) of each node in a latent space, the "latent feature" here mainly refers to a class-based (binary features) representation. They assume a finite number of classes that entities can belong to, and the interactions between classes determine the structure of the network. More specifically, the probability of forming a link depends only on the classes of the corresponding pair of nodes. The idea is fairly similar to the stochastic blockmodel <ref>Krzysztof Nowicki and Tom A. B. Snijders. Estimation and prediction for stochastic blockstructures. JASA, 96(455):1077-1087, 2001. </ref> <ref>Edoardo M. Airoldi, David M. Blei, Eric P. Xing, and Stephen E. Fienberg. Mixed membership stochastic block models. In Advances in Neural Information Processing Systems. 2009.</ref>. However, blockmodels are mainly for community detection/network clustering, not link prediction. This paper fills in that gap.<br />
<br />
The ''nonparametric latent feature relational model'' is a Bayesian nonparametric model in which each node has binary-valued latent features that influence its relations to other nodes. Known covariate information can also be incorporated. The model can simultaneously infer the number (dimension) of latent features, the values of the features for each node, and how the features influence the links.<br />
<br />
==The nonparametric latent feature relational model==<br />
A directed network is considered here. Let <math>Y</math> be the <math>N\times N</math> binary adjacency matrix of the network. The component <math>y_{ij}=1</math> if there is a link from node <math>i</math> to node <math>j</math> and <math>y_{ij}=0</math> if there is no link. The components corresponding to unobserved links are left unfilled. The goal is to learn from the observed links so that we can predict the unfilled entries.<br />
<br />
===Basic model===<br />
Let <math>Z</math> denote the latent features, where <math>Z</math> is a <math>N\times K</math> binary matrix. Each row of <math>Z</math> corresponds to a node and each column corresponds to a latent feature, such that <math>z_{ik}=1</math> if the <math>i^{th}</math> node has feature <math>k</math> and 0 otherwise. Let <math>Z_i</math> denote the <math>i^{th}</math> row of <math>Z</math> (the feature vector corresponding to node ''i''). Let ''W'' be a <math>K\times K</math> real-valued weight matrix where <math>w_{kk^\prime}</math> is the weight that affects the probability of a link when the corresponding nodes have features <math>k</math> and <math>k^\prime</math>, respectively. Assuming the link probabilities are conditionally independent given the latent features and the weights, the likelihood function can be written as:<br />
<center><br />
<math><br />
Pr(Y|Z, W)=\prod_{i,j}Pr(y_{ij}|Z_i,Z_j, W), \qquad Pr(y_{ij}=1|Z_i,Z_j, W)=\sigma(Z_i W Z_j^{\top})=\sigma\Big(\sum_{k,k^\prime}z_{ik}z_{jk^\prime}w_{kk^\prime}\Big)<br />
</math><br />
</center><br />
where <math>\sigma(\cdot)</math> is a function that maps values from <math>(-\infty,\infty)</math> to <math>(0,1)</math>, such as the logistic function.<br />
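<br />
As a concrete illustration (not the authors' code), taking <math>\sigma</math> to be the logistic function, the full matrix of link probabilities can be computed in one line; <code>Z</code> and <code>W</code> below are small made-up arrays:<br />
<pre>
import numpy as np

def link_probabilities(Z, W):
    """P[i, j] = sigma(Z_i W Z_j^T) for all pairs, with the logistic sigma."""
    logits = Z @ W @ Z.T
    return 1.0 / (1.0 + np.exp(-logits))

# toy example: 4 nodes, 2 latent features
Z = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
W = np.array([[ 2.0, -1.0],
              [-1.0,  3.0]])   # w_kk': positive entries encourage links
print(np.round(link_probabilities(Z, W), 2))
</pre>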
<br />
Prior distributions are assumed for the latent features and the weight matrix: ''Z'' is generated by the Indian Buffet Process <ref>Thomas L. Griffiths and Zoubin Ghahramani. Infinite latent feature models and the Indian Buffet Process. In Advances in Neural Information Processing Systems, 2007</ref> and each component of ''W'' has a normal prior, i.e.<br />
<center><br />
<math><br />
Z \sim IBP(\alpha)<br />
<br />
</math><br />
</center><br />
<center><br />
<math><br />
w_{kk^'}\sim \mathcal{N}(0, \sigma_w^2)<br />
</math><br />
</center><br />
<br />
===Full model===<br />
Covariate information for each node can be incorporated into the model. The full nonparametric latent feature relational model is <br />
<center><br />
<math><br />
Pr(y_{ij}=1|Z, W, X, \beta, a, b, c)=\sigma(Z_i W Z_j^\top +\beta^\top X_{ij}+(\beta_p^\top X_{p,i}+a_i)+(\beta_c^\top X_{c,i}+b_i)+c)<br />
</math><br />
</center><br />
where <math>X_{p,i},X_{c,j}</math> are known covariate vectors when nodes ''i'' and ''j'' are the link parent and child, respectively; <math>X_{ij}</math> is a vector of interaction effects; <math>\beta, \beta_p, \beta_c, a, b</math> are coefficients and offsets, all assumed to be normally distributed. We drop the corresponding terms if no information is available.<br />
<br />
===Generalizations===<br />
The model can easily be generalized to multiple relations instead of a single relation. The latent features stay the same, but an independent weight matrix <math>W^i</math> is used for each relation <math>Y^i</math>. Covariates may be relation-specific or common across all relations. By taking the weight matrix to be symmetric, the model can handle undirected networks.<br />
<br />
==Inference==<br />
Exact inference for the proposed nonparametric latent feature model is infeasible. The authors adopt Markov Chain Monte Carlo (MCMC) for approximate inference (posterior inference on ''Z'' and ''W''). They alternately sample ''Z'' and ''W''. During the procedure, the all-zero columns of ''Z'' are dropped, since they do not provide any information.<br />
<br />
1. Given ''W'', resample ''Z''<br />
<br />
Since the IBP is exchangeable, when sampling the <math>i^{th}</math> row of ''Z'' they can assume that the <math>i^{th}</math> customer is the last one in the process. Let <math>m_k</math> denote the number of non-zero entries in column ''k''; the component <math>z_{ik}</math> is sampled by<br />
<br />
<center><br />
<math><br />
Pr(z_{ik}=1|Z_{-ik},W,Y) \propto m_k Pr(Y|z_{ik}=1,Z_{-ik},W)<br />
</math><br />
</center><br />
Regarding the number of new features, they use the fact that in the IBP, the prior distribution on the number of new features for the last customer is <math>Poisson(\alpha/N)</math>. They note that the number of new features should be weighted by the corresponding likelihood term.<br />
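<br />
For intuition, here is a minimal Python/NumPy sketch of one such Gibbs update for an existing feature; <code>log_likelihood(Y, Z, W)</code> stands in for the model's Bernoulli log-likelihood and is a placeholder, not part of the paper:<br />
<pre>
import numpy as np

def resample_zik(Y, Z, W, i, k, log_likelihood, rng):
    """One Gibbs update for z_ik given W (sketch, not the authors' code)."""
    N = Z.shape[0]
    m_k = Z[:, k].sum() - Z[i, k]            # feature count excluding row i
    log_post = np.empty(2)
    for value, prior in ((0, (N - m_k) / N), (1, m_k / N)):
        Z[i, k] = value
        log_post[value] = np.log(prior + 1e-12) + log_likelihood(Y, Z, W)
    p1 = 1.0 / (1.0 + np.exp(log_post[0] - log_post[1]))  # normalize in log space
    Z[i, k] = int(rng.random() < p1)
    return Z
</pre>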
<br />
2. Given ''Z'', resample ''W''<br />
<br />
They sequentially resample each of the weights in ''W'' that correspond to non-zero features and drop the ones corresponding to all-zero features. The difficulty is that there is no conjugate prior on ''W'', so directly resampling ''W'' from its posterior is infeasible. Auxiliary-variable sampling tricks and MCMC procedures are used instead.<br />
<br />
3. Other issues<br />
<br />
Conjugate priors may be placed on the hyperparameters as well. In the case of multiple relations, one can sample ''W<sub>i</sub>'' given ''Z'' independently for each ''i''. In the full model, the posterior updates for the coefficients and intercepts are independent.<br />
<br />
==Simulations and real data results==<br />
===Synthetic data===<br />
The basic model is applied to simple synthetic datasets generated from known features (shown in Figure 1(a), (c)). ''W'' is initialized randomly. The basic model is able to attain 100% accuracy on held-out data. However, the experiment reveals that the model is not able to recover the underlying latent features. This is due to subtle interactions (confounding) between sets of features and weights, so the inferred features will not in general correspond to interpretable features. It also indicates that there are local optima in the feature space, which means a good initialization is necessary.<br />
<br />
[[File:NLFMfig1.png|700px|center]]<br />
<br />
===Multi-relational datasets===<br />
In this section, the model is applied to several datasets from the Infinite Relational Model (IRM) paper <ref>Charles Kemp, Joshua B. Tenenbaum, Thomas L. Griffiths, Takeshi Yamada, and Naonori Ueda. Learning systems of concepts with an infinite relational model. In Proceedings of the American Association for Artificial Intelligence (AAAI), 2006.</ref>. One dataset contains 54 relations of 14 countries along with 90 given features of the countries. Another dataset contains 26 kinship relationships of 104 people in the Alyawarra tribe in Central Australia. The model is compared to two other class-based algorithms, the IRM and the MMSB (Mixed Membership Stochastic Blockmodel). <br />
<br />
For each dataset, 80% of the data are used as the training set and the AUC (area under the ROC curve) is reported for the held-out data (the remaining 20%). Note that the closer the AUC is to 1, the better. For the latent feature relational model, either a random feature matrix or class-based features from the IRM are used as initializations. The following table shows the results. It can be seen that the LFRM out-performs both the IRM and the MMSB. <br />
<br />
[[File:NLFMfig2.png|700px|center]]<br />
<br />
===Predicting NIPS coauthorship===<br />
The LFRM is applied to the NIPS dataset, which contains a list of all papers and authors from NIPS 1-17. The 234 authors who published with the most other people are investigated. Again, 80% of the data is used as the training set and the remaining 20% as the test set. The figure below clearly shows that the LFRM performs better than the IRM and MMSB. The AUC values are LFRM w/IRM 0.9509 > LFRM rand 0.9466 > IRM 0.8906 > MMSB 0.8705. <br />
<br />
[[File:NLFMfig3.png|700px|center]]<br />
<br />
==Conclusion==<br />
In this paper, a nonparametric latent feature relational model is proposed for inferring latent binary features of relational entities and for link prediction. The model combines latent feature modeling of networks with Bayesian nonparametric inference. It can infer the dimension of the feature space simultaneously with the feature values of the entities. The model performs better than established class-based models, e.g. the IRM and MMSB, because it is a richer and more complex model. <br />
<br />
==Discussion==<br />
1. The model sets up a new framework for network modeling. <br />
<br />
2. It performs well in terms of estimation and prediction. <br />
<br />
3. However, the inference algorithm is quite complicated. <br />
<br />
4. The algorithm depends highly on the initial values, which means one needs to run another algorithm, e.g. the IRM, to obtain them.<br />
<br />
5. The inferred latent features are not interpretable due to the confounding with the weight matrix.<br />
<br />
==References==<br />
<references/></div>Lxinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=nonparametric_Latent_Feature_Models_for_Link_Prediction&diff=22873nonparametric Latent Feature Models for Link Prediction2013-08-12T16:19:18Z<p>Lxin: /* Conclusion */</p>
<hr />
<div>==Introduction==<br />
The goal of this paper <ref>Kurt T. Miller, Thomas L. Griffiths, and Michael I. Jordan. Nonparametric latent feature models for link prediction. NIPS, 2009</ref> is link prediction for a partially observed network, i.e. we observe the links (1 or 0) between some pairs of the nodes in a network and we try to predict the unobserved links. The model is built by extracting a latent structure that represents the properties of the individual entities. Unlike the latent space model <ref>Peter D. Hoff, Adrian E. Raftery, and Mark S. Handcock. Latent space approaches to social network analysis. JASA, 97(460):1090-1098.</ref>, which tries to find the location (arbitrary real-valued features) of each node in a latent space, the "latent feature" here mainly refers to a class-based (binary feature) representation. They assume a finite number of classes that entities can belong to, and the interactions between classes determine the structure of the network. More specifically, the probability of forming a link depends only on the classes of the corresponding pair of nodes. The idea is fairly similar to the stochastic blockmodel <ref>Krzysztof Nowicki and Tom A. B. Snijders. Estimation and prediction for stochastic blockstructures. JASA, 96(455):1077-1087, 2001. </ref> <ref>Edoardo M. Airoldi, David M. Blei, Eric P. Xing, and Stephen E. Fienberg. Mixed membership stochastic block models. In Advances in Neural Information Processing Systems. 2009.</ref>. However, blockmodels are designed mainly for community detection/network clustering rather than link prediction. This paper fills that gap.<br />
<br />
The ''nonparametric latent feature relational model'' is a Bayesian nonparametric model in which each node has binary-valued latent features that influence its relations to other nodes. Known covariate information can also be incorporated. The model can simultaneously infer the number (dimension) of latent features, the feature values for each node, and how the features influence the links.<br />
<br />
==The nonparametric latent feature relational model==<br />
A directed network is considered here. Let <math>Y</math> be the <math>N\times N</math> binary adjacency matrix of the network. The component <math>y_{ij}=1</math> if there is a link from node <math>i</math> to node <math>j</math> and <math>y_{ij}=0</math> if there is no link. The components corresponding to unobserved links are left unfilled. The goal is to learn from the observed links so that we can predict the unfilled entries.<br />
<br />
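For illustration, a partially observed adjacency matrix of this kind can be represented as follows; storing unobserved entries as NaN is an assumption of this sketch, not something specified in the paper.<br />
<pre>
# Illustrative representation of a partially observed directed network;
# the NaN convention for unobserved links is an assumption of this sketch.
import numpy as np

N = 5
Y = np.full((N, N), np.nan)   # every pair starts out unobserved
Y[0, 1] = 1                   # observed link from node 0 to node 1
Y[2, 3] = 0                   # observed absence of a link
observed = ~np.isnan(Y)       # mask of observed entries; the rest are to be predicted
</pre>
<br />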
===Basic model===<br />
Let <math>Z</math> denote the latent features, where <math>Z</math> is an <math>N\times K</math> binary matrix. Each row of <math>Z</math> corresponds to a node and each column corresponds to a latent feature, such that <math>z_{ik}=1</math> if the <math>i^{th}</math> node has feature <math>k</math> and <math>z_{ik}=0</math> otherwise. Let <math>Z_i</math> denote the <math>i^{th}</math> row of <math>Z</math> (the feature vector corresponding to node ''i''). Let ''W'' be a <math>K\times K</math> real-valued weight matrix where <math>w_{kk^\prime}</math> is the weight that affects the probability of a link when the corresponding nodes have features <math>k</math> and <math>k^\prime</math>, respectively. Assuming the links are conditionally independent given the latent features and the weights, the likelihood function can be written as:<br />
<center><br />
<math><br />
Pr(Y|Z, W)=\prod_{i,j}Pr(y_{ij}|Z_i,Z_j, W), \qquad Pr(y_{ij}=1|Z_i,Z_j, W)=\sigma(Z_i W Z_j^{\top})=\sigma\left(\sum_{k,k^\prime}z_{ik}z_{jk^\prime}w_{kk^\prime}\right)<br />
</math><br />
</center><br />
where <math>\sigma(\cdot)</math> is a function that maps values from <math>(-\infty,\infty)</math> to <math>(0,1)</math>, e.g. the logistic (sigmoid) function <math>\sigma(x)=1/(1+e^{-x})</math>.<br />
<br />
Prior distributions are placed on the latent features and the weight matrix: ''Z'' is generated by the Indian Buffet Process (IBP) <ref>Thomas L. Griffiths and Zoubin Ghahramani. Infinite latent feature models and the Indian Buffet Process. In Advances in Neural Information Processing Systems, 2007</ref> and the components of ''W'' have independent normal priors, i.e.<br />
<center><br />
<math><br />
Z \sim IBP(\alpha)<br />
<br />
</math><br />
</center><br />
<center><br />
<math><br />
w_{kk^\prime}\sim \mathcal{N}(0, \sigma_w^2)<br />
</math><br />
</center><br />
<br />
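To make the generative process concrete, the following is a minimal simulation sketch of the basic model under two assumptions not fixed by the text above: <math>\sigma(\cdot)</math> is taken to be the logistic function, and the IBP is simulated with its standard sequential ("restaurant") construction.<br />
<pre>
# Minimal generative sketch of the basic model (illustration only, not the
# authors' code): Z ~ IBP(alpha), w_{kk'} ~ N(0, sigma_w^2), sigmoid link.
import numpy as np

rng = np.random.default_rng(0)

def sample_ibp(N, alpha):
    """Draw a binary feature matrix Z ~ IBP(alpha) for N nodes."""
    Z = np.zeros((N, 0), dtype=int)
    m = np.zeros(0)                          # how many earlier nodes own each feature
    for i in range(N):
        if Z.shape[1] > 0:
            Z[i] = rng.random(Z.shape[1]) < m / (i + 1)   # reuse popular features
        k_new = rng.poisson(alpha / (i + 1))              # brand-new features
        if k_new > 0:
            Z = np.hstack([Z, np.zeros((N, k_new), dtype=int)])
            Z[i, -k_new:] = 1
        m = Z[: i + 1].sum(axis=0)
    return Z

def simulate_network(N=20, alpha=3.0, sigma_w=1.0):
    Z = sample_ibp(N, alpha)
    W = rng.normal(0.0, sigma_w, size=(Z.shape[1], Z.shape[1]))
    P = 1.0 / (1.0 + np.exp(-(Z @ W @ Z.T)))              # sigma(Z_i W Z_j^T)
    Y = (rng.random((N, N)) < P).astype(int)
    return Z, W, Y
</pre>
<br />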
===Full model===<br />
The covariate information of each node can be incorporated into the model. The full nonparametric latent feature relational model is <br />
<center><br />
<math><br />
Pr(y_{ij}=1|Z, W, X, \beta, a, b, c)=\sigma(Z_i W Z_j^\top +\beta^\top X_{ij}+(\beta_p^\top X_{p,i}+a_i)+(\beta_c^\top X_{c,j}+b_j)+c)<br />
</math><br />
</center><br />
where <math>X_{p,i}</math> and <math>X_{c,j}</math> are the known covariate vectors of node ''i'' as a link parent and of node ''j'' as a link child, respectively; <math>X_{ij}</math> is a vector of interaction effects; and <math>\beta, \beta_p, \beta_c, a, b</math> are coefficients and offsets, all of which are assumed to be normally distributed. The corresponding terms are dropped if no such information is available.<br />
<br />
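As an illustration of how these terms combine, here is a hedged sketch of the full-model link probability for a single pair <math>(i,j)</math>; the variable names are purely illustrative and the logistic function is again assumed for <math>\sigma(\cdot)</math>.<br />
<pre>
# Hedged sketch of the full-model predictor for one pair (i, j); names are
# illustrative, not taken from the paper's code.
import numpy as np

def link_prob(i, j, Z, W, X_pair, X_p, X_c, beta, beta_p, beta_c, a, b, c):
    logit = (Z[i] @ W @ Z[j]            # latent feature interactions
             + beta @ X_pair[i, j]      # pairwise covariates
             + beta_p @ X_p[i] + a[i]   # parent covariates and offset
             + beta_c @ X_c[j] + b[j]   # child covariates and offset
             + c)                       # global offset
    return 1.0 / (1.0 + np.exp(-logit))
</pre>
<br />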
===Generalizations===<br />
The model can easily be generalized to multiple relations instead of a single relation. The latent features stay the same, but an independent weight matrix <math>W^i</math> is used for each relation <math>Y^i</math>. Covariates may be relation-specific or common across all relations. By constraining the weight matrix to be symmetric, the model can also handle undirected networks.<br />
<br />
==Inference==<br />
Exact inference for the proposed nonparametric latent feature model is infeasible. The authors adopt Markov Chain Monte Carlo (MCMC) for approximate posterior inference on ''Z'' and ''W''. They alternately sample ''Z'' and ''W''. During the procedure, all-zero columns of ''Z'' are dropped, since they do not provide any information.<br />
<br />
1. Given ''W'', resample ''Z''<br />
<br />
Since the IBP is exchangeable, when sampling the <math>i^{th}</math> row of ''Z'' they can treat the <math>i^{th}</math> customer as the last one in the process. Letting <math>m_k</math> denote the number of non-zero entries in column ''k'' (excluding row ''i''), the component <math>z_{ik}</math> is sampled by<br />
<br />
<center><br />
<math><br />
Pr(z_{ik}=1|Z_{-ik},W,Y) \propto m_k Pr(Y|z_{ik}=1,Z_{-ik},W)<br />
</math><br />
</center><br />
Regarding the number of new features, they use the fact that in the IBP the prior distribution on the number of new features for the last customer is <math>\mathrm{Poisson}(\alpha/N)</math>; this prior is weighted by the corresponding likelihood term when sampling how many new features to add.<br />
<br />
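As an illustration of this step, the following is a hedged sketch of the Gibbs update for a single entry <math>z_{ik}</math> under the basic model; the likelihood helper and the treatment of singleton features are simplifications assumed here, not the paper's exact implementation.<br />
<pre>
# Hedged sketch of one Gibbs update for z_ik in the basic model. For
# simplicity the likelihood treats every entry of Y as observed.
import numpy as np

def log_lik(Z, W, Y):
    """log Pr(Y | Z, W) under the sigmoid link of the basic model."""
    p = 1.0 / (1.0 + np.exp(-(Z @ W @ Z.T)))
    return np.sum(Y * np.log(p) + (1 - Y) * np.log(1 - p))

def resample_z_entry(i, k, Z, W, Y, rng):
    N = Z.shape[0]
    m_k = Z[:, k].sum() - Z[i, k]        # popularity of feature k among the other nodes
    if m_k == 0:
        return                           # singleton features are left to new-feature moves
    log_post = []
    for val in (0, 1):                   # score both settings of z_ik
        Z[i, k] = val
        prior = m_k / N if val == 1 else 1.0 - m_k / N
        log_post.append(np.log(prior) + log_lik(Z, W, Y))
    p1 = 1.0 / (1.0 + np.exp(log_post[0] - log_post[1]))   # normalize in log space
    Z[i, k] = int(rng.random() < p1)
</pre>
<br />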
2. Given ''Z'', resample ''W''<br />
<br />
They sequentially resample each of the weights in ''W'' that correspond to non-zero features and drop the ones corresponding to all-zero features. The difficulty is that there is no conjugate prior on ''W'', so directly resampling ''W'' from its posterior is infeasible; auxiliary sampling tricks and additional MCMC steps are used instead.<br />
<br />
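The paper relies on an auxiliary-variable scheme for this step; purely as a stand-in illustration of a non-conjugate update, the following sketch uses a plain random-walk Metropolis step for one weight (reusing the <code>log_lik</code> helper from the previous sketch), which is an assumption of this example rather than the authors' method.<br />
<pre>
# Stand-in illustration only: random-walk Metropolis update for one weight
# w_{k,k'} under its N(0, sigma_w^2) prior. The paper's own sampler uses an
# auxiliary-variable scheme instead. `log_lik` is defined in the previous sketch.
import numpy as np

def mh_update_weight(k, kp, Z, W, Y, rng, sigma_w=1.0, step=0.5):
    old = W[k, kp]
    old_lp = log_lik(Z, W, Y) - 0.5 * (old / sigma_w) ** 2    # log-likelihood + log prior
    W[k, kp] = old + step * rng.normal()                      # symmetric proposal
    new_lp = log_lik(Z, W, Y) - 0.5 * (W[k, kp] / sigma_w) ** 2
    if np.log(rng.random()) >= new_lp - old_lp:               # Metropolis accept/reject
        W[k, kp] = old                                        # rejected: restore old value
</pre>
<br />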
3. Other issues<br />
<br />
Conjugate priors may be placed on the hyperparameters as well. In the case of multiple relations, one can sample ''W<sub>i</sub>'' given ''Z'' independently for each ''i''. In the full model, the posterior updates for the coefficients and intercepts are independent.<br />
<br />
==Simulations and real data results==<br />
===Synthetic data===<br />
The basic model is applied to simple synthetic datasets generated from known features (shown in Figure 1(a), (c)), with ''W'' initialized randomly. The basic model is able to attain 100% accuracy on held-out data. However, the experiment also reveals that the model does not recover the true latent features: because of subtle interactions (confounding) between sets of features and weights, the inferred features will not in general correspond to interpretable features. It also indicates that there are local optima in the feature space, so a good initialization is necessary.<br />
<br />
[[File:NLFMfig1.png|700px|center]]<br />
<br />
===Multi-relational datasets===<br />
In this section, the NLFM is applied to several datasets from the Infinite Relational Model (IRM) paper <ref>Charles Kemp, Joshua B. Tenenbaum, Thomas L. Griffiths, Takeshi Yamada, and Naonori Ueda. Learning systems of concepts with an infinite relational model. In Proceedings of the American Association for Artificial Intelligence (AAAI), 2006.</ref>. One dataset contains 54 relations of 14 countries along with 90 given features of the countries. Another dataset contains 26 kinship relationships of 104 people in the Alyawarra tribe in Central Australia. The model is compared to two other class-based algorithms, the IRM and the MMSB (Mixed Membership Stochastic Blockmodel).<br />
<br />
For each dataset, 80% of the data are used as the training set and the AUC (area under the ROC curve) is reported for the held-out data (the remaining 20%). Note that the closer the AUC is to 1, the better. For the latent feature relational model, either a random feature matrix or the class-based features from the IRM are used as the initialization. The following table shows the results. It can be seen that the LFRM outperforms both the IRM and the MMSB.<br />
<br />
[[File:NLFMfig2.png|700px|center]]<br />
<br />
===Predicting NIPS coauthorship===<br />
LFRM is applied to the NIPS dataset, which contains a list of all papers and authors from NIPS 1-17. The 234 authors who published with the most other people are investigated. Again, 80% of the data is used as the training set and the remaining 20% as the test set. The figure below clearly shows that LFRM performs better than the IRM and the MMSB. The AUC values are LFRM w/IRM 0.9509 > LFRM rand 0.9466 > IRM 0.8906 > MMSB 0.8705.<br />
<br />
[[File:NLFMfig3.png|700px|center]]<br />
<br />
==Conclusion==<br />
In this paper, a nonparametric latent feature relational model is proposed for inferring latent binary features of relational entities and for link prediction. The model combines latent feature modeling of networks with Bayesian nonparametric inference. It can infer the dimension of the feature space simultaneously with the feature values of the entities. The model performs better than established class-based models, e.g. the IRM and the MMSB, largely because the NLFM representation is richer and more expressive.<br />
<br />
==Discussion==<br />
1. The model sets up a new framework for network modeling. <br />
<br />
2. It performs well in terms of estimating link probabilities and making predictions. <br />
<br />
3. However, the inference algorithm is quite complicated. <br />
<br />
4. The algorithm depends heavily on the initial values, which means one needs to run another algorithm, e.g. the IRM, to obtain a good initialization.<br />
<br />
5. The inferred latent features are not interpretable due to the confounding with the weight matrix.<br />
<br />
==References==<br />
<references/></div>Lxinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:NLFMfig3.png&diff=22872File:NLFMfig3.png2013-08-12T16:04:40Z<p>Lxin: </p>
<hr />
<div></div>Lxinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:NLFMfig2.png&diff=22868File:NLFMfig2.png2013-08-12T15:49:36Z<p>Lxin: </p>
<hr />
<div></div>Lxinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=nonparametric_Latent_Feature_Models_for_Link_Prediction&diff=22867nonparametric Latent Feature Models for Link Prediction2013-08-12T15:49:13Z<p>Lxin: /* Multi-relational datasets */</p>
<hr />
<div>==Introduction==<br />
The goal of this paper <ref>Kurt T. Miller, Thomas L. Griffiths, and Michael I. Jordan. Nonparametric latent feature models for link prediction. NIPS, 2009</ref>is link prediction for a partially observed network, i.e. we observe the links (1 or 0) between some pairs of the nodes in a network and we try to predict the unobserved links. Basically, it builds the model by extracting the latent structure that representing the properties of individual entities. Unlike the latent space model <ref>Peter D. Hoff, Adrian E. Raftery, and Mark S. Handcock. Latent space approaches to social network analysis. JASA, 97(460):1090-1098.</ref>, which tries to find the location (arbitrary real valued features) of each node in a latent space, the "latent feature" here mainly refers to a class-based (binary features) representation. They assume a finite number of classes that entities can belong to and the interactions between classes determine the structure of the network. More specifically, the probability of forming a link depends only on the classes of the corresponding pair of nodes. The idea is fairly similar to the stochastic blockmodel <ref>Krzysztof Nowicki and Tom A. B. Snijders. Estimation and prediction for stochastic blockstructures. JASA, 96(455):1077-1087, 2001. </ref> <ref>Edoardo M. Airoldi, David M. Blei, Eric P. Xing, and Stephen E. Fienberg. Mixed membership stochastic block models. In Advances in Neural Information Processing Systems. 2009.</ref>. However, the blockmodels are mainly for community detection/network clustering, but not link prediction. This paper fills in the gap.<br />
<br />
The ''nonparametric latent feature relational model'' is a Bayesian nonparametric model in which each node has binary-valued latent features that influences its relation to other nodes. Known covariates information can also be incorporated. The model can simultaneously infer the number (dimension) of latent features, the values of the features for each node and how the features influence the links.<br />
<br />
==The nonparametric latent feature relational model==<br />
Directed network is considered here. Let <math>Y</math> be the <math>N\times N</math> binary adjacency matrix of a network. The component <math>y_{ij}=1</math> if there is a link from node <math>i</math> to node <math>j</math> and <math>y_{ij}=0</math> if there is no link. The components corresponding to unobserved links are left unfilled. The goal is to learn from the observed links so that we can predict the unfilled entries.<br />
<br />
===Basic model===<br />
Let <math>Z</math> denote the latent features, where <math>Z</math> is a <math>N\times K</math> binary matrix. Each row of <math>Z</math> corresponds to a node and each column correspond to a latent feature such that <math>z_{ij}=1</math> if the <math>i^{th}</math> node has feature <math>k</math> and 0, otherwise. And let <math>Z_i</math> denote the <math>i^{th}</math> row of <math>Z</math> (the feature vector corresponding to node ''i''). Let ''W'' be a <math>K\times K</math> real-valued weight matrix where <math>w_{kk^\prime}</math> is the weight that affects the probability of a link when the corresponding nodes have features <math>k</math> and <math>k^'</math>, respectively. By assuming the link probabilities are conditional independent give the latent features and the weights, the likelihood function can be written as:<br />
<center><br />
<math><br />
Pr(Y|Z, W)=\prod_{i,j}Pr(y_{ij}|Z_i,Z_j, W):=\sigma(Z_i W Z_j^{\top})=\sigma(\sum_{k,k^'}z_{ik}z_{jk^'}w_{kk^'})<br />
</math><br />
</center><br />
where <math>\sigma(.)</math> is a function that transforms values from <math>(-\infty,\infty)</math> to <math>(0,1)</math>.<br />
<br />
Prior distributions are assumed for the latent features and the weight matrix, where ''Z'' is generated by Indian Buffet Process <ref>Thomas L. Griffiths and Zoubin Ghahramani. Infinite latent feature models and the Indian Buffet Process. In Advances in Neural Information Processing Systems, 2007</ref> and the component of ''W'' has normal prior, i.e.<br />
<center><br />
<math><br />
Z \sim IBP(\alpha)<br />
<br />
</math><br />
</center><br />
<center><br />
<math><br />
w_{kk^'}\sim \mathcal{N}(0, \sigma_w^2)<br />
</math><br />
</center><br />
<br />
===Full model===<br />
The covariates information of each node can be incorporate into the model. The full nonparametric latent feature relational model is <br />
<center><br />
<math><br />
Pr(y_{ij}=1|Z, W, X, \beta, a, b, c)=\sigma(Z_i W Z_j^\top +\beta^\top X_{ij}+(\beta_p^\top X_{p,i}+a_i)+(\beta_c^\top X_{c,i}+b_i)+c)<br />
</math><br />
</center><br />
where <math>X_{p,i},X_{c,j}</math> are known covariate vector when node ''i'' and ''j'' are link parent and child, respectively; <math>X_{ij}</math> is a vector of interaction effects; <math>\beta, \beta_p, \beta_c, a and b </math> are coefficients and offsets which all assumed to be normally distributed. We drop the corresponding terms if no information available.<br />
<br />
===Generalizations===<br />
The model can be easily generalized for multiple relations instead of a single relation. The latent features keep the same, but an independent weight matrix <math>W^i</math> is used for each relation <math>Y^i</math>. Covariates may be relation specific or common across all relations. By taking the weight matrix to be symmetric, the model can deal with undirected networks.<br />
<br />
==Inference==<br />
Exact inference for the proposed nonparametric latent feature model is infeasible. The authors adopt Markov Chain Monte Carlo (MCMC) for approximate inference (posterior inference on ''Z'' and ''W''). They alternatively sample from ''Z'' and ''W''. During the procedure, the all zero ''Z'' columns are dropped, since they do not provide any information.<br />
<br />
1. Given ''W'', resample ''Z''<br />
<br />
Since the IBP is exchangeable, so when sample the <math>i^{th}</math> row of ''Z'', they assume that the <math>i^{th}</math> customer is the last one in the process. Let <math>m_k</math> denote the number of non-zero entries in column ''k'', the component <math>z_{ik}</math> is sampled by<br />
<br />
<center><br />
<math><br />
Pr(z_{ik}=1|Z_{-ik},W,Y) \propto m_k Pr(Y|z_{ik}=1,Z_{-ik},W)<br />
</math><br />
</center><br />
Regarding the number of features, they use the fact that in the IBP, the prior distribution on the number of new features for the last customer is <math>Poisson(\alpha/N)</math>. They mentioned that the number of new features should be weighted by the corresponding likelihood term.<br />
<br />
2. Given ''Z'', reample ''W''<br />
<br />
They sequentially resample each of the weights in ''W'' that correspond to non-zero features and drop the ones corresponding to the all-zero features. The difficulty is that we do not have a conjugate prior on ''W'', so direct resampling ''W'' from its posterior is infeasible. Some auxiliary sampling trick and MCMC procedures are used.<br />
<br />
3. Other issues<br />
<br />
Conjugate priors may be placed on the hyperparameters as well. In the case of multiple relations, one can sample ''W<sub>i</sub>'' given ''Z'' independently for each ''i''. In the full model, the posterior updates for the coefficients and intercepts are independent.<br />
<br />
==Simulations and real data results==<br />
===Synthetic data===<br />
The basic model is applied to simple synthetic datasets generated from known features (shown in Figure 1(a), (c)). ''W'' is initialized randomly. The basic model is able to attain 100% accuracy on held-out data. However, it reveals the problem that the model is not able to address the latent features. This is due to subtle interactions (confounding) between sets of features and weights. So the feature inferred will not in general correspond to interpretable features. It also indicates that there are local optima in the feature space, which means a good initialization is necessary.<br />
<br />
[[File:NLFMfig1.png|700px|center]]<br />
<br />
===Multi-relational datasets===<br />
In this session, the NLFM is applied to several datasets from the Infinite Relational Model(IRM) paper <ref>Charles Kemp, Joshua B. Tenenbaum, Thomas L. Griffiths, Takeshi Yamada, and Naonori Ueda. Learning systems of concepts with an infinite relational model. In Proceedings of the American Association for Artificial Intelligence (AAAI), 2006.</ref>. One dataset contains 54 relations of 14 countries along with 90 given features of the countries. Another dataset contains 26 kinship relationships of 104 people in the Alyawarra tribe in Central Australia. The model is compared to two other class-based algorithms, the IRM and the MMSB (Mixed Membership Stochastic Blockmodel). <br />
<br />
For each dataset, 80% of the data are used as training set and the AUC (area under the ROC curve) is reported for the held-out data (the 20% left data). Note that the closer the AUC to 1, the better. For the latent feature relational model, either a random feature matrix or class-based features from the IRM is used as initializations. The following table shows the results.<br />
<br />
==Conclusion==<br />
<br />
<br />
==References==<br />
<references/></div>Lxinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=nonparametric_Latent_Feature_Models_for_Link_Prediction&diff=22866nonparametric Latent Feature Models for Link Prediction2013-08-12T15:36:50Z<p>Lxin: /* Multi-relational datasets */</p>
<hr />
<div>==Introduction==<br />
The goal of this paper <ref>Kurt T. Miller, Thomas L. Griffiths, and Michael I. Jordan. Nonparametric latent feature models for link prediction. NIPS, 2009</ref>is link prediction for a partially observed network, i.e. we observe the links (1 or 0) between some pairs of the nodes in a network and we try to predict the unobserved links. Basically, it builds the model by extracting the latent structure that representing the properties of individual entities. Unlike the latent space model <ref>Peter D. Hoff, Adrian E. Raftery, and Mark S. Handcock. Latent space approaches to social network analysis. JASA, 97(460):1090-1098.</ref>, which tries to find the location (arbitrary real valued features) of each node in a latent space, the "latent feature" here mainly refers to a class-based (binary features) representation. They assume a finite number of classes that entities can belong to and the interactions between classes determine the structure of the network. More specifically, the probability of forming a link depends only on the classes of the corresponding pair of nodes. The idea is fairly similar to the stochastic blockmodel <ref>Krzysztof Nowicki and Tom A. B. Snijders. Estimation and prediction for stochastic blockstructures. JASA, 96(455):1077-1087, 2001. </ref> <ref>Edoardo M. Airoldi, David M. Blei, Eric P. Xing, and Stephen E. Fienberg. Mixed membership stochastic block models. In Advances in Neural Information Processing Systems. 2009.</ref>. However, the blockmodels are mainly for community detection/network clustering, but not link prediction. This paper fills in the gap.<br />
<br />
The ''nonparametric latent feature relational model'' is a Bayesian nonparametric model in which each node has binary-valued latent features that influences its relation to other nodes. Known covariates information can also be incorporated. The model can simultaneously infer the number (dimension) of latent features, the values of the features for each node and how the features influence the links.<br />
<br />
==The nonparametric latent feature relational model==<br />
Directed network is considered here. Let <math>Y</math> be the <math>N\times N</math> binary adjacency matrix of a network. The component <math>y_{ij}=1</math> if there is a link from node <math>i</math> to node <math>j</math> and <math>y_{ij}=0</math> if there is no link. The components corresponding to unobserved links are left unfilled. The goal is to learn from the observed links so that we can predict the unfilled entries.<br />
<br />
===Basic model===<br />
Let <math>Z</math> denote the latent features, where <math>Z</math> is a <math>N\times K</math> binary matrix. Each row of <math>Z</math> corresponds to a node and each column correspond to a latent feature such that <math>z_{ij}=1</math> if the <math>i^{th}</math> node has feature <math>k</math> and 0, otherwise. And let <math>Z_i</math> denote the <math>i^{th}</math> row of <math>Z</math> (the feature vector corresponding to node ''i''). Let ''W'' be a <math>K\times K</math> real-valued weight matrix where <math>w_{kk^\prime}</math> is the weight that affects the probability of a link when the corresponding nodes have features <math>k</math> and <math>k^'</math>, respectively. By assuming the link probabilities are conditional independent give the latent features and the weights, the likelihood function can be written as:<br />
<center><br />
<math><br />
Pr(Y|Z, W)=\prod_{i,j}Pr(y_{ij}|Z_i,Z_j, W):=\sigma(Z_i W Z_j^{\top})=\sigma(\sum_{k,k^'}z_{ik}z_{jk^'}w_{kk^'})<br />
</math><br />
</center><br />
where <math>\sigma(.)</math> is a function that transforms values from <math>(-\infty,\infty)</math> to <math>(0,1)</math>.<br />
<br />
Prior distributions are assumed for the latent features and the weight matrix, where ''Z'' is generated by Indian Buffet Process <ref>Thomas L. Griffiths and Zoubin Ghahramani. Infinite latent feature models and the Indian Buffet Process. In Advances in Neural Information Processing Systems, 2007</ref> and the component of ''W'' has normal prior, i.e.<br />
<center><br />
<math><br />
Z \sim IBP(\alpha)<br />
<br />
</math><br />
</center><br />
<center><br />
<math><br />
w_{kk^'}\sim \mathcal{N}(0, \sigma_w^2)<br />
</math><br />
</center><br />
<br />
===Full model===<br />
The covariates information of each node can be incorporate into the model. The full nonparametric latent feature relational model is <br />
<center><br />
<math><br />
Pr(y_{ij}=1|Z, W, X, \beta, a, b, c)=\sigma(Z_i W Z_j^\top +\beta^\top X_{ij}+(\beta_p^\top X_{p,i}+a_i)+(\beta_c^\top X_{c,i}+b_i)+c)<br />
</math><br />
</center><br />
where <math>X_{p,i},X_{c,j}</math> are known covariate vector when node ''i'' and ''j'' are link parent and child, respectively; <math>X_{ij}</math> is a vector of interaction effects; <math>\beta, \beta_p, \beta_c, a and b </math> are coefficients and offsets which all assumed to be normally distributed. We drop the corresponding terms if no information available.<br />
<br />
===Generalizations===<br />
The model can be easily generalized for multiple relations instead of a single relation. The latent features keep the same, but an independent weight matrix <math>W^i</math> is used for each relation <math>Y^i</math>. Covariates may be relation specific or common across all relations. By taking the weight matrix to be symmetric, the model can deal with undirected networks.<br />
<br />
==Inference==<br />
Exact inference for the proposed nonparametric latent feature model is infeasible. The authors adopt Markov Chain Monte Carlo (MCMC) for approximate inference (posterior inference on ''Z'' and ''W''). They alternatively sample from ''Z'' and ''W''. During the procedure, the all zero ''Z'' columns are dropped, since they do not provide any information.<br />
<br />
1. Given ''W'', resample ''Z''<br />
<br />
Since the IBP is exchangeable, so when sample the <math>i^{th}</math> row of ''Z'', they assume that the <math>i^{th}</math> customer is the last one in the process. Let <math>m_k</math> denote the number of non-zero entries in column ''k'', the component <math>z_{ik}</math> is sampled by<br />
<br />
<center><br />
<math><br />
Pr(z_{ik}=1|Z_{-ik},W,Y) \propto m_k Pr(Y|z_{ik}=1,Z_{-ik},W)<br />
</math><br />
</center><br />
Regarding the number of features, they use the fact that in the IBP, the prior distribution on the number of new features for the last customer is <math>Poisson(\alpha/N)</math>. They mentioned that the number of new features should be weighted by the corresponding likelihood term.<br />
<br />
2. Given ''Z'', resample ''W''<br />
<br />
They sequentially resample each of the weights in ''W'' that correspond to non-zero features and drop the ones corresponding to all-zero features. The difficulty is that the prior on ''W'' is not conjugate to the likelihood, so directly resampling ''W'' from its posterior is infeasible; auxiliary sampling tricks and Metropolis-Hastings-style MCMC updates are used instead.<br />
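<br />
As a generic illustration of one non-conjugate option, the following random-walk Metropolis-Hastings update for a single weight <math>w_{kk'}</math> is only a sketch of the general idea; it is not the specific auxiliary-variable scheme used by the authors, and the sigmoid, step size, and function name are assumptions:<br />
<pre>
import numpy as np

def mh_update_w(Y, Z, W, k, kp, observed, rng, sigma_w=1.0, step=0.1):
    # Random-walk Metropolis-Hastings update for the single weight w_{kk'}.
    def log_post(w_val):
        W[k, kp] = w_val
        P = np.clip(1.0 / (1.0 + np.exp(-(Z @ W @ Z.T))), 1e-12, 1 - 1e-12)
        loglik = (Y * np.log(P) + (1 - Y) * np.log(1 - P))[observed].sum()
        log_prior = -0.5 * (w_val / sigma_w) ** 2   # N(0, sigma_w^2) up to a constant
        return loglik + log_prior

    current = W[k, kp]
    proposal = current + step * rng.standard_normal()
    log_accept = log_post(proposal) - log_post(current)
    W[k, kp] = proposal if np.log(rng.random()) < log_accept else current
    return W
</pre>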
<br />
3. Other issues<br />
<br />
Conjugate priors may be placed on the hyperparameters as well. In the case of multiple relations, one can sample <math>W^i</math> given ''Z'' independently for each ''i''. In the full model, the posterior updates for the coefficients and intercepts are independent.<br />
<br />
==Simulations and real data results==<br />
===Synthetic data===<br />
The basic model is applied to simple synthetic datasets generated from known features (shown in Figure 1(a), (c)). ''W'' is initialized randomly. The basic model is able to attain 100% accuracy on held-out data. However, the experiment also reveals that the model does not recover the true latent features: because of subtle interactions (confounding) between sets of features and weights, the inferred features will not in general correspond to interpretable features. It also indicates that there are local optima in the feature space, so a good initialization is necessary.<br />
<br />
[[File:NLFMfig1.png|700px|center]]<br />
<br />
===Multi-relational datasets===<br />
In this section, the nonparametric latent feature relational model is applied to several multi-relational datasets from the Infinite Relational Model (IRM) paper <ref>Charles Kemp, Joshua B. Tenenbaum, Thomas L. Griffiths, Takeshi Yamada, and Naonori Ueda. Learning systems of concepts with an infinite relational model. In Proceedings of the American Association for Artificial Intelligence (AAAI), 2006.</ref>.<br />
<br />
==Conclusion==<br />
<br />
<br />
==References==<br />
<references/></div>
<hr />
<div>==Introduction==<br />
The goal of this paper <ref>Kurt T. Miller, Thomas L. Griffiths, and Michael I. Jordan. Nonparametric latent feature models for link prediction. NIPS, 2009</ref>is link prediction for a partially observed network, i.e. we observe the links (1 or 0) between some pairs of the nodes in a network and we try to predict the unobserved links. Basically, it builds the model by extracting the latent structure that representing the properties of individual entities. Unlike the latent space model <ref>Peter D. Hoff, Adrian E. Raftery, and Mark S. Handcock. Latent space approaches to social network analysis. JASA, 97(460):1090-1098.</ref>, which tries to find the location (arbitrary real valued features) of each node in a latent space, the "latent feature" here mainly refers to a class-based (binary features) representation. They assume a finite number of classes that entities can belong to and the interactions between classes determine the structure of the network. More specifically, the probability of forming a link depends only on the classes of the corresponding pair of nodes. The idea is fairly similar to the stochastic blockmodel <ref>Krzysztof Nowicki and Tom A. B. Snijders. Estimation and prediction for stochastic blockstructures. JASA, 96(455):1077-1087, 2001. </ref> <ref>Edoardo M. Airoldi, David M. Blei, Eric P. Xing, and Stephen E. Fienberg. Mixed membership stochastic block models. In Advances in Neural Information Processing Systems. 2009.</ref>. However, the blockmodels are mainly for community detection/network clustering, but not link prediction. This paper fills in the gap.<br />
<br />
The ''nonparametric latent feature relational model'' is a Bayesian nonparametric model in which each node has binary-valued latent features that influences its relation to other nodes. Known covariates information can also be incorporated. The model can simultaneously infer the number (dimension) of latent features, the values of the features for each node and how the features influence the links.<br />
<br />
==The nonparametric latent feature relational model==<br />
Directed network is considered here. Let <math>Y</math> be the <math>N\times N</math> binary adjacency matrix of a network. The component <math>y_{ij}=1</math> if there is a link from node <math>i</math> to node <math>j</math> and <math>y_{ij}=0</math> if there is no link. The components corresponding to unobserved links are left unfilled. The goal is to learn from the observed links so that we can predict the unfilled entries.<br />
<br />
===Basic model===<br />
Let <math>Z</math> denote the latent features, where <math>Z</math> is a <math>N\times K</math> binary matrix. Each row of <math>Z</math> corresponds to a node and each column correspond to a latent feature such that <math>z_{ij}=1</math> if the <math>i^{th}</math> node has feature <math>k</math> and 0, otherwise. And let <math>Z_i</math> denote the <math>i^{th}</math> row of <math>Z</math> (the feature vector corresponding to node ''i''). Let ''W'' be a <math>K\times K</math> real-valued weight matrix where <math>w_{kk^\prime}</math> is the weight that affects the probability of a link when the corresponding nodes have features <math>k</math> and <math>k^'</math>, respectively. By assuming the link probabilities are conditional independent give the latent features and the weights, the likelihood function can be written as:<br />
<center><br />
<math><br />
Pr(Y|Z, W)=\prod_{i,j}Pr(y_{ij}|Z_i,Z_j, W):=\sigma(Z_i W Z_j^{\top})=\sigma(\sum_{k,k^'}z_{ik}z_{jk^'}w_{kk^'})<br />
</math><br />
</center><br />
where <math>\sigma(.)</math> is a function that transforms values from <math>(-\infty,\infty)</math> to <math>(0,1)</math>.<br />
<br />
Prior distributions are assumed for the latent features and the weight matrix, where ''Z'' is generated by Indian Buffet Process <ref>Thomas L. Griffiths and Zoubin Ghahramani. Infinite latent feature models and the Indian Buffet Process. In Advances in Neural Information Processing Systems, 2007</ref> and the component of ''W'' has normal prior, i.e.<br />
<center><br />
<math><br />
Z \sim IBP(\alpha)<br />
<br />
</math><br />
</center><br />
<center><br />
<math><br />
w_{kk^'}\sim \mathcal{N}(0, \sigma_w^2)<br />
</math><br />
</center><br />
<br />
===Full model===<br />
The covariates information of each node can be incorporate into the model. The full nonparametric latent feature relational model is <br />
<center><br />
<math><br />
Pr(y_{ij}=1|Z, W, X, \beta, a, b, c)=\sigma(Z_i W Z_j^\top +\beta^\top X_{ij}+(\beta_p^\top X_{p,i}+a_i)+(\beta_c^\top X_{c,i}+b_i)+c)<br />
</math><br />
</center><br />
where <math>X_{p,i},X_{c,j}</math> are known covariate vector when node ''i'' and ''j'' are link parent and child, respectively; <math>X_{ij}</math> is a vector of interaction effects; <math>\beta, \beta_p, \beta_c, a and b </math> are coefficients and offsets which all assumed to be normally distributed. We drop the corresponding terms if no information available.<br />
<br />
===Generalizations===<br />
The model can be easily generalized for multiple relations instead of a single relation. The latent features keep the same, but an independent weight matrix <math>W^i</math> is used for each relation <math>Y^i</math>. Covariates may be relation specific or common across all relations. By taking the weight matrix to be symmetric, the model can deal with undirected networks.<br />
<br />
==Inference==<br />
Exact inference for the proposed nonparametric latent feature model is infeasible. The authors adopt Markov Chain Monte Carlo (MCMC) for approximate inference (posterior inference on ''Z'' and ''W''). They alternatively sample from ''Z'' and ''W''. During the procedure, the all zero ''Z'' columns are dropped, since they do not provide any information.<br />
<br />
1. Given ''W'', resample ''Z''<br />
<br />
Since the IBP is exchangeable, so when sample the <math>i^{th}</math> row of ''Z'', they assume that the <math>i^{th}</math> customer is the last one in the process. Let <math>m_k</math> denote the number of non-zero entries in column ''k'', the component <math>z_{ik}</math> is sampled by<br />
<br />
<center><br />
<math><br />
Pr(z_{ik}=1|Z_{-ik},W,Y) \propto m_k Pr(Y|z_{ik}=1,Z_{-ik},W)<br />
</math><br />
</center><br />
Regarding the number of features, they use the fact that in the IBP, the prior distribution on the number of new features for the last customer is <math>Poisson(\alpha/N)</math>. They mentioned that the number of new features should be weighted by the corresponding likelihood term.<br />
<br />
2. Given ''Z'', reample ''W''<br />
<br />
They sequentially resample each of the weights in ''W'' that correspond to non-zero features and drop the ones corresponding to the all-zero features. The difficulty is that we do not have a conjugate prior on ''W'', so direct resampling ''W'' from its posterior is infeasible. Some auxiliary sampling trick and MCMC procedures are used.<br />
<br />
3. Other issues<br />
<br />
Conjugate priors may be placed on the hyperparameters as well. In the case of multiple relations, one can sample ''W<sub>i</sub>'' given ''Z'' independently for each ''i''. In the full model, the posterior updates for the coefficients and intercepts are independent.<br />
<br />
==Simulations and real data results==<br />
===Synthetic data===<br />
The basic model is applied to simple synthetic datasets generated from known features (shown in Figure 1(a), (c)). ''W'' is initialized randomly. The basic model is able to attain 100% accuracy on held-out data. However, it reveals the problem that the model is not able to address the latent features. This is due to subtle interactions (confounding) between sets of features and weights. So the feature inferred will not in general correspond to interpretable features. It also indicates that there are local optima in the feature space, which means a good initialization is necessary.<br />
<br />
[[File:NLFMfig1.png|700px|center]]<br />
<br />
===Multi-relational datasets===<br />
<br />
==Conclusion==<br />
<br />
<br />
==References==<br />
<references/></div>Lxinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=nonparametric_Latent_Feature_Models_for_Link_Prediction&diff=22864nonparametric Latent Feature Models for Link Prediction2013-08-12T15:31:27Z<p>Lxin: /* Synthetic data */</p>
<hr />
<div>==Introduction==<br />
The goal of this paper <ref>Kurt T. Miller, Thomas L. Griffiths, and Michael I. Jordan. Nonparametric latent feature models for link prediction. NIPS, 2009</ref>is link prediction for a partially observed network, i.e. we observe the links (1 or 0) between some pairs of the nodes in a network and we try to predict the unobserved links. Basically, it builds the model by extracting the latent structure that representing the properties of individual entities. Unlike the latent space model <ref>Peter D. Hoff, Adrian E. Raftery, and Mark S. Handcock. Latent space approaches to social network analysis. JASA, 97(460):1090-1098.</ref>, which tries to find the location (arbitrary real valued features) of each node in a latent space, the "latent feature" here mainly refers to a class-based (binary features) representation. They assume a finite number of classes that entities can belong to and the interactions between classes determine the structure of the network. More specifically, the probability of forming a link depends only on the classes of the corresponding pair of nodes. The idea is fairly similar to the stochastic blockmodel <ref>Krzysztof Nowicki and Tom A. B. Snijders. Estimation and prediction for stochastic blockstructures. JASA, 96(455):1077-1087, 2001. </ref> <ref>Edoardo M. Airoldi, David M. Blei, Eric P. Xing, and Stephen E. Fienberg. Mixed membership stochastic block models. In Advances in Neural Information Processing Systems. 2009.</ref>. However, the blockmodels are mainly for community detection/network clustering, but not link prediction. This paper fills in the gap.<br />
<br />
The ''nonparametric latent feature relational model'' is a Bayesian nonparametric model in which each node has binary-valued latent features that influences its relation to other nodes. Known covariates information can also be incorporated. The model can simultaneously infer the number (dimension) of latent features, the values of the features for each node and how the features influence the links.<br />
<br />
==The nonparametric latent feature relational model==<br />
Directed network is considered here. Let <math>Y</math> be the <math>N\times N</math> binary adjacency matrix of a network. The component <math>y_{ij}=1</math> if there is a link from node <math>i</math> to node <math>j</math> and <math>y_{ij}=0</math> if there is no link. The components corresponding to unobserved links are left unfilled. The goal is to learn from the observed links so that we can predict the unfilled entries.<br />
<br />
===Basic model===<br />
Let <math>Z</math> denote the latent features, where <math>Z</math> is a <math>N\times K</math> binary matrix. Each row of <math>Z</math> corresponds to a node and each column correspond to a latent feature such that <math>z_{ij}=1</math> if the <math>i^{th}</math> node has feature <math>k</math> and 0, otherwise. And let <math>Z_i</math> denote the <math>i^{th}</math> row of <math>Z</math> (the feature vector corresponding to node ''i''). Let ''W'' be a <math>K\times K</math> real-valued weight matrix where <math>w_{kk^\prime}</math> is the weight that affects the probability of a link when the corresponding nodes have features <math>k</math> and <math>k^'</math>, respectively. By assuming the link probabilities are conditional independent give the latent features and the weights, the likelihood function can be written as:<br />
<center><br />
<math><br />
Pr(Y|Z, W)=\prod_{i,j}Pr(y_{ij}|Z_i,Z_j, W):=\sigma(Z_i W Z_j^{\top})=\sigma(\sum_{k,k^'}z_{ik}z_{jk^'}w_{kk^'})<br />
</math><br />
</center><br />
where <math>\sigma(.)</math> is a function that transforms values from <math>(-\infty,\infty)</math> to <math>(0,1)</math>.<br />
<br />
Prior distributions are assumed for the latent features and the weight matrix, where ''Z'' is generated by Indian Buffet Process <ref>Thomas L. Griffiths and Zoubin Ghahramani. Infinite latent feature models and the Indian Buffet Process. In Advances in Neural Information Processing Systems, 2007</ref> and the component of ''W'' has normal prior, i.e.<br />
<center><br />
<math><br />
Z \sim IBP(\alpha)<br />
<br />
</math><br />
</center><br />
<center><br />
<math><br />
w_{kk^'}\sim \mathcal{N}(0, \sigma_w^2)<br />
</math><br />
</center><br />
<br />
===Full model===<br />
The covariates information of each node can be incorporate into the model. The full nonparametric latent feature relational model is <br />
<center><br />
<math><br />
Pr(y_{ij}=1|Z, W, X, \beta, a, b, c)=\sigma(Z_i W Z_j^\top +\beta^\top X_{ij}+(\beta_p^\top X_{p,i}+a_i)+(\beta_c^\top X_{c,i}+b_i)+c)<br />
</math><br />
</center><br />
where <math>X_{p,i},X_{c,j}</math> are known covariate vector when node ''i'' and ''j'' are link parent and child, respectively; <math>X_{ij}</math> is a vector of interaction effects; <math>\beta, \beta_p, \beta_c, a and b </math> are coefficients and offsets which all assumed to be normally distributed. We drop the corresponding terms if no information available.<br />
<br />
===Generalizations===<br />
The model can be easily generalized for multiple relations instead of a single relation. The latent features keep the same, but an independent weight matrix <math>W^i</math> is used for each relation <math>Y^i</math>. Covariates may be relation specific or common across all relations. By taking the weight matrix to be symmetric, the model can deal with undirected networks.<br />
<br />
==Inference==<br />
Exact inference for the proposed nonparametric latent feature model is infeasible. The authors adopt Markov Chain Monte Carlo (MCMC) for approximate inference (posterior inference on ''Z'' and ''W''). They alternatively sample from ''Z'' and ''W''. During the procedure, the all zero ''Z'' columns are dropped, since they do not provide any information.<br />
<br />
1. Given ''W'', resample ''Z''<br />
<br />
Since the IBP is exchangeable, so when sample the <math>i^{th}</math> row of ''Z'', they assume that the <math>i^{th}</math> customer is the last one in the process. Let <math>m_k</math> denote the number of non-zero entries in column ''k'', the component <math>z_{ik}</math> is sampled by<br />
<br />
<center><br />
<math><br />
Pr(z_{ik}=1|Z_{-ik},W,Y) \propto m_k Pr(Y|z_{ik}=1,Z_{-ik},W)<br />
</math><br />
</center><br />
Regarding the number of features, they use the fact that in the IBP, the prior distribution on the number of new features for the last customer is <math>Poisson(\alpha/N)</math>. They mentioned that the number of new features should be weighted by the corresponding likelihood term.<br />
<br />
2. Given ''Z'', reample ''W''<br />
<br />
They sequentially resample each of the weights in ''W'' that correspond to non-zero features and drop the ones corresponding to the all-zero features. The difficulty is that we do not have a conjugate prior on ''W'', so direct resampling ''W'' from its posterior is infeasible. Some auxiliary sampling trick and MCMC procedures are used.<br />
<br />
3. Other issues<br />
<br />
Conjugate priors may be placed on the hyperparameters as well. In the case of multiple relations, one can sample ''W<sub>i</sub>'' given ''Z'' independently for each ''i''. In the full model, the posterior updates for the coefficients and intercepts are independent.<br />
<br />
==Simulations and real data results==<br />
===Synthetic data===<br />
The basic model is applied to simple synthetic datasets generated from known features (shown in Figure 1(a), (c)). ''W'' is initialized randomly. The basic model is able to attain 100% accuracy on held-out data. However, it reveal the problem that the model is not able to address the latent features. This is due to subtle interactions (confounding) between sets of features and weights. So the feature inferred will not in general correspond to interpretable features. It also indicates that there are local optima in the feature space, which means a good initialization is necessary.<br />
<br />
[[File:NLFMfig1.png|700px|center]]<br />
<br />
===Multi-relational datasets===<br />
<br />
==Conclusion==<br />
<br />
<br />
==References==<br />
<references/></div>Lxinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=nonparametric_Latent_Feature_Models_for_Link_Prediction&diff=22863nonparametric Latent Feature Models for Link Prediction2013-08-12T15:27:36Z<p>Lxin: /* Synthetic data */</p>
<hr />
<div>==Introduction==<br />
The goal of this paper <ref>Kurt T. Miller, Thomas L. Griffiths, and Michael I. Jordan. Nonparametric latent feature models for link prediction. NIPS, 2009</ref>is link prediction for a partially observed network, i.e. we observe the links (1 or 0) between some pairs of the nodes in a network and we try to predict the unobserved links. Basically, it builds the model by extracting the latent structure that representing the properties of individual entities. Unlike the latent space model <ref>Peter D. Hoff, Adrian E. Raftery, and Mark S. Handcock. Latent space approaches to social network analysis. JASA, 97(460):1090-1098.</ref>, which tries to find the location (arbitrary real valued features) of each node in a latent space, the "latent feature" here mainly refers to a class-based (binary features) representation. They assume a finite number of classes that entities can belong to and the interactions between classes determine the structure of the network. More specifically, the probability of forming a link depends only on the classes of the corresponding pair of nodes. The idea is fairly similar to the stochastic blockmodel <ref>Krzysztof Nowicki and Tom A. B. Snijders. Estimation and prediction for stochastic blockstructures. JASA, 96(455):1077-1087, 2001. </ref> <ref>Edoardo M. Airoldi, David M. Blei, Eric P. Xing, and Stephen E. Fienberg. Mixed membership stochastic block models. In Advances in Neural Information Processing Systems. 2009.</ref>. However, the blockmodels are mainly for community detection/network clustering, but not link prediction. This paper fills in the gap.<br />
<br />
The ''nonparametric latent feature relational model'' is a Bayesian nonparametric model in which each node has binary-valued latent features that influences its relation to other nodes. Known covariates information can also be incorporated. The model can simultaneously infer the number (dimension) of latent features, the values of the features for each node and how the features influence the links.<br />
<br />
==The nonparametric latent feature relational model==<br />
Directed network is considered here. Let <math>Y</math> be the <math>N\times N</math> binary adjacency matrix of a network. The component <math>y_{ij}=1</math> if there is a link from node <math>i</math> to node <math>j</math> and <math>y_{ij}=0</math> if there is no link. The components corresponding to unobserved links are left unfilled. The goal is to learn from the observed links so that we can predict the unfilled entries.<br />
<br />
===Basic model===<br />
Let <math>Z</math> denote the latent features, where <math>Z</math> is a <math>N\times K</math> binary matrix. Each row of <math>Z</math> corresponds to a node and each column correspond to a latent feature such that <math>z_{ij}=1</math> if the <math>i^{th}</math> node has feature <math>k</math> and 0, otherwise. And let <math>Z_i</math> denote the <math>i^{th}</math> row of <math>Z</math> (the feature vector corresponding to node ''i''). Let ''W'' be a <math>K\times K</math> real-valued weight matrix where <math>w_{kk^\prime}</math> is the weight that affects the probability of a link when the corresponding nodes have features <math>k</math> and <math>k^'</math>, respectively. By assuming the link probabilities are conditional independent give the latent features and the weights, the likelihood function can be written as:<br />
<center><br />
<math><br />
Pr(Y|Z, W)=\prod_{i,j}Pr(y_{ij}|Z_i,Z_j, W):=\sigma(Z_i W Z_j^{\top})=\sigma(\sum_{k,k^'}z_{ik}z_{jk^'}w_{kk^'})<br />
</math><br />
</center><br />
where <math>\sigma(.)</math> is a function that transforms values from <math>(-\infty,\infty)</math> to <math>(0,1)</math>.<br />
<br />
Prior distributions are assumed for the latent features and the weight matrix, where ''Z'' is generated by Indian Buffet Process <ref>Thomas L. Griffiths and Zoubin Ghahramani. Infinite latent feature models and the Indian Buffet Process. In Advances in Neural Information Processing Systems, 2007</ref> and the component of ''W'' has normal prior, i.e.<br />
<center><br />
<math><br />
Z \sim IBP(\alpha)<br />
<br />
</math><br />
</center><br />
<center><br />
<math><br />
w_{kk^'}\sim \mathcal{N}(0, \sigma_w^2)<br />
</math><br />
</center><br />
<br />
===Full model===<br />
The covariates information of each node can be incorporate into the model. The full nonparametric latent feature relational model is <br />
<center><br />
<math><br />
Pr(y_{ij}=1|Z, W, X, \beta, a, b, c)=\sigma(Z_i W Z_j^\top +\beta^\top X_{ij}+(\beta_p^\top X_{p,i}+a_i)+(\beta_c^\top X_{c,i}+b_i)+c)<br />
</math><br />
</center><br />
where <math>X_{p,i},X_{c,j}</math> are known covariate vector when node ''i'' and ''j'' are link parent and child, respectively; <math>X_{ij}</math> is a vector of interaction effects; <math>\beta, \beta_p, \beta_c, a and b </math> are coefficients and offsets which all assumed to be normally distributed. We drop the corresponding terms if no information available.<br />
<br />
===Generalizations===<br />
The model can be easily generalized for multiple relations instead of a single relation. The latent features keep the same, but an independent weight matrix <math>W^i</math> is used for each relation <math>Y^i</math>. Covariates may be relation specific or common across all relations. By taking the weight matrix to be symmetric, the model can deal with undirected networks.<br />
<br />
==Inference==<br />
Exact inference for the proposed nonparametric latent feature model is infeasible. The authors adopt Markov Chain Monte Carlo (MCMC) for approximate inference (posterior inference on ''Z'' and ''W''). They alternatively sample from ''Z'' and ''W''. During the procedure, the all zero ''Z'' columns are dropped, since they do not provide any information.<br />
<br />
1. Given ''W'', resample ''Z''<br />
<br />
Since the IBP is exchangeable, so when sample the <math>i^{th}</math> row of ''Z'', they assume that the <math>i^{th}</math> customer is the last one in the process. Let <math>m_k</math> denote the number of non-zero entries in column ''k'', the component <math>z_{ik}</math> is sampled by<br />
<br />
<center><br />
<math><br />
Pr(z_{ik}=1|Z_{-ik},W,Y) \propto m_k Pr(Y|z_{ik}=1,Z_{-ik},W)<br />
</math><br />
</center><br />
Regarding the number of features, they use the fact that in the IBP, the prior distribution on the number of new features for the last customer is <math>Poisson(\alpha/N)</math>. They mentioned that the number of new features should be weighted by the corresponding likelihood term.<br />
<br />
2. Given ''Z'', reample ''W''<br />
<br />
They sequentially resample each of the weights in ''W'' that correspond to non-zero features and drop the ones corresponding to the all-zero features. The difficulty is that we do not have a conjugate prior on ''W'', so direct resampling ''W'' from its posterior is infeasible. Some auxiliary sampling trick and MCMC procedures are used.<br />
<br />
3. Other issues<br />
<br />
Conjugate priors may be placed on the hyperparameters as well. In the case of multiple relations, one can sample ''W<sub>i</sub>'' given ''Z'' independently for each ''i''. In the full model, the posterior updates for the coefficients and intercepts are independent.<br />
<br />
==Simulations and real data results==<br />
===Synthetic data===<br />
The basic model is applied to simple synthetic datasets generated from known features. The basic model is able to attain 100% accuracy on held-out data. However, it reveal the problem that the model is not able to address the latent features. This is due to subtle interactions (confounding) between sets of features and weights. So the feature inferred will not in general correspond to interpretable features. It also indicates that there are local optima in the feature space, which means a good initialization is necessary.<br />
<br />
[[File:NLFMfig1.png]]<br />
<br />
==Conclusion==<br />
<br />
<br />
==References==<br />
<references/></div>Lxinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:NLFMfig1.png&diff=22862File:NLFMfig1.png2013-08-12T15:26:29Z<p>Lxin: uploaded a new version of &quot;File:NLFMfig1.png&quot;</p>
<hr />
<div></div>Lxinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=nonparametric_Latent_Feature_Models_for_Link_Prediction&diff=22861nonparametric Latent Feature Models for Link Prediction2013-08-12T15:23:13Z<p>Lxin: /* Synthetic data */</p>
<hr />
<div>==Introduction==<br />
The goal of this paper <ref>Kurt T. Miller, Thomas L. Griffiths, and Michael I. Jordan. Nonparametric latent feature models for link prediction. NIPS, 2009</ref>is link prediction for a partially observed network, i.e. we observe the links (1 or 0) between some pairs of the nodes in a network and we try to predict the unobserved links. Basically, it builds the model by extracting the latent structure that representing the properties of individual entities. Unlike the latent space model <ref>Peter D. Hoff, Adrian E. Raftery, and Mark S. Handcock. Latent space approaches to social network analysis. JASA, 97(460):1090-1098.</ref>, which tries to find the location (arbitrary real valued features) of each node in a latent space, the "latent feature" here mainly refers to a class-based (binary features) representation. They assume a finite number of classes that entities can belong to and the interactions between classes determine the structure of the network. More specifically, the probability of forming a link depends only on the classes of the corresponding pair of nodes. The idea is fairly similar to the stochastic blockmodel <ref>Krzysztof Nowicki and Tom A. B. Snijders. Estimation and prediction for stochastic blockstructures. JASA, 96(455):1077-1087, 2001. </ref> <ref>Edoardo M. Airoldi, David M. Blei, Eric P. Xing, and Stephen E. Fienberg. Mixed membership stochastic block models. In Advances in Neural Information Processing Systems. 2009.</ref>. However, the blockmodels are mainly for community detection/network clustering, but not link prediction. This paper fills in the gap.<br />
<br />
The ''nonparametric latent feature relational model'' is a Bayesian nonparametric model in which each node has binary-valued latent features that influences its relation to other nodes. Known covariates information can also be incorporated. The model can simultaneously infer the number (dimension) of latent features, the values of the features for each node and how the features influence the links.<br />
<br />
==The nonparametric latent feature relational model==<br />
Directed network is considered here. Let <math>Y</math> be the <math>N\times N</math> binary adjacency matrix of a network. The component <math>y_{ij}=1</math> if there is a link from node <math>i</math> to node <math>j</math> and <math>y_{ij}=0</math> if there is no link. The components corresponding to unobserved links are left unfilled. The goal is to learn from the observed links so that we can predict the unfilled entries.<br />
<br />
===Basic model===<br />
Let <math>Z</math> denote the latent features, where <math>Z</math> is a <math>N\times K</math> binary matrix. Each row of <math>Z</math> corresponds to a node and each column correspond to a latent feature such that <math>z_{ij}=1</math> if the <math>i^{th}</math> node has feature <math>k</math> and 0, otherwise. And let <math>Z_i</math> denote the <math>i^{th}</math> row of <math>Z</math> (the feature vector corresponding to node ''i''). Let ''W'' be a <math>K\times K</math> real-valued weight matrix where <math>w_{kk^\prime}</math> is the weight that affects the probability of a link when the corresponding nodes have features <math>k</math> and <math>k^'</math>, respectively. By assuming the link probabilities are conditional independent give the latent features and the weights, the likelihood function can be written as:<br />
<center><br />
<math><br />
Pr(Y|Z, W)=\prod_{i,j}Pr(y_{ij}|Z_i,Z_j, W):=\sigma(Z_i W Z_j^{\top})=\sigma(\sum_{k,k^'}z_{ik}z_{jk^'}w_{kk^'})<br />
</math><br />
</center><br />
where <math>\sigma(.)</math> is a function that transforms values from <math>(-\infty,\infty)</math> to <math>(0,1)</math>.<br />
<br />
Prior distributions are assumed for the latent features and the weight matrix, where ''Z'' is generated by Indian Buffet Process <ref>Thomas L. Griffiths and Zoubin Ghahramani. Infinite latent feature models and the Indian Buffet Process. In Advances in Neural Information Processing Systems, 2007</ref> and the component of ''W'' has normal prior, i.e.<br />
<center><br />
<math><br />
Z \sim IBP(\alpha)<br />
<br />
</math><br />
</center><br />
<center><br />
<math><br />
w_{kk^'}\sim \mathcal{N}(0, \sigma_w^2)<br />
</math><br />
</center><br />
<br />
===Full model===<br />
The covariates information of each node can be incorporate into the model. The full nonparametric latent feature relational model is <br />
<center><br />
<math><br />
Pr(y_{ij}=1|Z, W, X, \beta, a, b, c)=\sigma(Z_i W Z_j^\top +\beta^\top X_{ij}+(\beta_p^\top X_{p,i}+a_i)+(\beta_c^\top X_{c,i}+b_i)+c)<br />
</math><br />
</center><br />
where <math>X_{p,i},X_{c,j}</math> are known covariate vector when node ''i'' and ''j'' are link parent and child, respectively; <math>X_{ij}</math> is a vector of interaction effects; <math>\beta, \beta_p, \beta_c, a and b </math> are coefficients and offsets which all assumed to be normally distributed. We drop the corresponding terms if no information available.<br />
<br />
===Generalizations===<br />
The model can be easily generalized for multiple relations instead of a single relation. The latent features keep the same, but an independent weight matrix <math>W^i</math> is used for each relation <math>Y^i</math>. Covariates may be relation specific or common across all relations. By taking the weight matrix to be symmetric, the model can deal with undirected networks.<br />
<br />
==Inference==<br />
Exact inference for the proposed nonparametric latent feature model is infeasible. The authors adopt Markov Chain Monte Carlo (MCMC) for approximate inference (posterior inference on ''Z'' and ''W''). They alternatively sample from ''Z'' and ''W''. During the procedure, the all zero ''Z'' columns are dropped, since they do not provide any information.<br />
<br />
1. Given ''W'', resample ''Z''<br />
<br />
Since the IBP is exchangeable, so when sample the <math>i^{th}</math> row of ''Z'', they assume that the <math>i^{th}</math> customer is the last one in the process. Let <math>m_k</math> denote the number of non-zero entries in column ''k'', the component <math>z_{ik}</math> is sampled by<br />
<br />
<center><br />
<math><br />
Pr(z_{ik}=1|Z_{-ik},W,Y) \propto m_k Pr(Y|z_{ik}=1,Z_{-ik},W)<br />
</math><br />
</center><br />
Regarding the number of features, they use the fact that in the IBP, the prior distribution on the number of new features for the last customer is <math>Poisson(\alpha/N)</math>. They mentioned that the number of new features should be weighted by the corresponding likelihood term.<br />
<br />
2. Given ''Z'', reample ''W''<br />
<br />
They sequentially resample each of the weights in ''W'' that correspond to non-zero features and drop the ones corresponding to the all-zero features. The difficulty is that we do not have a conjugate prior on ''W'', so direct resampling ''W'' from its posterior is infeasible. Some auxiliary sampling trick and MCMC procedures are used.<br />
<br />
3. Other issues<br />
<br />
Conjugate priors may be placed on the hyperparameters as well. In the case of multiple relations, one can sample ''W<sub>i</sub>'' given ''Z'' independently for each ''i''. In the full model, the posterior updates for the coefficients and intercepts are independent.<br />
<br />
==Simulations and real data results==<br />
===Synthetic data===<br />
The basic model is applied to simple synthetic datasets generated from known features. The basic model is able to attain 100% accuracy on held-out data. However, it reveal the problem that the model is not able to address the latent features. This is due to subtle interactions (confounding) between sets of features and weights. So the feature inferred will not in general correspond to interpretable features. It also indicates that there are local optima in the feature space, which means a good initialization is necessary.<br />
[[File:NLFMfig1.png]]<br />
<br />
==Conclusion==<br />
<br />
<br />
==References==<br />
<references/></div>Lxinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:NLFMfig1.png&diff=22860File:NLFMfig1.png2013-08-12T15:21:02Z<p>Lxin: </p>
<hr />
<div></div>Lxinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=nonparametric_Latent_Feature_Models_for_Link_Prediction&diff=22859nonparametric Latent Feature Models for Link Prediction2013-08-12T15:18:56Z<p>Lxin: /* Simulations and real data results */</p>
<hr />
<div>==Introduction==<br />
The goal of this paper <ref>Kurt T. Miller, Thomas L. Griffiths, and Michael I. Jordan. Nonparametric latent feature models for link prediction. NIPS, 2009</ref>is link prediction for a partially observed network, i.e. we observe the links (1 or 0) between some pairs of the nodes in a network and we try to predict the unobserved links. Basically, it builds the model by extracting the latent structure that representing the properties of individual entities. Unlike the latent space model <ref>Peter D. Hoff, Adrian E. Raftery, and Mark S. Handcock. Latent space approaches to social network analysis. JASA, 97(460):1090-1098.</ref>, which tries to find the location (arbitrary real valued features) of each node in a latent space, the "latent feature" here mainly refers to a class-based (binary features) representation. They assume a finite number of classes that entities can belong to and the interactions between classes determine the structure of the network. More specifically, the probability of forming a link depends only on the classes of the corresponding pair of nodes. The idea is fairly similar to the stochastic blockmodel <ref>Krzysztof Nowicki and Tom A. B. Snijders. Estimation and prediction for stochastic blockstructures. JASA, 96(455):1077-1087, 2001. </ref> <ref>Edoardo M. Airoldi, David M. Blei, Eric P. Xing, and Stephen E. Fienberg. Mixed membership stochastic block models. In Advances in Neural Information Processing Systems. 2009.</ref>. However, the blockmodels are mainly for community detection/network clustering, but not link prediction. This paper fills in the gap.<br />
<br />
The ''nonparametric latent feature relational model'' is a Bayesian nonparametric model in which each node has binary-valued latent features that influences its relation to other nodes. Known covariates information can also be incorporated. The model can simultaneously infer the number (dimension) of latent features, the values of the features for each node and how the features influence the links.<br />
<br />
==The nonparametric latent feature relational model==<br />
Directed network is considered here. Let <math>Y</math> be the <math>N\times N</math> binary adjacency matrix of a network. The component <math>y_{ij}=1</math> if there is a link from node <math>i</math> to node <math>j</math> and <math>y_{ij}=0</math> if there is no link. The components corresponding to unobserved links are left unfilled. The goal is to learn from the observed links so that we can predict the unfilled entries.<br />
<br />
===Basic model===<br />
Let <math>Z</math> denote the latent features, where <math>Z</math> is a <math>N\times K</math> binary matrix. Each row of <math>Z</math> corresponds to a node and each column correspond to a latent feature such that <math>z_{ij}=1</math> if the <math>i^{th}</math> node has feature <math>k</math> and 0, otherwise. And let <math>Z_i</math> denote the <math>i^{th}</math> row of <math>Z</math> (the feature vector corresponding to node ''i''). Let ''W'' be a <math>K\times K</math> real-valued weight matrix where <math>w_{kk^\prime}</math> is the weight that affects the probability of a link when the corresponding nodes have features <math>k</math> and <math>k^'</math>, respectively. By assuming the link probabilities are conditional independent give the latent features and the weights, the likelihood function can be written as:<br />
<center><br />
<math><br />
Pr(Y|Z, W)=\prod_{i,j}Pr(y_{ij}|Z_i,Z_j, W):=\sigma(Z_i W Z_j^{\top})=\sigma(\sum_{k,k^'}z_{ik}z_{jk^'}w_{kk^'})<br />
</math><br />
</center><br />
where <math>\sigma(.)</math> is a function that transforms values from <math>(-\infty,\infty)</math> to <math>(0,1)</math>.<br />
<br />
Prior distributions are assumed for the latent features and the weight matrix, where ''Z'' is generated by Indian Buffet Process <ref>Thomas L. Griffiths and Zoubin Ghahramani. Infinite latent feature models and the Indian Buffet Process. In Advances in Neural Information Processing Systems, 2007</ref> and the component of ''W'' has normal prior, i.e.<br />
<center><br />
<math><br />
Z \sim IBP(\alpha)<br />
<br />
</math><br />
</center><br />
<center><br />
<math><br />
w_{kk^'}\sim \mathcal{N}(0, \sigma_w^2)<br />
</math><br />
</center><br />
<br />
===Full model===<br />
The covariates information of each node can be incorporate into the model. The full nonparametric latent feature relational model is <br />
<center><br />
<math><br />
Pr(y_{ij}=1|Z, W, X, \beta, a, b, c)=\sigma(Z_i W Z_j^\top +\beta^\top X_{ij}+(\beta_p^\top X_{p,i}+a_i)+(\beta_c^\top X_{c,i}+b_i)+c)<br />
</math><br />
</center><br />
where <math>X_{p,i},X_{c,j}</math> are known covariate vector when node ''i'' and ''j'' are link parent and child, respectively; <math>X_{ij}</math> is a vector of interaction effects; <math>\beta, \beta_p, \beta_c, a and b </math> are coefficients and offsets which all assumed to be normally distributed. We drop the corresponding terms if no information available.<br />
<br />
===Generalizations===<br />
The model can be easily generalized for multiple relations instead of a single relation. The latent features keep the same, but an independent weight matrix <math>W^i</math> is used for each relation <math>Y^i</math>. Covariates may be relation specific or common across all relations. By taking the weight matrix to be symmetric, the model can deal with undirected networks.<br />
<br />
==Inference==<br />
Exact inference for the proposed nonparametric latent feature model is infeasible. The authors adopt Markov Chain Monte Carlo (MCMC) for approximate inference (posterior inference on ''Z'' and ''W''). They alternatively sample from ''Z'' and ''W''. During the procedure, the all zero ''Z'' columns are dropped, since they do not provide any information.<br />
<br />
1. Given ''W'', resample ''Z''<br />
<br />
Since the IBP is exchangeable, so when sample the <math>i^{th}</math> row of ''Z'', they assume that the <math>i^{th}</math> customer is the last one in the process. Let <math>m_k</math> denote the number of non-zero entries in column ''k'', the component <math>z_{ik}</math> is sampled by<br />
<br />
<center><br />
<math><br />
Pr(z_{ik}=1|Z_{-ik},W,Y) \propto m_k Pr(Y|z_{ik}=1,Z_{-ik},W)<br />
</math><br />
</center><br />
Regarding the number of features, they use the fact that in the IBP, the prior distribution on the number of new features for the last customer is <math>Poisson(\alpha/N)</math>. They mentioned that the number of new features should be weighted by the corresponding likelihood term.<br />
<br />
2. Given ''Z'', reample ''W''<br />
<br />
They sequentially resample each of the weights in ''W'' that correspond to non-zero features and drop the ones corresponding to the all-zero features. The difficulty is that we do not have a conjugate prior on ''W'', so direct resampling ''W'' from its posterior is infeasible. Some auxiliary sampling trick and MCMC procedures are used.<br />
<br />
3. Other issues<br />
<br />
Conjugate priors may be placed on the hyperparameters as well. In the case of multiple relations, one can sample ''W<sub>i</sub>'' given ''Z'' independently for each ''i''. In the full model, the posterior updates for the coefficients and intercepts are independent.<br />
<br />
==Simulations and real data results==<br />
===Synthetic data===<br />
The basic model is applied to simple synthetic datasets generated from known features. The basic model is able to attain 100% accuracy on held-out data. However, it reveal the problem that the model is not able to address the latent features. This is due to subtle interactions (confounding) between sets of features and weights. So the feature inferred will not in general correspond to interpretable features. It also indicates that there are local optima in the feature space, which means a good initialization is necessary.<br />
<br />
==Conclusion==<br />
<br />
<br />
==References==<br />
<references/></div>Lxinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=nonparametric_Latent_Feature_Models_for_Link_Prediction&diff=22858nonparametric Latent Feature Models for Link Prediction2013-08-12T15:06:07Z<p>Lxin: /* Introduction */</p>
<hr />
<div>==Introduction==<br />
The goal of this paper <ref>Kurt T. Miller, Thomas L. Griffiths, and Michael I. Jordan. Nonparametric latent feature models for link prediction. NIPS, 2009</ref>is link prediction for a partially observed network, i.e. we observe the links (1 or 0) between some pairs of the nodes in a network and we try to predict the unobserved links. Basically, it builds the model by extracting the latent structure that representing the properties of individual entities. Unlike the latent space model <ref>Peter D. Hoff, Adrian E. Raftery, and Mark S. Handcock. Latent space approaches to social network analysis. JASA, 97(460):1090-1098.</ref>, which tries to find the location (arbitrary real valued features) of each node in a latent space, the "latent feature" here mainly refers to a class-based (binary features) representation. They assume a finite number of classes that entities can belong to and the interactions between classes determine the structure of the network. More specifically, the probability of forming a link depends only on the classes of the corresponding pair of nodes. The idea is fairly similar to the stochastic blockmodel <ref>Krzysztof Nowicki and Tom A. B. Snijders. Estimation and prediction for stochastic blockstructures. JASA, 96(455):1077-1087, 2001. </ref> <ref>Edoardo M. Airoldi, David M. Blei, Eric P. Xing, and Stephen E. Fienberg. Mixed membership stochastic block models. In Advances in Neural Information Processing Systems. 2009.</ref>. However, the blockmodels are mainly for community detection/network clustering, but not link prediction. This paper fills in the gap.<br />
<br />
The ''nonparametric latent feature relational model'' is a Bayesian nonparametric model in which each node has binary-valued latent features that influences its relation to other nodes. Known covariates information can also be incorporated. The model can simultaneously infer the number (dimension) of latent features, the values of the features for each node and how the features influence the links.<br />
<br />
==The nonparametric latent feature relational model==<br />
Directed network is considered here. Let <math>Y</math> be the <math>N\times N</math> binary adjacency matrix of a network. The component <math>y_{ij}=1</math> if there is a link from node <math>i</math> to node <math>j</math> and <math>y_{ij}=0</math> if there is no link. The components corresponding to unobserved links are left unfilled. The goal is to learn from the observed links so that we can predict the unfilled entries.<br />
<br />
===Basic model===<br />
Let <math>Z</math> denote the latent features, where <math>Z</math> is a <math>N\times K</math> binary matrix. Each row of <math>Z</math> corresponds to a node and each column correspond to a latent feature such that <math>z_{ij}=1</math> if the <math>i^{th}</math> node has feature <math>k</math> and 0, otherwise. And let <math>Z_i</math> denote the <math>i^{th}</math> row of <math>Z</math> (the feature vector corresponding to node ''i''). Let ''W'' be a <math>K\times K</math> real-valued weight matrix where <math>w_{kk^\prime}</math> is the weight that affects the probability of a link when the corresponding nodes have features <math>k</math> and <math>k^'</math>, respectively. By assuming the link probabilities are conditional independent give the latent features and the weights, the likelihood function can be written as:<br />
<center><br />
<math><br />
Pr(Y|Z, W)=\prod_{i,j}Pr(y_{ij}|Z_i,Z_j, W):=\sigma(Z_i W Z_j^{\top})=\sigma(\sum_{k,k^'}z_{ik}z_{jk^'}w_{kk^'})<br />
</math><br />
</center><br />
where <math>\sigma(.)</math> is a function that transforms values from <math>(-\infty,\infty)</math> to <math>(0,1)</math>.<br />
<br />
Prior distributions are assumed for the latent features and the weight matrix, where ''Z'' is generated by Indian Buffet Process <ref>Thomas L. Griffiths and Zoubin Ghahramani. Infinite latent feature models and the Indian Buffet Process. In Advances in Neural Information Processing Systems, 2007</ref> and the component of ''W'' has normal prior, i.e.<br />
<center><br />
<math><br />
Z \sim IBP(\alpha)<br />
<br />
</math><br />
</center><br />
<center><br />
<math><br />
w_{kk^'}\sim \mathcal{N}(0, \sigma_w^2)<br />
</math><br />
</center><br />
<br />
===Full model===<br />
The covariates information of each node can be incorporate into the model. The full nonparametric latent feature relational model is <br />
<center><br />
<math><br />
Pr(y_{ij}=1|Z, W, X, \beta, a, b, c)=\sigma(Z_i W Z_j^\top +\beta^\top X_{ij}+(\beta_p^\top X_{p,i}+a_i)+(\beta_c^\top X_{c,i}+b_i)+c)<br />
</math><br />
</center><br />
where <math>X_{p,i},X_{c,j}</math> are known covariate vector when node ''i'' and ''j'' are link parent and child, respectively; <math>X_{ij}</math> is a vector of interaction effects; <math>\beta, \beta_p, \beta_c, a and b </math> are coefficients and offsets which all assumed to be normally distributed. We drop the corresponding terms if no information available.<br />
<br />
===Generalizations===<br />
The model can be easily generalized for multiple relations instead of a single relation. The latent features keep the same, but an independent weight matrix <math>W^i</math> is used for each relation <math>Y^i</math>. Covariates may be relation specific or common across all relations. By taking the weight matrix to be symmetric, the model can deal with undirected networks.<br />
<br />
==Inference==<br />
Exact inference for the proposed nonparametric latent feature model is infeasible. The authors adopt Markov Chain Monte Carlo (MCMC) for approximate inference (posterior inference on ''Z'' and ''W''). They alternatively sample from ''Z'' and ''W''. During the procedure, the all zero ''Z'' columns are dropped, since they do not provide any information.<br />
<br />
1. Given ''W'', resample ''Z''<br />
<br />
Since the IBP is exchangeable, so when sample the <math>i^{th}</math> row of ''Z'', they assume that the <math>i^{th}</math> customer is the last one in the process. Let <math>m_k</math> denote the number of non-zero entries in column ''k'', the component <math>z_{ik}</math> is sampled by<br />
<br />
<center><br />
<math><br />
Pr(z_{ik}=1|Z_{-ik},W,Y) \propto m_k Pr(Y|z_{ik}=1,Z_{-ik},W)<br />
</math><br />
</center><br />
Regarding the number of features, they use the fact that in the IBP, the prior distribution on the number of new features for the last customer is <math>Poisson(\alpha/N)</math>. They mentioned that the number of new features should be weighted by the corresponding likelihood term.<br />
<br />
2. Given ''Z'', reample ''W''<br />
<br />
They sequentially resample each of the weights in ''W'' that correspond to non-zero features and drop the ones corresponding to the all-zero features. The difficulty is that we do not have a conjugate prior on ''W'', so direct resampling ''W'' from its posterior is infeasible. Some auxiliary sampling trick and MCMC procedures are used.<br />
<br />
3. Other issues<br />
<br />
Conjugate priors may be placed on the hyperparameters as well. In the case of multiple relations, one can sample ''W<sub>i</sub>'' given ''Z'' independently for each ''i''. In the full model, the posterior updates for the coefficients and intercepts are independent.<br />
<br />
==Simulations and real data results==<br />
<br />
==Conclusion==<br />
<br />
<br />
==References==<br />
<references/></div>Lxinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=nonparametric_Latent_Feature_Models_for_Link_Prediction&diff=22857nonparametric Latent Feature Models for Link Prediction2013-08-12T15:05:47Z<p>Lxin: /* Inference */</p>
<hr />
<div>==Introduction==<br />
The goal of this paper <ref>Kurt T. Miller, Thomas L. Griffiths, and Michael I. Jordan. Nonparametric latent feature models for link prediction. NIPS, 2009</ref>is link prediction for a partially observed network, i.e. we observe the links (1 or 0) between some pairs of the nodes in a network and we try to predict the unobserved links. Basically, it builds the model by extracting the latent structure that representing the properties of individual entities. Unlike the latent space model <ref>Peter D. Hoff, Adrian E. Raftery, and Mark S. Handcock. Latent space approaches to social network analysis. JASA, 97(460):1090-1098.</ref>, which tries to find the location (arbitrary real valued features) of each node in a latent space, the "latent feature" here mainly refers to a class-based (binary features) representation. They assume a finite number of classes that entities can belong to and the interactions between classes determine the structure of the network. More specifically, the probability of forming a link depends only on the classes of the corresponding pair of nodes. The idea is fairly similar to the stochastic blockmodel <ref>Krzysztof Nowicki and Tom A. B. Snijders. Estimation and prediction for stochastic blockstructures. JASA, 96(455):1077-1087, 2001. </ref> <ref>Edoardo M. Airoldi, David M. Blei, Exic P. Xing, and Stephen E. Fienberg. Mixed membership stochastic block models. In Advances in Neural Information Processing Systems. 2009.</ref>. However, the blockmodels are mainly for community detection/network clustering, but not link prediction. This paper fills in the gap.<br />
<br />
The ''nonparametric latent feature relational model'' is a Bayesian nonparametric model in which each node has binary-valued latent features that influences its relation to other nodes. Known covariates information can also be incorporated. The model can simultaneously infer the number (dimension) of latent features, the values of the features for each node and how the features influence the links.<br />
<br />
==The nonparametric latent feature relational model==<br />
Directed network is considered here. Let <math>Y</math> be the <math>N\times N</math> binary adjacency matrix of a network. The component <math>y_{ij}=1</math> if there is a link from node <math>i</math> to node <math>j</math> and <math>y_{ij}=0</math> if there is no link. The components corresponding to unobserved links are left unfilled. The goal is to learn from the observed links so that we can predict the unfilled entries.<br />
<br />
===Basic model===<br />
Let <math>Z</math> denote the latent features, where <math>Z</math> is a <math>N\times K</math> binary matrix. Each row of <math>Z</math> corresponds to a node and each column correspond to a latent feature such that <math>z_{ij}=1</math> if the <math>i^{th}</math> node has feature <math>k</math> and 0, otherwise. And let <math>Z_i</math> denote the <math>i^{th}</math> row of <math>Z</math> (the feature vector corresponding to node ''i''). Let ''W'' be a <math>K\times K</math> real-valued weight matrix where <math>w_{kk^\prime}</math> is the weight that affects the probability of a link when the corresponding nodes have features <math>k</math> and <math>k^'</math>, respectively. By assuming the link probabilities are conditional independent give the latent features and the weights, the likelihood function can be written as:<br />
<center><br />
<math><br />
Pr(Y|Z, W)=\prod_{i,j}Pr(y_{ij}|Z_i,Z_j, W):=\sigma(Z_i W Z_j^{\top})=\sigma(\sum_{k,k^'}z_{ik}z_{jk^'}w_{kk^'})<br />
</math><br />
</center><br />
where <math>\sigma(.)</math> is a function that transforms values from <math>(-\infty,\infty)</math> to <math>(0,1)</math>.<br />
<br />
Prior distributions are assumed for the latent features and the weight matrix, where ''Z'' is generated by Indian Buffet Process <ref>Thomas L. Griffiths and Zoubin Ghahramani. Infinite latent feature models and the Indian Buffet Process. In Advances in Neural Information Processing Systems, 2007</ref> and the component of ''W'' has normal prior, i.e.<br />
<center><br />
<math><br />
Z \sim IBP(\alpha)<br />
<br />
</math><br />
</center><br />
<center><br />
<math><br />
w_{kk^'}\sim \mathcal{N}(0, \sigma_w^2)<br />
</math><br />
</center><br />
<br />
===Full model===<br />
The covariates information of each node can be incorporate into the model. The full nonparametric latent feature relational model is <br />
<center><br />
<math><br />
Pr(y_{ij}=1|Z, W, X, \beta, a, b, c)=\sigma(Z_i W Z_j^\top +\beta^\top X_{ij}+(\beta_p^\top X_{p,i}+a_i)+(\beta_c^\top X_{c,i}+b_i)+c)<br />
</math><br />
</center><br />
where <math>X_{p,i},X_{c,j}</math> are known covariate vector when node ''i'' and ''j'' are link parent and child, respectively; <math>X_{ij}</math> is a vector of interaction effects; <math>\beta, \beta_p, \beta_c, a and b </math> are coefficients and offsets which all assumed to be normally distributed. We drop the corresponding terms if no information available.<br />
<br />
===Generalizations===<br />
The model is easily generalized from a single relation to multiple relations. The latent features stay the same, but an independent weight matrix <math>W^i</math> is used for each relation <math>Y^i</math>. Covariates may be relation-specific or shared across all relations. By constraining the weight matrix to be symmetric, the model can also handle undirected networks.<br />
<br />
==Inference==<br />
Exact inference for the proposed nonparametric latent feature model is infeasible. The authors adopt Markov chain Monte Carlo (MCMC) for approximate posterior inference on ''Z'' and ''W'', alternately sampling ''Z'' and ''W''. During the procedure, all-zero columns of ''Z'' are dropped, since they provide no information.<br />
<br />
1. Given ''W'', resample ''Z''<br />
<br />
Since the IBP is exchangeable, when sampling the <math>i^{th}</math> row of ''Z'' they can treat the <math>i^{th}</math> customer as the last one in the process. Letting <math>m_k</math> denote the number of non-zero entries in column ''k'' (excluding row ''i''), the component <math>z_{ik}</math> is sampled according to<br />
<br />
<center><br />
<math><br />
Pr(z_{ik}=1|Z_{-ik},W,Y) \propto m_k Pr(Y|z_{ik}=1,Z_{-ik},W)<br />
</math><br />
</center><br />
Regarding the number of new features, they use the fact that in the IBP the prior distribution on the number of new features for the last customer is <math>Poisson(\alpha/N)</math>; this prior is then weighted by the corresponding likelihood term when sampling how many new features to add.<br />
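A sketch of one such Gibbs update for an existing feature entry <math>z_{ik}</math>, reusing the hypothetical ''log_likelihood'' from the earlier sketch; following the IBP, the prior weights are taken as <math>m_k</math> for <math>z_{ik}=1</math> and <math>N-m_k</math> for <math>z_{ik}=0</math>, and the new-feature step is handled separately as described above.<br />
<pre>
import numpy as np

def resample_z_entry(Y, Z, W, observed, i, k, rng):
    # One Gibbs update of z_ik for an existing feature k:
    #   Pr(z_ik = v | rest) propto prior(v) * Pr(Y | z_ik = v, Z_-ik, W),
    # with prior(1) = m_k and prior(0) = N - m_k (m_k counts the other rows).
    N = Z.shape[0]
    m_k = Z[:, k].sum() - Z[i, k]
    log_w = np.empty(2)
    for v, prior in ((0, N - m_k), (1, m_k)):
        Z[i, k] = v
        log_w[v] = np.log(max(prior, 1e-300)) + log_likelihood(Y, Z, W, observed)
    p1 = 1.0 / (1.0 + np.exp(log_w[0] - log_w[1]))  # normalized Pr(z_ik = 1)
    Z[i, k] = int(rng.random() < p1)
    return Z
</pre>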
<br />
2. Given ''Z'', resample ''W''<br />
<br />
They sequentially resample each of the weights in ''W'' that correspond to non-zero features and drop the ones corresponding to the all-zero features. The difficulty is that there is no conjugate prior on ''W'', so directly resampling ''W'' from its posterior is infeasible; auxiliary sampling tricks and additional MCMC procedures are used instead.<br />
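One simple (non-conjugate) way to realize such an update is a random-walk Metropolis step on each weight, sketched below under the same hypothetical ''log_likelihood''; the paper's actual auxiliary-variable scheme may differ.<br />
<pre>
import numpy as np

def resample_w_entry(Y, Z, W, observed, k, kp, sigma_w, step, rng):
    # Random-walk Metropolis update of a single weight W[k, kp] under a
    # N(0, sigma_w^2) prior and the Bernoulli likelihood of the links.
    def log_post(w):
        old = W[k, kp]
        W[k, kp] = w
        lp = -0.5 * (w / sigma_w) ** 2 + log_likelihood(Y, Z, W, observed)
        W[k, kp] = old
        return lp

    current = W[k, kp]
    proposal = current + step * rng.standard_normal()
    if np.log(rng.random()) < log_post(proposal) - log_post(current):
        W[k, kp] = proposal   # accept the proposed weight
    # otherwise keep the current value (W is unchanged)
    return W
</pre>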
<br />
3. Other issues<br />
<br />
Conjugate priors may be placed on the hyperparameters as well. In the case of multiple relations, one can sample ''W<sub>i</sub>'' given ''Z'' independently for each ''i''. In the full model, the posterior updates for the coefficients and intercepts are independent.<br />
<br />
==Simulations and real data results==<br />
<br />
==Conclusion==<br />
<br />
<br />
==References==<br />
<references/></div>