Adaptive Dimension Reduction for Clustering High Dimensional Data

Latest revision as of 09:46, 30 August 2017
1. Introduction
Clustering methods such as K-means and EM suffer from the local minima problem. In high dimensional space, the cost function surface is very rugged, and it is easy to get trapped somewhere close to the initial configuration. The conventional remedy is to try a number of initial values and pick the best result.
The paper proposes an approach to tackle this problem based on the idea of dimension reduction. Specifically, the authors (i) treat dimension reduction as a dynamic process that is adaptively adjusted and integrated with the clustering process, and (ii) make effective use of cluster membership to connect the reduced-dimensional space and the full-dimensional space.
The paper focuses on the K-means and EM algorithms using a mixture model of spherical Gaussian components, which retains identical model parameters in the reduced low-dimensional subspace as in the high-dimensional space.
2. Effective Dimension for Clustering
The idea is to perform clustering in low-dimensional subspaces. The low-dimensional subspace is interpreted as containing the relevant attributes (linear combinations of coordinates). Based on linear discriminant analysis, the dimensionality [math]r[/math] of the subspace should satisfy [math] r \leq K-1[/math]. Thus the effective clustering dimensions for the [math]K[/math] spherical Gaussians are spanned by the [math]K[/math] centers [math] \mu_1, \dots, \mu_K[/math], for which [math] r=K-1 [/math]. We call the [math]r[/math]-dim subspace passing through all [math]K[/math] centers the relevant subspace. The effective dimensionality of the relevant subspace can be less than [math]K-1[/math]; this happens when the [math]K[/math] cluster centers lie in a subspace of dimension [math]r \lt K-1[/math].
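As a quick numerical illustration of the [math] r \leq K-1[/math] bound (a sketch with made-up centers, not an example from the paper): after centering, any [math]K[/math] points span at most [math]K-1[/math] dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 4, 20
centers = rng.normal(size=(K, d))   # K hypothetical cluster centers in d-dim

# Centered centers sum to zero, so they span at most K-1 dimensions.
rank = np.linalg.matrix_rank(centers - centers.mean(axis=0))
print(rank)   # K-1 = 3 for generic centers
```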
3. EM in relevant subspace
The idea: the irrelevant dimensions can be integrated out, and the resulting marginal distribution follows the same Gaussian mixture functional form. One can therefore move freely between the subspace and the original space: knowing the centers in the subspace, we can directly infer the centers in the original space. Assume the following mixture model:
[math]p(x)=\pi_1g_1^d(x|\mu_1)+\cdots+\pi_Kg_K^d(x|\mu_K)[/math]
where each component is a spherical Gaussian distribution,
[math]g_k^d(x)=\frac{1}{(\sqrt{2\pi}\sigma_k)^d} \exp\left(-\frac{\|x-\mu_k\|^2}{2\sigma_k^2}\right)[/math] and [math]x[/math], [math]\mu_k[/math] are vectors in [math]d[/math]-dim space. This distribution is denoted [math] N^{(d)}(\mu_k,\sigma_k) [/math]. Spherical Gaussian functions have two invariance properties: (i) they remain invariant under any orthogonal coordinate rotation [math] R: x\leftarrow Rx[/math], i.e. [math]g_k^d(Rx|R\theta)=g_k^d(x|\theta)[/math]; (ii) they remain invariant under any coordinate translation (shift) operation [math]L[/math]: [math] g_k^d(Lx|L\theta)=g_k^d(x|\theta)[/math].
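These two invariance properties can be checked numerically. The sketch below (with arbitrary test vectors and a random orthogonal matrix, not taken from the paper) verifies rotation and translation invariance of the spherical Gaussian density.

```python
import numpy as np

def spherical_gaussian_pdf(x, mu, sigma):
    """Density of N^(d)(mu, sigma): a spherical Gaussian in d dimensions."""
    d = len(mu)
    diff = x - mu
    norm_const = (np.sqrt(2 * np.pi) * sigma) ** d
    return np.exp(-diff @ diff / (2 * sigma ** 2)) / norm_const

rng = np.random.default_rng(0)
d = 4
x = rng.normal(size=d)
mu = rng.normal(size=d)
sigma = 1.5

# Random orthogonal matrix R (Q factor of a Gaussian matrix).
R, _ = np.linalg.qr(rng.normal(size=(d, d)))

# Property (i): g^d(Rx | R mu) = g^d(x | mu)
assert np.isclose(spherical_gaussian_pdf(R @ x, R @ mu, sigma),
                  spherical_gaussian_pdf(x, mu, sigma))

# Property (ii): a common shift t of point and center leaves the density unchanged.
t = rng.normal(size=d)
assert np.isclose(spherical_gaussian_pdf(x + t, mu + t, sigma),
                  spherical_gaussian_pdf(x, mu, sigma))
```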
Theorem 1:
In EM clustering using spherical Gaussian mixture models in [math]d[/math] dimensions, after integrating out the irrelevant dimensions, the marginal probability becomes
[math] p(y)=\pi_1g_1^r(y|\upsilon_1)+ \cdots +\pi_Kg_K^r(y|\upsilon_K) [/math]
which is exactly the same type of Gaussian mixture, now in the [math]r[/math]-dim space.
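A quick sanity check of Theorem 1 (an illustration with made-up mixture parameters, not a proof): sampling from a [math]d[/math]-dim spherical mixture and keeping the first [math]r[/math] coordinates leaves each component's weight, projected mean, and spread [math]\sigma_k[/math] unchanged.

```python
import numpy as np

rng = np.random.default_rng(4)
# Two spherical Gaussians in d = 5; keep only the first r = 2 coordinates.
d, r, n = 5, 2, 200_000
mu = np.array([[2.0, 0, 1, 1, -1],
               [-2.0, 1, 0, -1, 1]])
z = rng.random(n) < 0.3                        # component labels, pi = (0.3, 0.7)
X = np.where(z[:, None],
             rng.normal(mu[0], 1.0, (n, d)),
             rng.normal(mu[1], 1.0, (n, d)))
Y = X[:, :r]                                   # "integrate out" the last d - r dims

# Each component's projection keeps its projected mean and its sigma = 1.
for k, member in [(0, z), (1, ~z)]:
    comp = Y[member]
    assert np.allclose(comp.mean(axis=0), mu[k, :r], atol=0.02)
    assert np.allclose(comp.std(axis=0), 1.0, atol=0.02)
```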
4. Adaptive Dimension Reduction for EM
The centers (or centroids in K-means) obtained by clustering in the [math]r[/math]-dim subspace can be uniquely traced back to the original [math]d[/math]-dim space by using the cluster membership of each data point. This observation is the basis of the ADR-EM clustering.
The cluster membership information is contained in the posterior probability [math]h_i^k[/math]. The EM algorithm is as follows:
(i) Initialize model parameters [math] \pi_k, \upsilon_k, \sigma_k[/math].
(ii) Compute the posteriors [math] h_i^k [/math].
(iii) Update:
compute the number of points belonging to cluster [math] c_k[/math]: [math] n_k=\sum_i h_i^k[/math];
update priors: [math] \pi_k=n_k/N [/math];
update centers: [math] \upsilon_k=\sum_i h_i^ky_i/n_k[/math];
update variances: [math] \sigma_k^2=\sum_i h_i^k\|y_i-\upsilon_k\|^2/(rn_k)[/math].
(iv) Repeat steps (ii) and (iii) until convergence.
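The E and M steps above can be sketched in numpy as follows. This is a minimal illustration, not the authors' code; the farthest-point initialization is my own choice, since the paper does not prescribe one.

```python
import numpy as np

def em_spherical(Y, K, n_iter=100, seed=0):
    """EM for a mixture of K spherical Gaussians on N x r data Y,
    following steps (i)-(iv) above."""
    rng = np.random.default_rng(seed)
    N, r = Y.shape
    # (i) initialize: uniform priors, farthest-point centers, common spread
    pi = np.full(K, 1.0 / K)
    centers = [Y[rng.integers(N)]]
    for _ in range(1, K):
        d2 = np.min([((Y - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(Y[np.argmax(d2)])
    centers = np.array(centers)
    sigma = np.full(K, Y.std())
    for _ in range(n_iter):
        # (ii) E-step: posteriors h_i^k proportional to pi_k g_k^r(y_i | v_k)
        sq = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)  # N x K
        log_h = np.log(pi) - sq / (2 * sigma ** 2) - r * np.log(sigma)
        h = np.exp(log_h - log_h.max(axis=1, keepdims=True))
        h /= h.sum(axis=1, keepdims=True)
        # (iii) M-step: cluster sizes, priors, centers, spreads
        n_k = h.sum(axis=0)
        pi = n_k / N
        centers = (h.T @ Y) / n_k[:, None]
        sq = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        sigma = np.sqrt((h * sq).sum(axis=0) / (r * n_k))
    return pi, centers, sigma, h

# Hypothetical 2-cluster example in r = 2 dims
rng = np.random.default_rng(1)
Y = np.vstack([rng.normal(-3, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
pi, centers, sigma, h = em_spherical(Y, K=2)
```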
The complete ADR-EM algorithm:
(i) Center the data. Rescale the data such that [math]\Sigma=I[/math]. Choose an appropriate K as input parameter and a dimensionality r for the reduced subspace. In general, r=K-1 is recommended, but r=K or [math] r \lt K-1[/math] are also appropriate.
(ii) Do the first dimension reduction using PCA or any other method, including random starts.
(iii) Run EM in the r-dim subspace to obtain clusters. Use the cluster memberships to construct cluster centroids in the original space. Check convergence; if converged, go to step (v).
(iv) Compute the new r-dim subspace spanned by the K centroids using either an SVD or QR basis. Project the data into this new subspace and go to step (iii).
(v) Output the results, converting posterior probabilities to discrete indicators. The relevant attributes are also identified.
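The whole adaptive loop can be sketched as follows. For brevity this uses the K-means flavor of the inner clustering (the [math]\sigma \rightarrow 0 [/math] limit of EM, see the next section); the farthest-point initialization and the QR basis in step (iv) are implementation choices of this summary, not necessarily the paper's.

```python
import numpy as np

def kmeans_subspace(Y, K, rng, n_iter=50):
    """Plain K-means with farthest-point initialization (a common heuristic)."""
    centers = [Y[rng.integers(len(Y))]]
    for _ in range(1, K):
        d2 = np.min([((Y - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(Y[np.argmax(d2)])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = ((Y[:, None] - centers[None]) ** 2).sum(axis=-1).argmin(axis=1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = Y[labels == k].mean(axis=0)
    return labels

def adr_kmeans(X, K, r=None, n_outer=20, seed=0):
    """Steps (i)-(v): center, PCA start, cluster in the subspace, rebuild
    full-space centroids from memberships, re-project until the subspace
    stops changing."""
    rng = np.random.default_rng(seed)
    X = X - X.mean(axis=0)                      # (i) center the data
    r = K - 1 if r is None else r
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    basis = Vt[:r].T                            # (ii) initial PCA basis, d x r
    proj_old = np.zeros((X.shape[1], X.shape[1]))
    for _ in range(n_outer):
        Y = X @ basis                           # project into the r-dim subspace
        labels = kmeans_subspace(Y, K, rng)     # (iii) cluster there
        # rebuild the K centroids in the ORIGINAL d-dim space via membership
        mu = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        q, _ = np.linalg.qr(mu.T)               # (iv) basis of the centroid span
        basis = q[:, :r]
        proj = basis @ basis.T
        if np.allclose(proj, proj_old, atol=1e-8):
            break                               # (v) subspace converged
        proj_old = proj
    return labels, mu
```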
5. Adaptive Dimension Reduction for Kmeans
K-means clustering can be viewed as a special case of EM with three simplifications: (i) [math]\sigma_1 = \cdots =\sigma_K=\sigma[/math]; (ii) [math] \pi_1=\cdots =\pi_K[/math]; (iii) the limit [math]\sigma \rightarrow 0 [/math].
In K-means clustering we use hard group assignment, while the mixture of Gaussians uses soft/probabilistic group assignment.
Theorem 2:
Suppose we somehow know the correct [math]r[/math]-dim relevant subspace defined by [math]R_r[/math]. Let [math]Y=R_r^TX=R_r^T(x_1,\cdots, x_n)[/math] and let [math]C_{\upsilon} =[\upsilon_1, \cdots, \upsilon_K][/math] be the K centroids in the r-dim subspace. Solve the K-means problem in the r-dim subspace,
[math] \underset{C_{\upsilon}}{\text{minimize}}\ J_r(Y,C_{\upsilon})[/math]
Use the cluster membership [math]H=(h_i^k) [/math] obtained to reconstruct the K centers [math]C_{\mu}^* = [ \mu_1^*, \cdots , \mu_K^*] [/math] in the full-dimensional space. Then [math]C_{\mu}^*[/math] is the exact optimal solution to the full-dimension K-means problem.
By Theorem 2, we only need to find the relevant subspace, which is much easier than finding [math]C_\mu^*[/math] directly; this is the usefulness of Theorem 2. The adaptive dimension reduction K-means is based on this theorem, and the complete ADR-Kmeans algorithm is identical to the ADR-EM algorithm.
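Theorem 2 can be illustrated numerically on synthetic clusters (an illustration, not a proof of the exact-optimality claim). Here hard assignment to the projected centers stands in for exactly solving the subspace K-means, which is reasonable because the clusters are well separated; the centroids rebuilt from the subspace memberships then recover the full-space centers up to sampling noise.

```python
import numpy as np

rng = np.random.default_rng(3)
# Three spherical Gaussian clusters in d = 10 dimensions (synthetic).
true_mu = np.zeros((3, 10))
true_mu[0, 0], true_mu[1, 1], true_mu[2, 2] = 6.0, 6.0, -6.0
X = np.vstack([rng.normal(m, 1.0, (80, 10)) for m in true_mu])

# Relevant subspace R_r: r = K-1 = 2 dims spanned by the centered true centers.
Xc = X - X.mean(axis=0)
q, _ = np.linalg.qr((true_mu - true_mu.mean(axis=0)).T)
R_r = q[:, :2]                         # d x r orthonormal basis
Y = Xc @ R_r                           # data in the relevant subspace

# Hard assignment to the projected centers (the subspace K-means solution here).
v = (true_mu - X.mean(axis=0)) @ R_r   # centers seen in the subspace
labels = ((Y[:, None] - v[None]) ** 2).sum(axis=-1).argmin(axis=1)

# Reconstruct the full-space centroids mu* from the subspace memberships.
mu_star = np.array([X[labels == k].mean(axis=0) for k in range(3)])
print(np.abs(mu_star - true_mu).max())  # close to the true centers
```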
6. Experiments
They perform three experiments to assess the performance of the method.
The first experiment is based on synthetic data of 3 overlapping Gaussian clusters in 4-dimensional space. It is shown that the method successfully finds the clusters.
The second experiment is done on a microarray gene expression dataset, whose challenging feature is its high dimensionality. The method is shown to cluster most of the points properly.
The last experiment is performed on internet newsgroup clustering. It is shown that the method works better than simple K-means clustering in the original space.
7. Discussion
The algorithm in the paper is proposed to overcome a dilemma: the cluster centers found in the low-dimensional space produced by PCA or other dimension reduction methods are not the optimal cluster centers in the original space, while heuristic optimization methods such as K-means or EM run in the original high-dimensional space are easily trapped in local minima. In the paper, the optimal subspace for clustering is defined as the (K-1)-dimensional space spanned by the K cluster centers, assuming the cluster centers are optimal. An iterative algorithm of joint unsupervised dimension reduction and clustering is developed. A couple of points might require further consideration.
Question: While the standard EM algorithm is proven to converge and has a sound statistical interpretation, the proposed algorithm in the paper is not guaranteed to converge, and even if it finally converges, the solution is not guaranteed to be close to the global optimum.
Hi Cheng Huan:
As for the question mentioned above, I think the adaptive method can guarantee a local optimum. Theorems 1 and 2 are the keys to the effectiveness of this adaptive method. Theorem 1 states that integrating out the irrelevant dimensions yields the same type of Gaussian distribution in the lower dimension, and thus it reduces the convergence of EM in the high-dimensional space to its convergence in the low-dimensional space; a similar idea holds for Theorem 2. As the objective function becomes smoother in the lower-dimensional space, the adaptive method reduces the risk of getting trapped in a local optimum that is far from the global one. I am not sure why you conclude this method cannot guarantee a local optimum. If you see my comment, could you please reply with your explanation of why this adaptive method might diverge? Thanks.
Victor Zikun Xu