kernel Spectral Clustering for Community Detection in Complex Networks
Abstract--This paper proposes a kernel spectral clustering approach for community detection in unweighted networks. The authors employ the primal-dual framework and make use of out-of-sample extension. They also propose a method to extract from a network a subgraph representative for the overall community structure. The commonly used modularity statistic is used as a model selection procedure. The effectiveness of the model is demonstrated through synthetic networks and benchmark real network data.
Backgrounds
Network
A network (graph) consists of a set of vertices or nodes and a collection of edges that connect pairs of nodes. A way to represent a network with N nodes is to use a similarity matrix S, which is an [math]\displaystyle{ N\times N }[/math] matrix. This paper only deals with unweighted networks, where the components of S is 1 or 0. The matrix S is called adjacency matrix and [math]\displaystyle{ S_{ij}=1 }[/math] if there is an edge connecting nodes i and j; otherwise, [math]\displaystyle{ S_{ij}=0 }[/math]. The degree matrix D is defined as a diagonal matrix with diagonal entries [math]\displaystyle{ d_i=\sum{_{j}}S_{ij} }[/math] indicating the degree of node i, i.e. the number of edges connected to node i.
Community Detection
The structure of networks sometimes reveals a high degree of organization. Nodes with similar properties/behaviors are more likely to be linked together and tend to form modules. Discovering such modules is called community detection, which is a hot topic in network data analysis. Moreover, once communities have been detected, the roles of the nodes in each community can be further investigated. The community structure of a graph can be used to give a compact visualization of large networks, i.e. one community can be treated as a big node. This kind of representation is called a supernetwork. There are other studies which use probabilistic generative models to model and detect community (such as <ref> Gopalan, Prem, Sean Gerrish, Michael Freedman, David M. Blei, and David M. Mimno. "Scalable inference of overlapping communities." In Advances in Neural Information Processing Systems, pp. 2258-2266. 2012.</ref> and <ref> Airoldi, Edoardo M., David M. Blei, Stephen E. Fienberg, and Eric P. Xing. "Mixed membership stochastic blockmodels." The Journal of Machine Learning Research 9 (2008): 1981-2014.</ref>)
Introduction
Community detection is indeed a clustering problem. Spectral clustering method is a standard technique for clustering, which is based on the eigendecomposition of a Laplacian matrix. A spectral clustering framework as a weighted kernel PCA with primal and dual representations is proposed by <ref>C. Alzate and J. A. K. Suykens, Multiway spectral clustering with out-of-sample extensions through weighted kernel PCA. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 2, pp. 335-347, 2010 </ref>. The main advantage is the extension of the clustering model to out-of-sample nodes. The clustering model can be trained on a small subset of the network and then be applied to the rest of the graph. In other words, it is baded on a predictive model previously learned and can be used for online clustering of huge growing networks. This is new to the field of community detection on networks. Other contributions of the paper includes the following two points. First, a criterion based on Modularity <ref>M. E. J. Newman, Modularity and community structure in networks, Proc. Natl. Acad. Sci. USA, vol. 103, no. 23, pp. 8577-8582, 2006</ref> is used for parameter and model selection. Second, regarding the selection of a small representative subgraph as training set, a method based on Expansion factor <ref>A. S. Maiya and T. Y. Berger-Wolf, Sampling community structure, in Proc. 19th ACM Intl. Conference on the World Wide Web, 2010.</ref> is proposed.
Kernel Spectral Clustering Model
General picture
The classical spectral clustering involves the study of the eigenspectrum of graph Laplacian matrices, i.e. L=D-S. For kernel spectral clustering model, a graph over the network needs to be build up in order to describe the similarity among the nodes in the kernel-based framework.
Primal-dual formulation
The kernel spectral clustering model is described by a primal-dual formulation. Suppose we have a network with N nodes and an adjacency matrix S. Let [math]\displaystyle{ x_i }[/math] denote the i-th row/column of S. Given Ntr training nodes [math]\displaystyle{ \{x_1, x_2, \ldots, x_{N_{tr}}\}\subset\mathbb{R}^N }[/math] and the number of communities k, the primal problem of spectral clustering via weighted kernel PCA is formulated as follows:
[math]\displaystyle{ \underset{w^{(l)},e^{(l)},b_l}{\operatorname{min}}\frac{1}{2}\sum_{l=1}^{k-1}w^{(l)^T}w^{(l)}-\frac{1}{2N}\sum_{l=1}^{k-1}r_le^{(l)^T}D_{\Omega}^{-1}e^{(l)} }[/math]
where
[math]\displaystyle{ e^{(l)}=\Phi w^{(l)}+b_l 1_{N_{tr}} }[/math] is the projection vector with length Ntr, l=1, 2,..., k-1 indicates the number of score variables needed to encode the k clusters;
[math]\displaystyle{ \Phi=[\phi(x_1),...,\phi(x_{N_{tr}})] }[/math] is the [math]\displaystyle{ N_{tr}\times d_h }[/math] feature matrix and [math]\displaystyle{ \phi:\mathbb{R}^N\rightarrow\mathbb{R}^{d_h} }[/math] is a mapping to a high-dimensional feature space;
[math]\displaystyle{ \{b_l\} }[/math] are bias terms;
[math]\displaystyle{ D_{\Omega}^{-1}\in \mathbb{R}^{N_{tr}\times N_{tr}} }[/math] is the inverse of the degree matrix associated to the kernel matrix [math]\displaystyle{ \Omega }[/math] (explained later);
[math]\displaystyle{ r_l \in \mathbb{R}^+ }[/math] are regularization constant.
The dual problem related to this primal formulation is:
[math]\displaystyle{ D_{\Omega}^{-1}M_D \Omega \alpha^{(l)}=\lambda_l \alpha^{(l)} }[/math]
where
[math]\displaystyle{ \Omega }[/math] is the kernel matrix with ij-th entry [math]\displaystyle{ \Omega_{ij}=K(x_i,x_j)=\phi(x_i)^T\phi(x_j) }[/math];
[math]\displaystyle{ D_{\Omega} }[/math] is the diagonal matrix with diagonal entries [math]\displaystyle{ d_i^{\Omega}=\sum\limits_{j}\Omega_{ij} }[/math];
[math]\displaystyle{ M_D }[/math] is a centering matrix defined as [math]\displaystyle{ M_D=I_{N_{tr}}-(1/1_{N_{tr}}^T D_{\Omega}^{-1}1_{N_{tr}})(1_{N_{tr}}1_{N_{tr}}^T D_{\Omega}^{-1}) }[/math];
[math]\displaystyle{ \{\alpha^{(l)}\} }[/math] are dual variables.
The kernel function [math]\displaystyle{ K:\mathbb{R}^N \times \mathbb{R}^N \rightarrow \mathbb{R} }[/math] plays the role of the similarity function of the network. The community kernel <ref>Y. Kang and S. Choi, Kernel PCA for community detection, in Business Intelligence Conference, 2009</ref> is proposed to build up the similarity matrix of the graph. The kernel [math]\displaystyle{ K(x_i, x_j) }[/math] is defined as the number of edges connecting the common neighbors of nodes i and j, i.e.
Encoding/decoding scheme
In the ideal case of k well separated clusters and properly chosen kernel parameters, the matrix [math]\displaystyle{ D_{\Omega}^{-1} M_D \Omega }[/math] has k-1 peicewise constant eigenvectors with eigenvalue 1.The codebook [math]\displaystyle{ \mathcal{CB}=\{c_p\}_{p=1}^k }[/math] can be obtained in the training process from the row of the binarized projected variables matrix for training data [math]\displaystyle{ [sign(e^{(1)}),\ldots,sign(e^{(k)})] }[/math]. The cluster indicators for the out-of-sample points is by obtained by [math]\displaystyle{ sign(e_{test}^{(l)})=sign(\Omega_{test}\alpha^{(l)}+b_l 1_{N_{test}}) }[/math].
Model Selection Criterion
Modularity is a quality function to evaluate the community structure of networks. The idea is to compare the actual density of edges with the density of a random graph. Modularity can be either positive or negative, with positive high values indicating the possible presence of a strong community structure. The modularity is defined as
[math]\displaystyle{ Q=\frac{1}{2m}\sum_{ij}(S_{ij}-\frac{d_id_j}{2m})\delta_{ij} }[/math]
where di is the degree of node i; m is the overall degree of the network and the Kronecker delta [math]\displaystyle{ \delta_{ij} }[/math] indicates whether or not node i and j belong to the same community. The model with the highest Modularity value will be selected.
It can be shown that to find the highest Modularrity value is equivalent to the following optimization problem:
[math]\displaystyle{ \underset{X}{\max}\left\{ tr(X^{T}MX) \right\}, s.t. X^{T}X=D^{M} }[/math]
where
[math]\displaystyle{ M=S-\frac{1}{2m}dd^{T} }[/math] is the Modularity matrix or Q-laplacian
[math]\displaystyle{ d=[d_{1},\cdots,d_{N}] }[/math] indicates the vector of the degrees of each node.
[math]\displaystyle{ D^{M} \in \mathbb{R}^{k \times k} }[/math] is a diagonal matrix with the i-th diagonal entry being the number of nodes in the i-th cluster.
[math]\displaystyle{ X }[/math] represents the cluster indicator matrix
Selecting a Representative Subgraph
The expansion factor (EF) is used to select the subgraph.
A greedy strategy for optimizing EF is provided in the paper:
Algorithm EF
Input:
network of [math]\displaystyle{ N }[/math] nodes [math]\displaystyle{ \mathcal{V}=\left\{ n_{i} \right\}^{N}_{i=1} }[/math](represented as a [math]\displaystyle{ N \times N }[/math] adjacency matrix [math]\displaystyle{ A }[/math])
size of subgraph [math]\displaystyle{ m }[/math]
Output: active set of [math]\displaystyle{ m }[/math] selected nodes
1). select randomly an initial subgraph [math]\displaystyle{ \mathcal{G}=\left\{ n_{j} \right\}^{m}_{j=1} \subset \mathcal{V} }[/math]
2). compute [math]\displaystyle{ EF(\mathcal{G}) }[/math]
3). randomly pick two nodes as [math]\displaystyle{ n_{*}\in \mathcal{V} }[/math] and [math]\displaystyle{ n_{+} \in \mathcal{V}-\mathcal{G} }[/math]
4). let [math]\displaystyle{ m }[/math]
5). if [math]\displaystyle{ EF(\mathcal{W})\gt EF(\mathcal{G}) }[/math], swap([math]\displaystyle{ \left\{ n_{*} \right\},\left\{ n_{+} \right\} }[/math])
6). Repeat the above procedure untill change in EF value is too small (comparing to a threshold specified by the user)
7). reture the final result [math]\displaystyle{ \mathcal{G} }[/math]
The time for selection depends on the size of the entire network [math]\displaystyle{ N }[/math] and its density, the chosen size [math]\displaystyle{ m }[/math], and the threshold [math]\displaystyle{ \epsilon }[/math].
References
<references/>