learning Spectral Clustering, With Application To Speech Separation
Introduction
The paper <ref>Francis R. Bach and Michael I. Jordan, Learning Spectral Clustering,With Application To Speech Separation, Journal of Machine Learning Research 7 (2006) 1963-2001.</ref> presented here is about spectral clustering which makes use of dimension reduction and learning a similarity matrix that generalizes to the unseen datasets when spectral clustering is applied to them.
Clustering
Clustering refers to partition a given dataset into clusters such that data points in the same cluster are similar and data points in different clusters are dissimilar. Similarity is usually measured over distance between data points.
Formally stated, given a set of data points [math]\displaystyle{ X=\{{{\mathbf x}}_1,{{\mathbf x}}_2,\dots ,{{\mathbf x}}_P\} }[/math], we would like to find [math]\displaystyle{ K }[/math] disjoint clusters [math]\displaystyle{ {C{\mathbf =}\{C_k\}}_{k\in \{1,\dots ,K\}} }[/math] such that [math]\displaystyle{ \bigcup{C_k}=X }[/math], that optimizes a certain objective function. The dimensionality of data points is [math]\displaystyle{ D }[/math], and [math]\displaystyle{ X }[/math] can be represented as a matrix [math]\displaystyle{ {{\mathbf X}}_{D\times P} }[/math].The similarity matrix that measures the similarity between each pair of points is denoted by [math]\displaystyle{ {{\mathbf W}}_{P\times P} }[/math]. A classical similarity matrix for clustering is the diagonally-scaled Gaussian similarity, defined as
[math]\displaystyle{ \mathbf W(i,j)= \rm exp (-(\mathbf{x}_i-\mathbf{x}_j)^{\rm T}Diag(\boldsymbol{\alpha})(\mathbf{x}_i-\mathbf{x}_j) ) }[/math]
where [math]\displaystyle{ {\mathbf \boldsymbol{\alpha} }\in {{\mathbb R}}^D }[/math] is a vector of positive parameters, and [math]\displaystyle{ \rm Diag(\boldsymbol{\alpha} ) }[/math] denotes the [math]\displaystyle{ D\times D }[/math] diagonal matrix with diagonal [math]\displaystyle{ {\boldsymbol{\alpha} } }[/math].
Objective functions
Objective function for K-means clustering
Given the number of clusters [math]\displaystyle{ K }[/math], it aims to minimize an objective function (sum of within-cluster distance) over all clustering scheme [math]\displaystyle{ C }[/math].
[math]\displaystyle{ \mathop{\min_C} J=\sum^K_{k=1}\sum_{\mathbf x \in C_k}\|\mathbf x - \boldsymbol{\mu}\|^2 }[/math]