# inductive Kernel Low-rank Decomposition with Priors: A Generalized Nystrom Method

## Introduction

Low-rankness is an important structure widely exploited in machine learning. Low-rank matrix decomposition produces a compact representation of large matrices, which is the key to scaling up a great variety of kernel learning algorithms. However there are still some concerns with existing approaches. First, most of them are intrinsically unsupervised and only focus on numerical approximation of given matrices i.e. cannot incorporate prior knowledge. Second, many decomposition methods, the factorization can only be computed for samples available in the training stage, it difficult to generalize the decomposition to new samples.

This paper introduces a low-rank decomposition algorithm by generalizing the Nystrom method that incorporates side information. The novelty is to provide an interpretation of the matrix completion view of Nystrom method as a bilateral extrapolation of a dictionary kernel, and generalize it to incorporate prior information in computing improved low-rank decompositions. The author claims the two advantages of the method are its generative structure and linear complexity in sample size.

Nystrom method was originated from solving integral equations and was introduced to machine learning community by Williams et al.<ref> Williams, C. and Seeger, M. Using the Nystrom method to speed up kernel machine. Advances in Neural Information Processing System 13, 2001. </ref> Fowlkes et al. <ref> Fowlkes, C., Belongie, S. Chung, F., and Malik, J. Spectral grouping using Nystrom Method. IEEE Transactions on Pattern Analysis and Machine Intellgence, 26(2): 214- 225, 2004.

</ref>. Given a kernel function $k(.,.)$ and a sample set with underlying distribution $p(.)$, the Nystrom method aims at solving the following integral equation
$\int k(x,y)p(y)\Phi_i(y)dy = \lambda_i\Phi_i(x)$
Here $\phi_i(x)$ and $\lambda_i$ are the ith eigen function and eigen value of the operator $k(.,.)$ with regard to $p$. The idea is to draw a set of $m$ samples $Z$, called landmark points, from the underlying distribution and approximate the expectation with the empirical average as
$\frac{1}{m}\sum_{j=1}^{m}k(x,z_j)\Phi_i(z_j) = \lambda_i\Phi_i(x)$
, by choosing $x$ as $z_1, z_2,...,z_m$ as well, the followig eigenvalue decomposition can be obtained $W\Phi_i = \lambda_i\Phi_i$, where $W$ a $m$ by $m$ is the kernel matrix defined on landmark points, $\Phi_i$ is a m by 1 matrix and $\lambda_i$ are the ith eigenvector and eigenvalue of $W$. In practice, given a large dataset, the Nystrom method selects $m$ landmark points $Z$ with $m\lt \lt n$ and computes the eigenvalue decomposition of $W$. Then the eigenvectors of $W$ are extrapolated to the whole sample set. Te whole n by n kernel matrix $K$ can by implicitly reconstructed by
$K\approx EW^{\dagger}E^{T}$
where $W^{\dagger}$ is the pseudo-inverse, and E is the kernel matrix defined on the sample set and landmark points. The Nystrom method requires $O(mn)$ space and $O(m^2n)$time, which are linear in sample size.

## Bilateral Extrapolation of Dictionary Kernel

The ijth component of the kernel matrix is constructed as
$K_{ij} = E_iW^{\dagger}E_j^T$
Where, $E_i$ is the ith row of the extrapolation matrix $E$, i.e. the similarity between any two points $x_i$ and $x_j$ is constructed by first computing their respective similarities to the landmark set nd then modulated by the inverse of the similarities among the landmark points $W^{\dagger}$.

Proposition 1 Given m landmark points $Z$, use $K_{ij} = E_iW^{\dagger}E_j$to construct the similarity between any two samples,$x_i$ and $x_j$. Let $z_p$ and $z_q$ be the closest landmark point to $x_i$and $x_j$, respectively. Let $d_p=\| x_i - z_p\|$, and $d_p=\| x_j - z_q\|$. Let the kernel function $k(.,.)$ satisfy $k(x,y) − k(x,z) \le \eta \|y-z\|$, and $c = m\times \max k(.,.)$. Then the reconstructed similarity $K_{ij}$ and the pqth entry of the W will have the following relation
$|K-W| \le \sqrt{m}\eta(cd_p+cd_q+\sqrt{m}\eta d_p d_p)\|W^{\dagger}\|_F$

The inequality provide a bound on the distance between $K_{ij}$ and $W_{pq}$, this means that similarity matrix $W$ can serve as a dictionary kernel whose entries are extrapolated bilaterally onto any pairs of samples $(x_i, x_j)$ according to the proximity relation between landmark points and samples.

This kernel extrapolation view of the Nystrom method inspires us to generalize it to handle prior constraint in learning a low-rank kernel.

## Including Side Information

The quality of the dictionary can impact the whole kernel matrix, in the original Nystrom method the dictionary kernel W is simply the computed as the pairwise similarity between landmark points. The intuition is to incorporate side information to construct a appropriate dictionary kernel. Suppose we are given a set of labeled and unlabeled samples(semi-supervised learning), the task is to learn (the inverse of) a new dictionary kernel, denoted by S , subject to two considerations. First, the reconstructed kernel $ESE^T$ should preserve the structure of the original kernel K, since K encodes important pairwise relation between samples (incorporate unsupervised information). Second, the reconstructed kernel based on the labelled samples, $E_lSE^T_l$, should be consistent with the given side information (incorporate supervised information).

To satisfy both criteria the paper arrives at the following optimization problem:
$min_{S\in R^{m \times m}} \lambda \|S-S_0\|^2_F+\|E_lSE_l^T - K^*_l\|^2_F$ s.t. $S \succeq 0.$

Here, $S_0 = W^{\dagger}$ and $W^{\dagger}$ is from the standard Nystrom method. $K_l^*$ is the ideal kernel, it equals to 1 when $x_i$ and $x_j$ are in the same class, and 0 otherwise. The first term of the objective function corresponds to the first consideration and the second term of the objective corresponds to the second. The reason for choosing the euclidean norm is that minimizing the Euclidian distance is related to maximizing the alignment. We can then use the normalized kernel alignment score afterwards as an in- dependent measure to choose the hyper-parameter $\lambda$.

## Side Information as Grouping Constraints

The side information can used as grouping constraint and then be incorporated into the objective function. Given a set of grouping constraints denoted by $I$. Let $\Chi_I$ be the subset of samples with such constraints. Then define $T\in R^{\Chi_I\times\Chi_I}$ with $T_{ij} = 1$ if $(x_i,x_j) \in \Chi_I$ and 0 otherwise. Then the objective function becomes,
$min_{S\in R^{m \times m}} \lambda \|S-S_0\|^2_F+\|T\cdot (E_lSE_l^T) - K^*_l\|^2_F$ s.t. $S \succeq 0.$

## Optimization

Since both the objective function and the positive semidefinite constraints are convex, there exists a global optimal. The paper uses a gradient mapping strategy <ref name="Nemi1994"> Nemirovski, A. Efficient method in convex programming. Lecture Notes 1994.

</ref> to find the optimal solution. The method has a iterative gradient descent step and projection step. Given an initial solution $S^{(t)}$ we update it by $S^{(t+1)} = S^{(t)}+ \eta^{(t)}\nabla_{S^{(t)}}$ where $\nabla_{S ^{(t)}}$ is the gradient of the objective function at $S$.
$\nabla S = 2\lambda(S_0-S) +2E_l^T(E_lSE_l^T-K^*)E_l$
The step length $\eta^{(t)}$is determined by the Armijo Goldstein rule<ref name="Nemi1994">

Reference </ref>.

The descent step:

$B_{A}^*={\underset{B \succeq 0}{\text{arg min}}} \ tr({\nabla}_{S^{(t)}}B)+{\frac A 2} \| B-S^{(t)} \|_F^2$;

while $J(B_A^*)\gt J(S^{(t)})+tr({\nabla}_{S^(t)}(B_A^*-S^{(t)}))+{\frac A 2} \| B_A^*-S^{(t)} \|_F^2$

$\ \qquad$ $\ \qquad$ Increase A by a constant times;

$\ \qquad$ $\ \qquad$ $B_{A}^*={\underset{B \succeq 0}{\text{arg min}}} \ tr({\nabla}_{S^{(t)}}B)+{\frac A 2} \| B-S^{(t)} \|_F^2$;

end while

finish descent step.

After the descent step, we project the iterate $S^{t+1)}$ onto the set of positive semi-definite cones as follows
$S^{(t+1)} = U^{(t+1)}\Lambda_{+}^{(t+1)}(U^{(t+1)})^T$
where $U^{(t+1)}$ and $\Lambda^{(t+1)}$ are the eigenvectors and eigenvalues of $S^{(t+1)}$.

## Initialization

The paper proposes a closed-form initialization which helps us quickly locate the optimal solution.The idea is to drop the positive semi-definite constraint in the objective and compute the vanishing point of the gradient. which gives,
$\lambda S+E_l^TE_lSE_l^TE_l=E_l^TK_l^*E_l+\lambda S_0$
Then we have,
$S+PSP^T = Q$
where
$P=\frac{1}{\sqrt{\lambda}}(E_l^TE_l)$
$Q=S_0+\frac{1}{\lambda}E_l^TK^*_lE_l$

Suppose the diagonalization of $P$ is $P = U \Lambda U^T$, and define $S=U \tilde SU^T$, $Q= U\tilde QU^T$, then it can be written as

$U\tilde SU^T + U\Lambda \tilde S \Lambda U^T = U \tilde QU^T$
$\tilde S + \Lambda\tilde S\Lambda^T = \tilde Q$
Since $\Lambda$ is diagonal, this gives us $m^2$ equations
$\tilde S_{ij}+\Lambda_{ii}\Lambda_{jj} \tilde S_{ij} = Q_{ij}, 1 \le i, j \le m.$

Therefore we have a closed form solution of S, as $S=U \tilde S U^T$ where $[ \tilde S]_{ij} = \frac{ \tilde Q_{ij}}{1+\Lambda_{ii}\Lambda_{jj}}$

After computing $S$, we then project it onto the set of positive semi-definite cones. Such an initial solution can be deemed as the closest positive semi-definite matrix to the unconstrained version of the objective function.

## Landmark Selection

Selection of landmark points $Z$ in Nystrom method can greatly affect its performance. The authors used the k-mean based sampling scheme by Zhang & Kwok<ref> Zhang, K. and Kwok, J. Clustered Nystrom method for large scale manifold learning and dimension reduction. IEEE Transactions on Neural Networks 21:1576-1587, 2010 </ref> . In which, the authors first use k-mean clustering to group the data, and pick the centroids of each cluster as the landmarks for the Nystrom method.

## Complexities

The space complexity of the proposed algorithm is $O(mn)$, where n is sample size and m the number of landmark points. Computationally, it requires repeated eigenvalue decomposition of $m \times m$ matrices, and a single multiplication between the $n \times m$ extrapolation E and the $m \times m$ dictionary kernel S. The overall complexity is $O(m^2n)+O(tlog(\mu_max)m^3)$ where t is the number of gradient mapping iterations, and $\mu_{max}$ is the maximum eigenvalue of the Hessian. The algorithm has linear time and space complexity.

## Selecting Hyper-parameter

The hyper-parameter $\lambda$ can be difficult to choose if the side information is limited. The authors propose a heuristic to choose it. The two residuals $S_0 -S$ and $E_lSE_l^T - K_l^*$ of the objective function are additive and requires a tradeoff parameters $\lambda$. The normalized kernel alignment (NKA) <ref name="Cortes2010"> Cortes, C., Mohri, M. and Rostamizadeh, A. Two stage learning kernel algorithm. International Conference on Machine Learning, 2010.

</ref> between kernel matrices,
$\rho[K_1, K_2] = \frac { \langle K_{1c} K_{2c}^T\rangle_F}{ \|K_{1c}\|_F \|K_{2c}\|_F }$
where $K_{1c}$ is double-centralized $K_{1}$. The NKA score alway has magnitude that is smaller than 1 and it is independent of the scale of the solution is multiplicative by nature. Let $S(\lambda)$ be the optimum of the objective function for a fixed $\lambda$ then choose the best $\lambda$ that maximize the following criterion:
$\lambda^* = \underset{\lambda\in G} {arg\,max}\rho[S(\lambda),S_0]\times\rho[E_lS(\lambda)E^T_l, K^*_l ]$
G is the set of candidate $\lambda$ 's. The above criterion is an information criterion to measure the quality of solution, and it has several nice properties:

(1) It is scale invariant, so it does not need any additional tuning parameters.

(2). The first terms measures the closeness between $S$ and $S_0$, related to unsupervised structures of kernel matrix; the second term is on the closeness between $E_lSE^T_l$ and $K_l^*$, related to side information. This criteria faithfully reflects what the objective function optimizes but numerically different.

(3). If the value of second term (the NKA) is high, it suggests it is a big chance that there exists a good predictor <ref name="Cortes2010"> Cortes, C., Mohri, M. and Rostamizadeh, A. Two stage learning kernel algorithm. International Conference on Machine Learning, 2010. </ref>.

(4). To obtain the best $\lambda$ does not require the validation (or cross-validation), which is good for small sample problems

## Experiments

The paper compares 7 algorithms on learning low-rank kernel: (1) Nystrom: standard Nstrom method (2) CSI: Choleskey with Side information <ref name="Bach2005"> Bach, F.R and Jordan, M.I. Kernel independent component analysis. International Conference of Machine Learning, 2005. </ref>; (3) Cluster: cluster kernel <ref name="Chapelle2003"> Chapelle, O., Weston, J., Scholkopf, B. Cluster kernels for semi-supervised learning. Advances in Neural Information Processing System 15, 2003. </ref>; (4) Spectral: non-parametric spectral graph kernel <ref name="Zhu2004"> Zhu, X., Kandola, J., Ghahbramani, Z., and Lafferty,J. Nonparametric transforms of graph kernels for semi-supervised learning. Advances in Neural Information Processing Systems 16, 2004 </ref>; (5) TSK: two stage kernel learning algorithm <ref name="Cortes2010"> Reference </ref>; (6) Breg: low-rank kernel learning with Bregman divergence <ref name="Kulis2009"> Kulis, B., Sustik, M.A.,and Dhillon, I.S. Low-rank kernel learning with bregman matrix divergences. Journal of Machine Learning Research, 10:341-376, 2009 </ref> (7) Proposed method.

The benchmark datasets from the SSL data set and the libsvm data. For each data set, the labelled data are picked randomly. Gaussian kernel $K(x_1, x_2) = exp(-\|x_1-x_2\|^2/b)$ is used, the kernel width is chosen as the average pairwise squared distances between samples. Most algorithms can learn the $n \times n$ low rank kernel matrix on labeled and unlabeled samples in the form of $K = A^TA$, which is fed into SVM for classification. The resultant problem will be a linear SVM using A as training/testing samples, note that A does not need to be known, as long as we know the form of K.

On most data sets, algorithms using labels in kernel learning outperform the baseline algorithm (method 1), indicating the value of side information. The proposed approach is competitive with stat-of-the-art kernel learning algorithms with less memory consumption.

Figure1 examines the alignment score used to choose the hyper-parameter λ in Figure 1, the score correlates nicely with the classification accuracy. File:hssunFigure1.png