Inductive Kernel Low-rank Decomposition with Priors: A Generalized Nystrom Method

Introduction

Low-rank structure is widely exploited in machine learning. Low-rank matrix decomposition produces a compact representation of large matrices, which is the key to scaling up a great variety of kernel learning algorithms. However, existing approaches have some limitations. First, most of them are intrinsically unsupervised: they focus only on the numerical approximation of a given matrix and cannot incorporate prior knowledge. Second, for many decomposition methods the factorization can only be computed for samples available in the training stage, which makes it difficult to generalize the decomposition to new samples.

This paper introduces a low-rank decomposition algorithm that generalizes the Nystrom method to incorporate side information. The novelty is to interpret the matrix completion view of the Nystrom method as a bilateral extrapolation of a dictionary kernel, and to generalize this view so that prior information can be used to compute improved low-rank decompositions. The authors claim two advantages of the method: its generative structure and its linear complexity in the sample size.

The Nystrom method originated from solving integral equations and was introduced to the machine learning community by Williams et al.<ref> Williams, C. and Seeger, M. Using the Nystrom method to speed up kernel machines. Advances in Neural Information Processing Systems 13, 2001. </ref> and Fowlkes et al.<ref> Fowlkes, C., Belongie, S., Chung, F., and Malik, J. Spectral grouping using the Nystrom method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2): 214-225, 2004. </ref> Given a kernel function [math]\displaystyle{ k(\cdot,\cdot) }[/math] and a sample set with underlying distribution [math]\displaystyle{ p(\cdot) }[/math], the Nystrom method aims at solving the following integral equation

[math]\displaystyle{ \int k(x,y)p(y)\Phi_i(y)dy = \lambda_i\Phi_i(x) }[/math]

Here [math]\displaystyle{ \Phi_i(x) }[/math] and [math]\displaystyle{ \lambda_i }[/math] are the i-th eigenfunction and eigenvalue of the operator [math]\displaystyle{ k(\cdot,\cdot) }[/math] with respect to [math]\displaystyle{ p }[/math]. The idea is to draw a set of [math]\displaystyle{ m }[/math] samples [math]\displaystyle{ Z }[/math], called landmark points, from the underlying distribution and approximate the expectation with the empirical average as

[math]\displaystyle{ \frac{1}{m}\sum_{j=1}^{m}k(x,z_j)\Phi_i(z_j) = \lambda_i\Phi_i(x) }[/math]

By choosing [math]\displaystyle{ x }[/math] from [math]\displaystyle{ z_1, z_2,...,z_m }[/math] as well, the following eigenvalue decomposition is obtained: [math]\displaystyle{ W\Phi_i = \lambda_i\Phi_i }[/math], where [math]\displaystyle{ W }[/math] is the [math]\displaystyle{ m }[/math] by [math]\displaystyle{ m }[/math] kernel matrix defined on the landmark points, and the m by 1 vector [math]\displaystyle{ \Phi_i }[/math] and the scalar [math]\displaystyle{ \lambda_i }[/math] are the i-th eigenvector and eigenvalue of [math]\displaystyle{ W }[/math]. In practice, given a large dataset of [math]\displaystyle{ n }[/math] samples, the Nystrom method selects [math]\displaystyle{ m }[/math] landmark points [math]\displaystyle{ Z }[/math] with [math]\displaystyle{ m \ll n }[/math] and computes the eigenvalue decomposition of [math]\displaystyle{ W }[/math]. The eigenvectors of [math]\displaystyle{ W }[/math] are then extrapolated to the whole sample set. The whole n by n kernel matrix [math]\displaystyle{ K }[/math] can be implicitly reconstructed by

[math]\displaystyle{ K\approx EW^{\dagger}E^{T} }[/math]

where [math]\displaystyle{ W^{\dagger} }[/math] is the pseudo-inverse of [math]\displaystyle{ W }[/math], and [math]\displaystyle{ E }[/math] is the n by m kernel matrix defined between the sample set and the landmark points. The Nystrom method requires [math]\displaystyle{ O(mn) }[/math] space and [math]\displaystyle{ O(m^2n) }[/math] time, both of which are linear in the sample size.
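As a rough illustration of the reconstruction above, the following NumPy sketch builds E and W for a Gaussian kernel and forms the product of E, the pseudo-inverse of W, and the transpose of E. The helper names, the toy data, the kernel width b = 10 and the uniform random choice of landmarks are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def gaussian_kernel(X, Y, b):
    """Gaussian kernel matrix with entries exp(-||x - y||^2 / b)."""
    sq = (X ** 2).sum(axis=1)[:, None] + (Y ** 2).sum(axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / b)

def nystrom_factors(X, Z, b):
    """Return E (n x m) and pinv(W) (m x m) for samples X and landmarks Z."""
    E = gaussian_kernel(X, Z, b)   # kernel between all samples and the landmarks
    W = gaussian_kernel(Z, Z, b)   # kernel among the landmarks
    return E, np.linalg.pinv(W)

# Toy data (assumption): 200 samples in 5 dimensions, 40 random landmarks.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
idx = rng.choice(200, size=40, replace=False)
E, W_pinv = nystrom_factors(X, X[idx], b=10.0)

# The n x n matrix is formed explicitly only because this demo is small;
# in practice one would keep E and W_pinv and never build K_approx.
K_approx = E @ W_pinv @ E.T
K_exact = gaussian_kernel(X, X, b=10.0)
rel_err = np.linalg.norm(K_exact - K_approx) / np.linalg.norm(K_exact)
```

Only E and the m by m pseudo-inverse need to be stored, which is where the O(mn) space requirement comes from.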

Generalized Nystrom Low-rank Decomposition

Bilateral Extrapolation of Dictionary Kernel

Including Side Information

Side Information as Grouping Constraints

Optimization

Initialization

Landmark Selection

The selection of landmark points [math]\displaystyle{ Z }[/math] in the Nystrom method can greatly affect its performance. The authors use the k-means based sampling scheme of Zhang & Kwok<ref> Zhang, K. and Kwok, J. Clustered Nystrom method for large scale manifold learning and dimension reduction. IEEE Transactions on Neural Networks 21:1576-1587, 2010 </ref>, which first groups the data with k-means clustering and then picks the centroid of each cluster as a landmark for the Nystrom method.
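The idea of using cluster centroids as landmarks can be sketched with a plain Lloyd-style k-means loop. This is a minimal illustration under assumptions: the function name, the fixed number of iterations and the random initialization are choices made here, not the exact procedure of Zhang & Kwok.

```python
import numpy as np

def kmeans_landmarks(X, m, n_iter=20, seed=0):
    """Return m landmark points as k-means centroids of the rows of X."""
    rng = np.random.default_rng(seed)
    # Initialize the centroids with m distinct samples (assumption).
    centers = X[rng.choice(len(X), size=m, replace=False)].copy()
    for _ in range(n_iter):
        # Squared distances between every sample and every centroid (n x m).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(m):
            members = X[labels == j]
            if len(members) > 0:      # keep the old centroid if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
Z = kmeans_landmarks(X, m=10)
```

Each iteration costs O(mn) distance evaluations, so this step stays linear in the sample size.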

Complexities

The space complexity of the proposed algorithm is [math]\displaystyle{ O(mn) }[/math], where n is the sample size and m is the number of landmark points. Computationally, it requires repeated eigenvalue decompositions of [math]\displaystyle{ m \times m }[/math] matrices, and a single multiplication between the [math]\displaystyle{ n \times m }[/math] extrapolation E and the [math]\displaystyle{ m \times m }[/math] dictionary kernel S. The overall complexity is [math]\displaystyle{ O(m^2n)+O(t\log(\mu_{\max})m^3) }[/math], where t is the number of gradient mapping iterations and [math]\displaystyle{ \mu_{\max} }[/math] is the maximum eigenvalue of the Hessian. The algorithm therefore has time and space complexity linear in the sample size n.
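To make the cost of the final multiplication concrete, the sketch below turns an extrapolation E and a dictionary kernel S into an explicit low-rank factor. It assumes S is symmetric positive semidefinite; the function name and the random test matrices are hypothetical and only serve the illustration.

```python
import numpy as np

def low_rank_factor(E, S):
    """Return A (m x n) such that A.T @ A equals E @ S @ E.T for symmetric PSD S."""
    S = (S + S.T) / 2.0                          # guard against round-off asymmetry
    evals, evecs = np.linalg.eigh(S)             # O(m^3) eigenvalue decomposition
    evals = np.clip(evals, 0.0, None)            # drop tiny negative eigenvalues
    S_half = (evecs * np.sqrt(evals)) @ evecs.T  # symmetric square root of S
    return S_half @ E.T                          # O(m^2 n) multiplication

rng = np.random.default_rng(0)
n, m = 500, 20
E = rng.normal(size=(n, m))
B = rng.normal(size=(m, m))
S = B @ B.T                                      # random PSD dictionary kernel
A = low_rank_factor(E, S)
```

The n by n matrix is never formed: only the m by n factor A is stored, which matches the O(mn) space bound.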

Selecting Hyper-parameter

The hyper-parameter [math]\displaystyle{ \lambda }[/math] can be difficult to choose if the side information is limited, so the authors propose a heuristic. The two residuals [math]\displaystyle{ S_0 -S }[/math] and [math]\displaystyle{ E_lSE_l^T - K_l^* }[/math] in the objective function are additive and require a trade-off parameter [math]\displaystyle{ \lambda }[/math]. The heuristic is based on the normalized kernel alignment (NKA)<ref name="Cortes2010"> Cortes, C., Mohri, M. and Rostamizadeh, A. Two-stage learning kernel algorithms. International Conference on Machine Learning, 2010. </ref> between two kernel matrices,

[math]\displaystyle{ \rho[K_1, K_2] = \frac { \langle K_{1c}, K_{2c} \rangle_F}{ \|K_{1c}\|_F \|K_{2c}\|_F } }[/math]

where [math]\displaystyle{ K_{1c} }[/math] is the double-centered version of [math]\displaystyle{ K_{1} }[/math], and similarly for [math]\displaystyle{ K_{2c} }[/math]. The NKA score always has magnitude at most 1 and is independent of the scale of the solution, so scores are naturally combined by multiplication. Let [math]\displaystyle{ S(\lambda) }[/math] be the optimum of the objective function for a fixed [math]\displaystyle{ \lambda }[/math]; the best [math]\displaystyle{ \lambda }[/math] is then chosen as

[math]\displaystyle{ \lambda^* = \underset{\lambda\in G} {arg\,max}\rho[S(\lambda),S_0]\times\rho[E_lS(\lambda)E^T_l, K^*_l ] }[/math]

Here G is the set of candidate values of [math]\displaystyle{ \lambda }[/math]. The first term measures the closeness between [math]\displaystyle{ S }[/math] and [math]\displaystyle{ S_0 }[/math], which relates to the unsupervised structure of the kernel matrix; the second term measures the closeness between [math]\displaystyle{ E_lSE^T_l }[/math] and [math]\displaystyle{ K_l^* }[/math], which relates to the side information. This criterion faithfully reflects what the objective function optimizes but is numerically different from it. It serves as an information criterion for measuring the quality of the solution.
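The selection rule can be sketched as follows. The NKA function follows the definition above; the `solve` argument is a hypothetical callable standing in for the optimizer that returns the optimum S for a given value of the trade-off parameter, which is not specified in this summary.

```python
import numpy as np

def center(K):
    """Double-center a square kernel matrix: H K H with H = I - (1/n) 1 1^T."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def nka(K1, K2):
    """Normalized kernel alignment between two kernel matrices of equal size."""
    K1c, K2c = center(K1), center(K2)
    # np.sum(K1c * K2c) is the Frobenius inner product <K1c, K2c>_F.
    return np.sum(K1c * K2c) / (np.linalg.norm(K1c) * np.linalg.norm(K2c))

def select_lambda(candidates, solve, S0, E_l, K_l):
    """Pick the candidate maximizing the product of the two alignment scores.

    `solve(lam)` is a placeholder (assumption) for the solver returning S(lam).
    """
    best_lam, best_score = None, -np.inf
    for lam in candidates:
        S = solve(lam)
        score = nka(S, S0) * nka(E_l @ S @ E_l.T, K_l)
        if score > best_score:
            best_lam, best_score = lam, score
    return best_lam
```

Because both factors are scale-free, rescaling S for a given candidate does not change which candidate is selected.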

Experiments

The paper compares seven algorithms for learning low-rank kernels: (1) Nystrom: the standard Nystrom method; (2) CSI: Cholesky with Side Information <ref name="Bach2005"> Bach, F.R. and Jordan, M.I. Predictive low-rank decomposition for kernel methods. International Conference on Machine Learning, 2005. </ref>; (3) Cluster: cluster kernel <ref name="Chapelle2003"> Chapelle, O., Weston, J., and Scholkopf, B. Cluster kernels for semi-supervised learning. Advances in Neural Information Processing Systems 15, 2003. </ref>; (4) Spectral: non-parametric spectral graph kernel <ref name="Zhu2004"> Zhu, X., Kandola, J., Ghahramani, Z., and Lafferty, J. Nonparametric transforms of graph kernels for semi-supervised learning. Advances in Neural Information Processing Systems 16, 2004. </ref>; (5) TSK: two-stage kernel learning algorithm <ref name="Cortes2010" />; (6) Breg: low-rank kernel learning with Bregman divergence <ref name="Kulis2009"> Kulis, B., Sustik, M.A., and Dhillon, I.S. Low-rank kernel learning with Bregman matrix divergences. Journal of Machine Learning Research, 10:341-376, 2009. </ref>; (7) the proposed method.

The benchmark data sets are taken from the SSL data sets and the LIBSVM data. For each data set, the labelled data are picked randomly. The Gaussian kernel [math]\displaystyle{ K(x_1, x_2) = \exp(-\|x_1-x_2\|^2/b) }[/math] is used, with the kernel width [math]\displaystyle{ b }[/math] chosen as the average pairwise squared distance between samples. Most algorithms can learn the [math]\displaystyle{ n \times n }[/math] low-rank kernel matrix on labeled and unlabeled samples in the form [math]\displaystyle{ K = A^TA }[/math], which is fed into an SVM for classification. The resulting problem is a linear SVM that uses the columns of A as training/testing samples. Note that A does not need to be known explicitly, as long as the form of K is known.
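The kernel width rule can be sketched as below. Whether the average runs over distinct pairs or over all ordered pairs including the zero diagonal is not stated here; this example assumes distinct pairs, and the function names are hypothetical.

```python
import numpy as np

def pairwise_sq_dists(X):
    """Matrix of squared Euclidean distances between the rows of X."""
    sq = (X ** 2).sum(axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.maximum(D, 0.0)   # clip small negatives caused by round-off

def gaussian_kernel_avg_width(X):
    """Gaussian kernel exp(-||x_i - x_j||^2 / b) with b set to the mean squared distance."""
    D = pairwise_sq_dists(X)
    n = len(X)
    b = D.sum() / (n * (n - 1))  # average over distinct pairs (assumption)
    return np.exp(-D / b), b

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
K, b = gaussian_kernel_avg_width(X)
```

Computing all pairwise distances costs O(n^2), so for large data sets the width would more plausibly be estimated on a random subsample; that variation is likewise an assumption rather than something stated in the paper summary.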

On most data sets, algorithms that use labels in kernel learning outperform the baseline algorithm (method 1), indicating the value of side information. The proposed approach is competitive with state-of-the-art kernel learning algorithms while consuming less memory.

File:hssunTable1.png

Figure 1 examines the alignment score used to choose the hyper-parameter λ; the score correlates well with the classification accuracy.

File:hssunFigure1.png

References

<references />