Inductive Kernel Low-rank Decomposition with Priors: A Generalized Nystrom Method


Introduction

Low-rankness is an important structure widely exploited in machine learning. Low-rank matrix decomposition produces a compact representation of large matrices, which is key to scaling up a great variety of kernel learning algorithms. However, there are still some concerns with existing approaches. First, most of them are intrinsically unsupervised and focus only on the numerical approximation of given matrices, so they cannot incorporate prior knowledge. Second, in many decomposition methods the factorization can only be computed for samples available at the training stage, which makes it difficult to generalize the decomposition to new samples.

This paper introduces a low-rank decomposition algorithm that generalizes the Nystrom method to incorporate side information. The novelty is to interpret the matrix-completion view of the Nystrom method as a bilateral extrapolation of a dictionary kernel, and to generalize it so that prior information can be incorporated when computing improved low-rank decompositions. The authors claim that the two main advantages of the method are its generative structure and its linear complexity in the sample size.

The Nystrom method originated in the numerical solution of integral equations and was introduced to the machine learning community by Williams et al.<ref> Williams, C. and Seeger, M. Using the Nystrom method to speed up kernel machines. Advances in Neural Information Processing Systems 13, 2001. </ref> and Fowlkes et al.<ref> Fowlkes, C., Belongie, S., Chung, F., and Malik, J. Spectral grouping using the Nystrom method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2): 214-225, 2004. </ref>. Given a kernel function [math]\displaystyle{ k(\cdot,\cdot) }[/math] and a sample set with underlying distribution [math]\displaystyle{ p(\cdot) }[/math], the Nystrom method aims at solving the integral equation [math]\displaystyle{ \int k(x,y)p(y)\Phi_i(y)dy = \lambda_i\Phi_i(x) }[/math]. Here [math]\displaystyle{ \Phi_i(x) }[/math] and [math]\displaystyle{ \lambda_i }[/math] are the ith eigenfunction and eigenvalue of the operator [math]\displaystyle{ k(\cdot,\cdot) }[/math] with respect to [math]\displaystyle{ p }[/math].

The idea is to draw a set of [math]\displaystyle{ m }[/math] samples [math]\displaystyle{ Z }[/math], called landmark points, from the underlying distribution and to approximate the expectation with the empirical average, [math]\displaystyle{ \frac{1}{m}\sum_{j=1}^{m}k(x,z_j)\Phi_i(z_j) = \lambda_i\Phi_i(x) }[/math]. By choosing [math]\displaystyle{ x }[/math] to be [math]\displaystyle{ z_1, z_2,\ldots,z_m }[/math] as well, the following eigenvalue decomposition is obtained: [math]\displaystyle{ W\Phi_i = \lambda_i\Phi_i }[/math], where [math]\displaystyle{ W }[/math] is the [math]\displaystyle{ m }[/math] by [math]\displaystyle{ m }[/math] kernel matrix defined on the landmark points, and [math]\displaystyle{ \Phi_i }[/math] (an [math]\displaystyle{ m }[/math] by 1 vector) and [math]\displaystyle{ \lambda_i }[/math] are the ith eigenvector and eigenvalue of [math]\displaystyle{ W }[/math].

In practice, given a large dataset of [math]\displaystyle{ n }[/math] samples, the Nystrom method selects [math]\displaystyle{ m }[/math] landmark points [math]\displaystyle{ Z }[/math] with [math]\displaystyle{ m \ll n }[/math] and computes the eigenvalue decomposition of [math]\displaystyle{ W }[/math]. The eigenvectors of [math]\displaystyle{ W }[/math] are then extrapolated to the whole sample set. The whole [math]\displaystyle{ n }[/math] by [math]\displaystyle{ n }[/math] kernel matrix [math]\displaystyle{ K }[/math] can be implicitly reconstructed as [math]\displaystyle{ K\approx EW^{\dagger}E^{T} }[/math], where [math]\displaystyle{ W^{\dagger} }[/math] is the pseudo-inverse of [math]\displaystyle{ W }[/math] and [math]\displaystyle{ E }[/math] is the [math]\displaystyle{ n }[/math] by [math]\displaystyle{ m }[/math] kernel matrix defined between the sample set and the landmark points. The Nystrom method requires [math]\displaystyle{ O(mn) }[/math] space and [math]\displaystyle{ O(m^2n) }[/math] time, both linear in the sample size.
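The following is a minimal sketch of the standard Nystrom reconstruction [math]\displaystyle{ K\approx EW^{\dagger}E^{T} }[/math] described above, not the paper's generalized method. The RBF kernel, uniform landmark sampling, and all variable names are illustrative assumptions.

<pre>
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Pairwise RBF kernel k(a, b) = exp(-gamma * ||a - b||^2) (illustrative choice)."""
    sq_dists = (np.sum(A ** 2, axis=1)[:, None]
                + np.sum(B ** 2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    return np.exp(-gamma * sq_dists)

def nystrom_approximation(X, m, gamma=1.0, seed=None):
    """Approximate the n-by-n kernel matrix K by E W^+ E^T using m landmark points."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    idx = rng.choice(n, size=m, replace=False)   # sample landmark points Z uniformly
    Z = X[idx]
    W = rbf_kernel(Z, Z, gamma)                  # m-by-m kernel on landmarks
    E = rbf_kernel(X, Z, gamma)                  # n-by-m kernel between samples and landmarks
    W_pinv = np.linalg.pinv(W)                   # pseudo-inverse W^+
    return E @ W_pinv @ E.T                      # implicit reconstruction of K

# Usage: compare the Nystrom reconstruction against the exact kernel matrix.
X = np.random.default_rng(0).normal(size=(500, 10))
K_exact = rbf_kernel(X, X)
K_approx = nystrom_approximation(X, m=50, seed=0)
print(np.linalg.norm(K_exact - K_approx, 'fro') / np.linalg.norm(K_exact, 'fro'))
</pre>

Note that only [math]\displaystyle{ W }[/math] and [math]\displaystyle{ E }[/math] need to be formed and stored, which is the source of the [math]\displaystyle{ O(mn) }[/math] space and [math]\displaystyle{ O(m^2n) }[/math] time costs; the full kernel matrix is materialized here only to measure the approximation error.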

Generalized Nystrom Low-rank Decomposition

References

<references />