stat441w18/summary 1
Random features for large scale kernel machines
Group members
Faith Lee
Jacov Lisulov
Shiwei Gong
Introduction and problem motivation
In classification problems, kernel methods are used for pattern analysis and require only a user-specified kernel. Kernel methods can be thought of as instance-based learners: instead of learning a fixed set of parameters, they "remember" each training example [math]\displaystyle{ x_i }[/math] and learn a corresponding weight [math]\displaystyle{ \alpha_i }[/math]. A prediction for an unseen example is then made with a similarity function k (also called a kernel), evaluated between this unseen example and each of the training inputs. A similarity function measures the similarity between two objects. By conventional notation, we have that
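[math]\displaystyle{ f(x; \alpha) = \sum_{i = 1}^{N} \alpha_i k(x, x_i) }[/math] where k is a kernel function and [math]\displaystyle{ \alpha }[/math] are the corresponding weights. The paper approximates the kernel by a sum over D randomized features, [math]\displaystyle{ k(x, x') \approx \sum_{j = 1}^{D} Z(x; W_j)Z(x'; W_j) }[/math], where the [math]\displaystyle{ W_j }[/math] are randomly drawn parameters.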
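The predictor above can be sketched in a few lines of Python. This is only an illustration: the Gaussian kernel, the value of `gamma`, the toy training points, and the weights are all assumptions chosen for the example, and the weights are fixed by hand rather than learned.

```python
import math

def gaussian_kernel(x, y, gamma=0.5):
    """k(x, y) = exp(-gamma * ||x - y||^2); depends only on x - y."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def kernel_predict(x, train_points, alpha, kernel=gaussian_kernel):
    """Evaluate f(x; alpha) = sum_i alpha_i * k(x, x_i)."""
    return sum(a_i * kernel(x, x_i) for a_i, x_i in zip(alpha, train_points))

# Hypothetical training inputs and weights (not taken from the paper).
train_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
alpha = [1.0, -0.5, 0.25]

f_origin = kernel_predict((0.0, 0.0), train_points, alpha)
print(f_origin)
```

Note that every prediction touches all N training points, which is the cost the paper sets out to avoid.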
An example of a kernel method is the support vector machine. Kernel methods provide a means of approximating a non-linear function or decision boundary. However, kernel methods have two drawbacks:
1) They scale poorly with the size of the dataset.
2) They are computationally expensive (kernel machines produce large matrices whose entries are kernel evaluations between pairs of training points).
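To make the scaling issue concrete, the sketch below compares the storage needed for an N x N kernel matrix with that of an N x D matrix of explicit features. The 8 bytes per entry (double precision) and the choice D = 500 are assumptions made for illustration only.

```python
def gram_matrix_bytes(n_samples, bytes_per_entry=8):
    """Storage for the full N x N matrix of pairwise kernel values."""
    return n_samples * n_samples * bytes_per_entry

def feature_matrix_bytes(n_samples, n_features, bytes_per_entry=8):
    """Storage for N feature vectors of length D."""
    return n_samples * n_features * bytes_per_entry

for n in (1_000, 10_000, 100_000):
    gram_gb = gram_matrix_bytes(n) / 1e9
    feat_gb = feature_matrix_bytes(n, n_features=500) / 1e9
    print(f"N={n}: kernel matrix {gram_gb} GB, feature matrix {feat_gb} GB")
```

The kernel matrix grows quadratically in N, while the feature matrix grows only linearly.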
In this paper, the authors propose mapping the input data to a randomized low-dimensional feature space and then applying existing linear methods. The main goal is to reduce the computational bottleneck of kernel-based inference methods.
Methods
The Fourier transform of a function f is defined as [math]\displaystyle{ \hat{f}(\xi) = \int_{-\infty}^{\infty} f(x) e^{-2\pi i x \xi} dx }[/math].
The authors make use of Bochner's theorem: a continuous shift-invariant kernel is positive definite if and only if it is the Fourier transform of a non-negative measure. For any shift-invariant kernel k(x-y), we can therefore build a low-dimensional randomized feature map Z so that [math]\displaystyle{ \forall x, y }[/math] [math]\displaystyle{ k(x,y) = \langle \phi(x), \phi(y) \rangle \approx Z(x)^TZ(y) }[/math], where [math]\displaystyle{ \phi }[/math] is the feature map implicit in the kernel and the last term is the dot product of the featurized inputs.
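The approximation can be checked numerically with a minimal sketch. The summary above does not spell out the form of Z, so the code assumes the standard random Fourier feature construction for a Gaussian kernel [math]\displaystyle{ k(x,y) = e^{-\gamma \|x-y\|^2} }[/math]: frequencies are drawn from a normal distribution with variance [math]\displaystyle{ 2\gamma }[/math], phases are drawn uniformly from [math]\displaystyle{ [0, 2\pi] }[/math], and each feature is a scaled cosine. The function name `make_random_features`, the test points, and the parameter values are hypothetical.

```python
import math
import random

def make_random_features(dim, n_features, gamma=0.5, seed=0):
    """Return a map z with z(x) . z(y) ~ exp(-gamma * ||x - y||^2).

    Assumed construction: z_j(x) = sqrt(2 / D) * cos(w_j . x + b_j),
    with w_j ~ N(0, 2 * gamma * I) and b_j ~ Uniform[0, 2 * pi].
    """
    rng = random.Random(seed)
    std = math.sqrt(2.0 * gamma)
    weights = [[rng.gauss(0.0, std) for _ in range(dim)]
               for _ in range(n_features)]
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def z(x):
        return [scale * math.cos(sum(w_k * x_k for w_k, x_k in zip(w, x)) + b)
                for w, b in zip(weights, offsets)]

    return z

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, y = (0.3, -0.2), (0.1, 0.4)
z = make_random_features(dim=2, n_features=5000, gamma=0.5)

approx = dot(z(x), z(y))  # Z(x)^T Z(y)
exact = math.exp(-0.5 * sum((a - b) ** 2 for a, b in zip(x, y)))  # k(x, y)
print(approx, exact)
```

The two printed values should agree more closely as `n_features` grows. Once the inputs are featurized this way, any linear method can be trained on z(x) directly.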