paper 13

From statwiki
Jump to navigation Jump to search

Random mapping (Random Projection)

For vector [math]\displaystyle{ x \in R^{N} }[/math] and a Random matrix [math]\displaystyle{ R }[/math], [math]\displaystyle{ Rx=y }[/math] is called Random map (projection) of [math]\displaystyle{ x }[/math], where [math]\displaystyle{ y\in R^{M} }[/math]. [math]\displaystyle{ y }[/math] can be written as
[math]\displaystyle{ y= \sum_{i}x_{i}r_{i} }[/math]
where [math]\displaystyle{ r_{i}'s }[/math] are the columns of [math]\displaystyle{ R }[/math]. In Random mapping orthogonal directions are replaced by random directions
By definition the number of columns of matrix [math]\displaystyle{ R }[/math] called the Number of Random projections ([math]\displaystyle{ M }[/math]) required to project a K-dimensional manifold from [math]\displaystyle{ R^{N} }[/math] into [math]\displaystyle{ R^{M} }[/math]
For Large [math]\displaystyle{ N }[/math] applying adaptive (data dependent) algorithms like Isomap, LLE, etc are too much costly and not feasible so, it is reasonable to use random projection (non-adaptive method)to map data into lower dimension ([math]\displaystyle{ M }[/math]) and then apply clustering algorithms and also other dimensionality reduction algorithms, because Random Mapping almost preserves the metric structure. Specially, If [math]\displaystyle{ R }[/math] is orthonormal then similarity under random mapping is exactly preserved.
For example if Cosine of the angle made by two vectors is the measure for their similarity, suppose [math]\displaystyle{ y_{1} }[/math] and [math]\displaystyle{ y_{2} }[/math] are the random maps of [math]\displaystyle{ x_{1} }[/math] and [math]\displaystyle{ x_{2} }[/math] then we have
[math]\displaystyle{ y^{T}_{1}y_{2}=x^{T}_{1}R^{T}Rx_{2} }[/math]
where [math]\displaystyle{ R }[/math] can be written as
[math]\displaystyle{ R^{T}R=I+\epsilon }[/math] in which

[math]\displaystyle{ \epsilon_{ij}= \left\{\begin{matrix} r^{T}_{i}r_{j} & \text{if } j \neq i \\ 0 & \text{if} i=j \end{matrix}\right. }[/math]
now for special case where [math]\displaystyle{ R }[/math] is orthonormal,
[math]\displaystyle{ R^{T}R=I }[/math]
since [math]\displaystyle{ \epsilon_{ij}=r^{T}_{i}r_{j}=0 }[/math]
so
[math]\displaystyle{ y^{T}_{1}y_{2}=x^{T}_{1}x_{2} }[/math]

As an interesting result about random matrix [math]\displaystyle{ R }[/math] it can be shown that when [math]\displaystyle{ M }[/math] (reduced dimension) is large then [math]\displaystyle{ \epsilon_{ij} }[/math] are independently distributed as Normal distribution wit mean [math]\displaystyle{ 0 }[/math] and Variance [math]\displaystyle{ 1/M }[/math]

Estimating The Intrinsic Dimension

Intrinsic Dimension ([math]\displaystyle{ K }[/math]) is an important input for almost all metric learning algorithms. Grossberger-Proccacia (GP) algorithm can help us to estimate [math]\displaystyle{ K }[/math] using [math]\displaystyle{ n }[/math] sample points selected from [math]\displaystyle{ K }[/math]-dimensional manifold embedded in [math]\displaystyle{ R^{N} }[/math]. Define

[math]\displaystyle{ C_{n}(r)=\frac{1}{n(n-1)} \sum_{i\neq j}I \left\|x_{i}-x_{j}\right\|\lt r }[/math]

which basically measures the probability of a random pairs of points havinng distances less than [math]\displaystyle{ r }[/math]. Then estimate of [math]\displaystyle{ K }[/math] is obtained the following formula (known as the Scale-dependent Correlation Dimension of [math]\displaystyle{ X=(x_{1},x_{2},...,x_{n}) }[/math]):

[math]\displaystyle{ \hat{D}_{corr}(r_{1},r_{2})=\frac{log(C_{n}(r_{1}))-log(C_{n}(r_{2}))}{log(r_{1})-log(r_{2})} }[/math]

when [math]\displaystyle{ r_{1} }[/math] and [math]\displaystyle{ r_{2} }[/math] are selected such that they cover the biggest range of graph at which [math]\displaystyle{ C_{n}(r) }[/math] is linear, GP gives the best estimate. another point is that [math]\displaystyle{ \widehat{K} }[/math] obtained GP algorithm based on finite sample points is Biased estimate.

The reason why the above formula works may be explained as follows: if the intrinsic dimensionality is 1 the number of points within a (1-dim) sphere around a given point would be proportional to [math]\displaystyle{ r }[/math]. If the intrinsic dimensionality is 2 and the points are (at least locally) distributed on a plane, then the number of points within a (2-dim) sphere around a given point would be proportional to [math]\displaystyle{ \pi r^2 }[/math], i.e., the area of the corresponding circle. Similarly in the case of a 3-dim sphere we will have proportionality to [math]\displaystyle{ 4/3 \pi r^3 }[/math], and so on. So we if we take logarithm of this function, we expect the number of intrinsic dimensionality to be reflected in the slope of the resulting function, as implied by the the formula above.

Upper and Lower Bound for estimator [math]\displaystyle{ K }[/math]

Baraniuk and Wakin (2007) showed that if the number of random projection [math]\displaystyle{ (M) }[/math] is large we can be sure that the projection of a smooth manifold into lower dimensional space will be a near-isometric embedding. The point is that [math]\displaystyle{ M }[/math] is only neede to have linear relation with [math]\displaystyle{ K }[/math] and Logarithmic relationship with [math]\displaystyle{ N }[/math] like
[math]\displaystyle{ M=c K log(N) }[/math]
Also we mentioned that, since random mapping almost preserves metric structure so we need to estimate the ID of the projected data.
Chinmay Hedg and colleagues (2007) showed that under some conditions, for random orthoprojector [math]\displaystyle{ R }[/math] (consists of [math]\displaystyle{ M }[/math] orthogonalized vectors of length [math]\displaystyle{ N }[/math]) and sequence of sample [math]\displaystyle{ X=\left\{x_{1},x_{2},...\right\} }[/math] it can be shown that with probability [math]\displaystyle{ 1-\rho }[/math]

[math]\displaystyle{ (1-\delta)\hat{K}\leq \hat{K}_{R}\leq (1+\delta)\hat{K} }[/math]

for fix [math]\displaystyle{ 0 \lt \delta \lt 1 }[/math] and [math]\displaystyle{ 0 \lt \rho \lt 1 }[/math]

Lower Bound for the number of Random Projections

We can also find the minimum number of projection needed so that, the difference between residuals obtained from applying Isomap on projected data and original data does not exceed an arbitary value.

Let [math]\displaystyle{ \Gamma=max_{1\leq i,j\leq n}d_{iso}(x_{i},x_{j}) }[/math] and call it Diameter of data set [math]\displaystyle{ X=\left\{x_{1},x_{2},...,x_{n}\right\} }[/math] where [math]\displaystyle{ d_{iso}(x_{i},x_{j}) }[/math] is the Isomap estimate of geodesic distance, and also let [math]\displaystyle{ L }[/math] and [math]\displaystyle{ L_{R} }[/math] be the residuals obtained when Isomap generated [math]\displaystyle{ K }[/math]-dimentional embeding of [math]\displaystyle{ X }[/math] and its progection [math]\displaystyle{ RX }[/math], respectively. then we have,

[math]\displaystyle{ L_{R}\lt L+C(n)\Gamma^{2}\epsilon }[/math]

in which [math]\displaystyle{ C(n) }[/math] is a function of the number of sample points.

practical algorithm to determine [math]\displaystyle{ K }[/math] and [math]\displaystyle{ M }[/math]

The above theorems are not able to give us the estimates without some prior information (for example about the volume and condition number of manifold) so mostly, they are called Theoretical Theorems. There is an algorithm to make it practically possible which is called ML-RP Algorithm.

let [math]\displaystyle{ M=1 }[/math] and select a random vector for [math]\displaystyle{ R }[/math]
While residual variance [math]\displaystyle{ \gt \delta }[/math] do

 [math]\displaystyle{ (i) }[/math] for [math]\displaystyle{ X=\left\{x_{1},x_{2},...,x_{n}\right\} }[/math] estimate [math]\displaystyle{ K }[/math] using GP algorithm.
[math]\displaystyle{ (ii) }[/math] Use Isomap to produce an embedding in to [math]\displaystyle{ \hat{K} }[/math]-dimensional space.
[math]\displaystyle{ (iii) }[/math] let [math]\displaystyle{ M=M+1 }[/math] and add one row to [math]\displaystyle{ R }[/math].


EndWhile
Print [math]\displaystyle{ M }[/math] and [math]\displaystyle{ \hat{K} }[/math]