summary

From statwiki
Revision as of 21:11, 16 July 2013 by Lxin (talk | contribs)

Dimensionality Reduction by Learning an Invariant Mapping


1. Intention

The drawbacks of most existing techniques:

1 Most of them depend on a meaningful and computable distance metric in the input space (e.g., LLE and Isomap rely on a computable distance).

2 They do not compute a “function” that can accurately map new input samples whose relationship to the training data is unknown.

To overcome these drawbacks, this paper introduces a technique called DrLIM. The learning relies solely on neighborhood relationships and does not require any distance measure in the input space.

2. Mathematical Model

Input: A set of vectors [math]\displaystyle{ I=\{x_1,x_2,\dots,x_p\} }[/math], where [math]\displaystyle{ x_i\in \mathbb{R}^D, \forall i=1,2,\dots,p. }[/math]

Output: A parametric function [math]\displaystyle{ G_W:\mathbb{R}^D \rightarrow \mathbb{R}^d }[/math] with [math]\displaystyle{ d \ll D }[/math] that satisfies the following 3 properties:

1 Simple distance measures in the output space should approximate the neighborhood relationships in the input space.

2 The mapping should not be constrained to implementing simple distance measures in the input space and should be able to learn invariances to complex transformations.

3 It should be faithful even for samples whose neighborhood relationships are unknown.

Building the mathematical model: We want to map the high-dimensional data points to a lower-dimensional space in which similar points form clusters. The similarities are given by prior knowledge rather than by a distance metric: set Y=0 if [math]\displaystyle{ x_1 }[/math] and [math]\displaystyle{ x_2 }[/math] are deemed similar; otherwise set Y=1. This clustering problem can be reduced to an optimization problem. Define the following functions:

[math]\displaystyle{ D_W^i(x_1,x_2)=\left \| G_W(x_1)-G_W(x_2)\right \|_2 \qquad (1) }[/math]

[math]\displaystyle{ L(W,(Y,x_1,x_2)^i)=(1-Y)L_S(D_W^i)+YL_D(D_W^i) \qquad (2) }[/math]

[math]\displaystyle{ L(W)= \sum^P_{i=1} L(W,(Y,x_1,x_2)^i) \qquad (3) }[/math]

where [math]\displaystyle{ (Y,x_1,x_2)^i }[/math] is the [math]\displaystyle{ i }[/math]-th labeled sample pair, [math]\displaystyle{ L_S }[/math] is the partial loss function for points in the same cluster, and [math]\displaystyle{ L_D }[/math] is the partial loss function for points in different clusters. [math]\displaystyle{ L_S }[/math] and [math]\displaystyle{ L_D }[/math] are chosen by the designer and must be designed such that minimizing [math]\displaystyle{ L }[/math] with respect to [math]\displaystyle{ W }[/math] results in low values of [math]\displaystyle{ D_W }[/math] for similar pairs and high values of [math]\displaystyle{ D_W }[/math] for dissimilar pairs. Intuitively, minimizing [math]\displaystyle{ L }[/math] builds up clusters over the input data set in the low-dimensional space. The exact loss function is

[math]\displaystyle{ L(W, (Y, x_1, x_2))=(1-Y)\frac{1}{2}(D_W)^2+(Y)\frac{1}{2}\{\max(0, m-D_W)\}^2 }[/math]

where m>0 is a margin.
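The contrastive loss above is straightforward to compute directly. The following is a minimal NumPy sketch (not code from the paper); the function name and defaults are illustrative:

```python
import numpy as np

def contrastive_loss(g1, g2, y, m=1.0):
    """Contrastive loss for one pair of mapped points.

    g1, g2 : outputs G_W(x1), G_W(x2) as 1-D arrays
    y      : 0 if the pair is similar, 1 if dissimilar
    m      : margin (m > 0); dissimilar pairs closer than m are penalized
    """
    d = np.linalg.norm(g1 - g2)  # D_W(x1, x2), Euclidean distance in output space
    # Similar pairs (y=0) are penalized by squared distance;
    # dissimilar pairs (y=1) are penalized only inside the margin.
    return (1 - y) * 0.5 * d**2 + y * 0.5 * max(0.0, m - d)**2
```

Note that a dissimilar pair already farther apart than the margin [math]\displaystyle{ m }[/math] contributes zero loss, so the mapping is not forced to push dissimilar points arbitrarily far apart.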

Algorithm:

Step 1: For each input sample [math]\displaystyle{ x_i }[/math], do the following:

(a) Use prior knowledge to find the set of samples [math]\displaystyle{ S_{x_i}=\{x_j\}_{j=1}^p }[/math], where each [math]\displaystyle{ x_j }[/math] is deemed similar to [math]\displaystyle{ x_i }[/math], for [math]\displaystyle{ 1 \leqslant i \leqslant p }[/math]

(b) Pair the sample [math]\displaystyle{ x_i }[/math] with the other training samples and label the pairs so that [math]\displaystyle{ Y_{ij}=0 }[/math] if [math]\displaystyle{ x_j \in S_{x_i} }[/math] and [math]\displaystyle{ Y_{ij}=1 }[/math] otherwise.
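Step 1 can be sketched as follows. This is an illustrative helper (not from the paper), assuming the similar sets [math]\displaystyle{ S_{x_i} }[/math] are supplied as index sets from prior knowledge:

```python
import itertools

def label_pairs(n, similar_sets):
    """Label every pair (i, j) with i < j: Y=0 if similar, Y=1 otherwise.

    similar_sets[i] is the set S_{x_i} of indices deemed similar to x_i,
    supplied by prior knowledge (e.g., class labels or known transformations).
    """
    labels = {}
    for i, j in itertools.combinations(range(n), 2):
        # A pair is similar if either sample lists the other as a neighbor.
        labels[(i, j)] = 0 if (j in similar_sets[i] or i in similar_sets[j]) else 1
    return labels

# Three samples, where only samples 0 and 1 are deemed similar:
pairs = label_pairs(3, {0: {1}, 1: {0}, 2: set()})
# {(0, 1): 0, (0, 2): 1, (1, 2): 1}
```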

Step 2: Repeat until convergence:

(a) For each pair [math]\displaystyle{ (x_i,x_j) }[/math] in the training set, do

i. If [math]\displaystyle{ Y_{ij}=0 }[/math], then update W to decrease [math]\displaystyle{ D_W=\left \| G_W(x_i)-G_W(x_j) \right \|_2 }[/math]

ii. If [math]\displaystyle{ Y_{ij}=1 }[/math], then update W to increase [math]\displaystyle{ D_W=\left \| G_W(x_i)-G_W(x_j) \right \|_2 }[/math]
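Step 2 amounts to gradient descent on the contrastive loss. Below is a minimal sketch of the training loop, assuming for brevity a linear map [math]\displaystyle{ G_W(x) = Wx }[/math] (the paper uses neural networks; all names and hyperparameters here are illustrative):

```python
import numpy as np

def train_drlim(X, labels, d_out=2, m=1.0, lr=0.05, epochs=200, seed=0):
    """DrLIM-style training loop with a linear map G_W(x) = W x.

    X      : (p, D) array of input samples
    labels : dict {(i, j): Y} from Step 1 (Y=0 similar, Y=1 dissimilar)
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(d_out, X.shape[1]))
    for _ in range(epochs):
        for (i, j), y in labels.items():
            diff = W @ (X[i] - X[j])             # G_W(x_i) - G_W(x_j)
            d = np.linalg.norm(diff) + 1e-12     # D_W, kept away from zero
            grad_d = np.outer(diff / d, X[i] - X[j])  # dD_W / dW
            if y == 0:
                W -= lr * d * grad_d             # decrease D_W for similar pairs
            elif d < m:
                W += lr * (m - d) * grad_d       # increase D_W inside the margin
    return W
```

The two branches mirror Step 2(a): gradient descent on [math]\displaystyle{ \frac{1}{2}D_W^2 }[/math] pulls similar pairs together, and descent on [math]\displaystyle{ \frac{1}{2}(m-D_W)^2 }[/math] pushes dissimilar pairs apart, but only while they are closer than the margin.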

3. Experiments

What is [math]\displaystyle{ G_W }[/math] in practice?

The experiments involving airplane images from the NORB dataset use a 2-layer fully connected neural network as [math]\displaystyle{ G_W }[/math].

The experiments on the MNIST dataset use a convolutional network as [math]\displaystyle{ G_W }[/math]. The neighborhood graph is generated from Euclidean distances (5 nearest neighbors).
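Such a nearest-neighbor graph can be built directly from pairwise Euclidean distances. A brute-force NumPy sketch (illustrative, not the paper's code) that produces the similar sets [math]\displaystyle{ S_{x_i} }[/math] used in Step 1:

```python
import numpy as np

def knn_similarity_sets(X, k=5):
    """For each sample, take its k nearest neighbors (Euclidean distance)
    as the similar set S_{x_i}; all remaining pairs get Y=1."""
    # Pairwise squared Euclidean distances, shape (n, n).
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(sq, np.inf)  # exclude each sample from its own neighbors
    # For each row, indices of the k smallest distances.
    return {i: set(np.argsort(sq[i])[:k]) for i in range(len(X))}
```

Brute force is [math]\displaystyle{ O(n^2 D) }[/math]; for large datasets a spatial index (e.g., a k-d tree) would be used instead.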