summary
Dimensionality Reduction by Learning an Invariant Mapping
1. Intention
The drawbacks of most existing technique:
1 Most of them depend on a meaningful and computable distance metric in input space. (eg. LLE, Isomap relies on computable distance)
2 They do not compute a “function” that can accurately map new input samples whose relationship to the training data is unknown.
To overcome these drawbacks, this paper introduces a technique called DrLIM. The learning relies solely on neighborhood relationships and does not require any distance measure in the input space.
2. Mathematical Model
Input: A set of vectors [math]\displaystyle{ I=\{x_1,x_2,......,x_p\} }[/math], where [math]\displaystyle{ x_i\in \mathbb{R}^D, \forall i=1,2,3......,n. }[/math]
Output: A parametric function [math]\displaystyle{ G_W:\mathbb{R}^D \rightarrow \mathbb{R}^d }[/math] with [math]\displaystyle{ d\lt \lt D }[/math] such that it satisfies the following 3 properties:1.Simple distance measures in the output space should approximate the neighborhood relationships in the input space; 2.The mapping should not be constrained to implementing simple distance measures in the input space and should be able to learn invariances to complex transformations; 3.It should faithful even for samples whose neighborhood relationship are unknown.
Build up mathematical model: By clustering the data points in high dimension, we could map the high dimensional data points to lower dimension. We could build up clusters according to similarities of the vectors. The similarities are measured by prior knowledge. Set Y=0 if [math]\displaystyle{ x_1 }[/math] and [math]\displaystyle{ x_2 }[/math] are in the same cluster; otherwise set Y=1. This clustering problem can be reduced to an optimization problem. Define following functions:
[math]\displaystyle{ D_W^i(x_1,x_2)=\left \| G_W(x_1)-G_W(x_2)\right \|_2 \qquad (1) }[/math]
[math]\displaystyle{ l(W,(Y,x_1,x_2)^i))=(1-Y)L_S(D_W^i)+YL_D(D_W^i) \qquad (2) }[/math]
[math]\displaystyle{ L(W)= \sum^P_{i=1} l(W,(Y,x_1,x_2)^i) \qquad (3) }[/math]
where [math]\displaystyle{ (Y,x_1,x_2)^i }[/math] is the [math]\displaystyle{ i }[/math]-th labeled sample pair, [math]\displaystyle{ L_S }[/math] is the partial loss function for the points in the same cluster, [math]\displaystyle{ L_D }[/math] is the partial loss function for points in different clusters. [math]\displaystyle{ L_S }[/math] and [math]\displaystyle{ L_D }[/math] are determined by ourselves and must be designed such that minimizing [math]\displaystyle{ L }[/math] respect to W results in low values of [math]\displaystyle{ D_W }[/math] for similar pairs and high values of [math]\displaystyle{ D_W }[/math] for disimilar pairs. The intuition is, minimizing the function [math]\displaystyle{ L(\ ) }[/math] could build up clusters over input data set.
Algorithm: Step 1: For each input sample [math]\displaystyle{ x_i }[/math], do the following:
(a)Using prior knowledge to find the set of samples [math]\displaystyle{ S_{x_i}=\{x_j\}_{j=1}^p }[/math] where [math]\displaystyle{ x_j }[/math] is similar to [math]\displaystyle{ x_i }[/math] for [math]\displaystyle{ 1 \leqslant i \leqslant p }[/math]
(b)Pair the sample [math]\displaystyle{ x_i }[/math] with the other training samples and label the pairs so that:[math]\displaystyle{ Y_ij=0 }[/math] if [math]\displaystyle{ x_j \in S_{x_i} }[/math] and [math]\displaystyle{ Y_{ij}=1 }[/math] otherwise.
Step 2: Repeat until convergence:
(a) For each pair [math]\displaystyle{ (x_i,x_j) }[/math] in the training set, do
i. If [math]\displaystyle{ Y_{ij}=0 }[/math], then update W to decrease [math]\displaystyle{ D_W=\left \| G_W(x_i)-G_W(x_j) \right \|_2 }[/math]
ii. If [math]\displaystyle{ Y_{ij}=1 }[/math], then update W to increase [math]\displaystyle{ D_W=\left \| G_W(x_i)-G_W(x_j) \right \|_2 }[/math]