summary: Difference between revisions

From statwiki
Jump to navigation Jump to search
No edit summary
Line 21: Line 21:


Solution:
Solution:
Build up clusters according to similarities of the vectors. The similarities are measured by prior knowledge. Set Y=0 if <math> x_1 </math> and <math> x_2 </math> are in the same cluster; otherwise set Y=1. Define following functions:
The high dimensional data points could be mapped to lower dimension in several clusters. We could build up clusters according to similarities of the vectors. The similarities are measured by prior knowledge. Set Y=0 if <math> x_1 </math> and <math> x_2 </math> are in the same cluster; otherwise set Y=1. This clustering problem can be reduced to an optimization problem. Define following functions:
<center>
<center>
<math> D_W^i(x_1,x_2)=\left \| G_W(x_1)-G_W(x_2)\right \|_2  \qquad (1) </math>  
<math> D_W^i(x_1,x_2)=\left \| G_W(x_1)-G_W(x_2)\right \|_2  \qquad (1) </math>  


<math> L(W,(Y,x_1,x_2)^i))=(1-Y)L_S(D_W^i)+YL_D(D_W^i) \qquad (2) </math>
<math> l(W,(Y,x_1,x_2)^i))=(1-Y)L_S(D_W^i)+YL_D(D_W^i) \qquad (2) </math>


<math> \mathbb{L} (W)= \sum^P_{i=1} L(W,(Y,x_1,x_2)^i) \qquad (3) </math>   
<math> L(W)= \sum^P_{i=1} l(W,(Y,x_1,x_2)^i) \qquad (3) </math>   
</center>
</center>
 
where <math>(Y,x_1,x_2)^i</math> is the <math>i</math>-th labeled sample pair, <math>L_S</math> is the partial loss function for the points in the same cluster, <math>L_D</math> is the partial loss function for points in different clusters.
<math>L_S</math> and <math>L_D</math> are determined by ourselves and must be designed such that minimizing <math>L</math> respect to W results in low values of <math>D_W</math> for similar pairs and high values of <math>D_W</math> for similar pairs.
<math>L_S</math> and <math>L_D</math> are determined by ourselves and must be designed such that minimizing <math>L</math> respect to W results in low values of <math>D_W </math> for similar pairs and high values of <math>D_W </math> for disimilar pairs. The intuition is, minimizing the function <math>L(\ )</math> could build up clusters over input data set.

Revision as of 21:20, 12 July 2013

Dimensionality Reduction by Learning an Invariant Mapping


1. Intention

The drawbacks of most existing technique:

1 Most of them depend on a meaningful and computable distance metric in input space. (eg. LLE, Isomap relies on computable distance)

2 They do not compute a “function” that can accurately map new input samples whose relationship to the training data is unknown.

To overcome these drawbacks, this paper introduces a technique called DrLIM. The learning relies solely on neighborhood relationships and does not require any distance measure in the input space.

2. Mathematical Model

Input: A set of vectors [math]\displaystyle{ I=\{x_1,x_2,......,x_p\} }[/math], where [math]\displaystyle{ x_i\in \mathbb{R}^D, \forall i=1,2,3......,n. }[/math]

Output: A parametric function [math]\displaystyle{ G_W:\mathbb{R}^D \rightarrow \mathbb{R}^d }[/math] with [math]\displaystyle{ d\lt \lt D }[/math] such that it satisfies the following 3 properties:1.Simple distance measures in the output space should approximate the neighborhood relationships in the input space; 2.The mapping should not be constrained to implementing simple distance measures in the input space and should be able to learn invariances to complex transformations; 3.It should faithful even for samples whose neighborhood relationship are unknown.

Solution: The high dimensional data points could be mapped to lower dimension in several clusters. We could build up clusters according to similarities of the vectors. The similarities are measured by prior knowledge. Set Y=0 if [math]\displaystyle{ x_1 }[/math] and [math]\displaystyle{ x_2 }[/math] are in the same cluster; otherwise set Y=1. This clustering problem can be reduced to an optimization problem. Define following functions:

[math]\displaystyle{ D_W^i(x_1,x_2)=\left \| G_W(x_1)-G_W(x_2)\right \|_2 \qquad (1) }[/math]

[math]\displaystyle{ l(W,(Y,x_1,x_2)^i))=(1-Y)L_S(D_W^i)+YL_D(D_W^i) \qquad (2) }[/math]

[math]\displaystyle{ L(W)= \sum^P_{i=1} l(W,(Y,x_1,x_2)^i) \qquad (3) }[/math]

where [math]\displaystyle{ (Y,x_1,x_2)^i }[/math] is the [math]\displaystyle{ i }[/math]-th labeled sample pair, [math]\displaystyle{ L_S }[/math] is the partial loss function for the points in the same cluster, [math]\displaystyle{ L_D }[/math] is the partial loss function for points in different clusters. [math]\displaystyle{ L_S }[/math] and [math]\displaystyle{ L_D }[/math] are determined by ourselves and must be designed such that minimizing [math]\displaystyle{ L }[/math] respect to W results in low values of [math]\displaystyle{ D_W }[/math] for similar pairs and high values of [math]\displaystyle{ D_W }[/math] for disimilar pairs. The intuition is, minimizing the function [math]\displaystyle{ L(\ ) }[/math] could build up clusters over input data set.