
Dimensionality Reduction by Learning an Invariant Mapping

Jiang, Cong

Song, Ziwei

Ye, Zhaoshan

Zhang, Wenling

Intention

The drawbacks of most existing techniques:

1 Most of them depend on a meaningful and computable distance metric in the input space (e.g. LLE and Isomap rely on computable distances).

2 They do not compute a “function” that can accurately map new input samples whose relationship to the training data is unknown.

To overcome these drawbacks, this paper introduces a technique called DrLIM. The learning relies solely on neighborhood relationships and does not require any distance measure in the input space.

Mathematical Model

Input: A set of vectors $I=\{x_1,x_2,\ldots,x_p\}$, where $x_i\in \mathbb{R}^D$ for all $i=1,2,\ldots,p$.

Output: A parametric function $G_W:\mathbb{R}^D \rightarrow \mathbb{R}^d$ with $d \ll D$ that satisfies the following 3 properties:

1 Simple distance measures in the output space should approximate the neighborhood relationships in the input space.

2 The mapping should not be constrained to implementing simple distance measures in the input space and should be able to learn invariances to complex transformations.

3 It should be faithful even for samples whose neighborhood relationships are unknown.

Building the mathematical model: The high-dimensional data points are mapped to a lower dimension in such a way that similar points form clusters. Clusters are built according to the similarities of the vectors, and these similarities are determined by prior knowledge. Set $Y=0$ if $x_1$ and $x_2$ are deemed similar; otherwise set $Y=1$. This clustering problem can be reduced to an optimization problem. Define the following functions:

$D_W^i(x_1,x_2)=\left \| G_W(x_1)-G_W(x_2)\right \|_2 \qquad (1)$

$l(W,(Y,x_1,x_2)^i)=(1-Y)L_S(D_W^i)+YL_D(D_W^i) \qquad (2)$

$L(W)= \sum^P_{i=1} l(W,(Y,x_1,x_2)^i) \qquad (3)$

where $(Y,x_1,x_2)^i$ is the $i$-th labeled sample pair, $P$ is the number of training pairs, $L_S$ is the partial loss function for points in the same cluster, and $L_D$ is the partial loss function for points in different clusters. $L_S$ and $L_D$ are chosen by the designer and must be constructed such that minimizing $L$ with respect to $W$ results in low values of $D_W$ for similar pairs and high values of $D_W$ for dissimilar pairs. The intuition is that minimizing $L(W)$ builds up clusters of the input data set in the low-dimensional space. The exact loss function is

$l(W, (Y, x_1, x_2))=(1-Y)\frac{1}{2}(D_W)^2+Y\frac{1}{2}\{\max(0, m-D_W)\}^2$

where $m>0$ is a margin: dissimilar pairs contribute to the loss only when $D_W < m$.
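As a minimal sketch of the loss above, the following Python computes the loss for a single pair from the distance $D_W$ between the two mapped points. The function names and the default margin of 1.0 are illustrative assumptions, not taken from the paper.

```python
import math

def euclidean_distance(a, b):
    """D_W: Euclidean distance between two already-mapped points G_W(x_1), G_W(x_2)."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def contrastive_loss(y, d_w, margin=1.0):
    """Loss for one pair; y = 0 for a similar pair, y = 1 for a dissimilar pair."""
    similar_term = 0.5 * d_w ** 2                         # pulls similar pairs together
    dissimilar_term = 0.5 * max(0.0, margin - d_w) ** 2   # pushes dissimilar pairs apart up to the margin
    return (1 - y) * similar_term + y * dissimilar_term
```

A similar pair is penalized in proportion to its squared distance, while a dissimilar pair that is already farther apart than the margin contributes nothing.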

Algorithm:

Step 1: For each input sample $x_i$, do the following:

(a) Use prior knowledge to find the set of samples $S_{x_i}=\{x_j\}_{j=1}^p$, where each $x_j$ is deemed similar to $x_i$.

(b) Pair the sample $x_i$ with all the other training samples and label the pairs so that $Y_{ij}=0$ if $x_j \in S_{x_i}$ and $Y_{ij}=1$ otherwise.

Step 2: Repeat until convergence:

(a) For each pair $(x_i,x_j)$ in the training set, do

i. If $Y_{ij}=0$, then update W to decrease $D_W=\left \| G_W(x_i)-G_W(x_j) \right \|_2$

ii. If $Y_{ij}=1$, then update W to increase $D_W=\left \| G_W(x_i)-G_W(x_j) \right \|_2$
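The two steps above can be sketched in Python as follows. This is only an illustration under several assumptions that are not from the paper: $G_W$ is taken to be a plain linear map $G_W(x)=Wx$ (the experiments use neural networks), similarity is treated as symmetric, "until convergence" is replaced by a fixed number of epochs, and all function names, the learning rate, and the margin are hypothetical. The update uses the gradient of the per-pair loss for this linear case.

```python
import math
from itertools import combinations

def label_pairs(num_samples, similar_sets):
    """Step 1: return {(i, j): Y_ij} for i < j, with Y = 0 if similar and 1 otherwise."""
    labels = {}
    for i, j in combinations(range(num_samples), 2):
        similar = j in similar_sets.get(i, set()) or i in similar_sets.get(j, set())
        labels[(i, j)] = 0 if similar else 1
    return labels

def apply_map(W, x):
    """G_W(x) = W x for a d-by-D matrix W (linear map, assumed for illustration)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def pair_distance(W, x1, x2):
    """D_W: Euclidean distance between G_W(x1) and G_W(x2)."""
    return math.dist(apply_map(W, x1), apply_map(W, x2))

def sgd_step(W, x1, x2, y, margin=1.0, lr=0.05, eps=1e-12):
    """Step 2(a): one gradient step for one pair; returns W unchanged if no update is needed."""
    delta = [a - b for a, b in zip(x1, x2)]
    diff = apply_map(W, delta)                  # G_W(x1) - G_W(x2), by linearity
    d_w = math.sqrt(sum(v * v for v in diff))
    if y == 0:
        scale = 1.0                             # gradient of 0.5 * D_W^2: decreases D_W
    elif d_w < margin:
        scale = -(margin - d_w) / (d_w + eps)   # gradient of 0.5 * (m - D_W)^2: increases D_W
    else:
        return W                                # dissimilar pair already beyond the margin
    return [[w - lr * scale * diff[r] * delta[c] for c, w in enumerate(row)]
            for r, row in enumerate(W)]

def train(W, samples, labels, epochs=100, margin=1.0, lr=0.05):
    """Step 2: sweep over all labeled pairs for a fixed number of epochs."""
    for _ in range(epochs):
        for (i, j), y in labels.items():
            W = sgd_step(W, samples[i], samples[j], y, margin, lr)
    return W
```

For example, with three 2-D samples where only samples 0 and 1 are deemed similar, `train([[0.5, 0.5]], samples, label_pairs(3, {0: {1}}))` should learn a 1-D map that places samples 0 and 1 close together and sample 2 roughly a margin away from both.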

Experiments

How do they define neighbors, and what is $G_W$ in practice?

The experiments on airplane images from the NORB dataset use a 2-layer fully connected neural network as $G_W$. The neighborhood graph is constructed to reflect the temporal continuity of the camera, i.e. images are deemed similar if they were taken from contiguous elevations or azimuths, regardless of lighting.

The experiments on the MNIST dataset use a convolutional network as $G_W$. The neighborhood graph is generated from Euclidean distances (5 nearest neighbors).
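A nearest-neighbor graph of this kind could be built roughly as follows. This is a brute-force sketch, assuming each image has been flattened into a list of numbers; the function name `knn_neighbors` is hypothetical, and the paper's experiments may have constructed or post-processed the graph differently.

```python
import math

def knn_neighbors(points, k=5):
    """Map each index i to the set of indices of its k nearest neighbors (Euclidean distance)."""
    neighbors = {}
    for i, p in enumerate(points):
        # Distance from point i to every other point, paired with that point's index.
        others = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        others.sort()  # ties in distance are broken by the smaller index
        neighbors[i] = {j for _, j in others[:k]}
    return neighbors
```

The resulting sets can play the role of $S_{x_i}$ in Step 1 of the algorithm: pairs inside a neighbor set get $Y=0$, all other pairs get $Y=1$.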

Discussion

This method seems very similar to the Large Margin Nearest Neighbor (LMNN) method, because it also aims to maximize the difference between the distances to same-label neighbors and the distances to different-label neighbors. One difference is that DrLIM uses a heuristic method to find the solution, whereas the LMNN optimization problem is convex, so its exact solution can be found.