# proposal for STAT946 projects Fall 2010

## Project 1 : Sampling Landmarks For Landmark MDS Using A Distances-From-The-Center Approach

### By: Yongpeng Sun

Intuition: When the number of data points $\,n$ is very large and only a small subset of size $\,k$ is to be sampled as landmarks for landmark MDS, there are two common approaches:

1. Use simple random sampling (SRS) to sample the $\,k$ landmarks.

2. Use K-means to find $\,k$ clusters in the data, and select the $\,k$ points that are closest to the centers of these clusters as the $\,k$ landmarks.

Neither of these approaches, however, can guarantee good results. With SRS, the sampled landmarks sometimes form a small cluster that does not represent the entire dataset well. With K-means, the set of resulting landmarks is not stable between runs, because the final $\,k$ centers depend on the initial $\,k$ centers. Thus, another approach is needed that meets the following criteria:

1. Guarantee that the $\,k$ selected landmarks represent the dataset well.

2. Guarantee that the $\,k$ selected landmarks are stable between different runs.

To formulate this approach, I observed the following: in each step, starting from the center of the original training dataset, take the closest point to the center, the furthest point from the center, and the point whose distance from the center is closest to half of the distance between the center and the furthest point; then remove these 3 points from the training dataset before moving on to the next step. The set of landmarks obtained in this way meets the two criteria above.

If there are clusters in the training dataset, this approach should also sample more landmarks from clusters than from sparse areas of the training dataset, for the following two reasons:

1. The nearest point to the center in one step and the next-nearest point to the center in the following step are more likely than not to come from the same cluster.

2. The furthest point from the center in one step and the next-furthest point from the center in the following step are more likely than not to come from the same cluster.

The procedure is described as follows:

Step 1: Find the average of the $\,n$ training data points.

Step 2: Find the training data point closest to the average from Step 1, and take it as the first landmark. In the following steps, this first landmark is simply referred to as the center.

Step 3: Compute the $\,n \times 1$ vector of distances between the $\,n$ training data points and the center.

Step 4:

Step 4a: If fewer than $\,k$ landmarks have been obtained, find the training data point nearest to the center, take it as a landmark, and remove it from the training dataset by removing its entry from the distances vector of Step 3.

Step 4b: If fewer than $\,k$ landmarks have been obtained, find the training data point furthest from the center, take it as a landmark, and remove it from the training dataset by removing its entry from the distances vector of Step 3. Temporarily record the distance between this landmark and the center.

Step 4c: If fewer than $\,k$ landmarks have been obtained, find the training data point whose distance from the center is closest to half of the distance recorded in Step 4b, take it as a landmark, and remove it from the training dataset by removing its entry from the distances vector of Step 3.

Step 4d: If $\,k$ landmarks have been obtained, then stop. Otherwise, repeat Step 4.
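The procedure above can be sketched in Python as follows. This is a minimal illustration, assuming Euclidean distances and a NumPy array of data; the function name and the details of tie-breaking are my own choices, not part of the proposal:

```python
import numpy as np

def sample_landmarks(X, k):
    """Sample k landmarks via the distances-from-the-center approach.

    X : (n, d) array of training data points.
    Returns the list of indices of the selected landmarks in X.
    """
    n = X.shape[0]
    # Step 1: average of the n training points.
    mean = X.mean(axis=0)
    # Step 2: the point closest to the average is the first landmark (the "center").
    center_idx = int(np.argmin(np.linalg.norm(X - mean, axis=1)))
    landmarks = [center_idx]
    # Step 3: distances between every training point and the center.
    dist = np.linalg.norm(X - X[center_idx], axis=1)
    remaining = [i for i in range(n) if i != center_idx]
    # Step 4: repeatedly take the nearest, furthest, and "half-way" points.
    while len(landmarks) < k and remaining:
        # Step 4a: nearest remaining point to the center.
        nearest = min(remaining, key=lambda i: dist[i])
        landmarks.append(nearest)
        remaining.remove(nearest)
        if len(landmarks) >= k or not remaining:
            break
        # Step 4b: furthest remaining point from the center.
        furthest = max(remaining, key=lambda i: dist[i])
        landmarks.append(furthest)
        remaining.remove(furthest)
        half = dist[furthest] / 2.0
        if len(landmarks) >= k or not remaining:
            break
        # Step 4c: point whose distance to the center is closest to
        # half of the furthest distance from Step 4b.
        halfway = min(remaining, key=lambda i: abs(dist[i] - half))
        landmarks.append(halfway)
        remaining.remove(halfway)
    return landmarks
```

Note that the procedure is fully deterministic: running it twice on the same data returns the same landmarks, which is exactly the stability property that K-means lacks.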

I would like to compare the effectiveness of my approach against SRS and K-means for sampling landmarks. All three approaches will be applied to 27-dimensional labeled phoneme data from the TIMIT database ( http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93S1 ) to sample landmarks, after which landmark MDS will be used to reduce the dimensionality of the phoneme data before classifying the phonemes. When sampling landmarks and running landmark MDS, I will ignore the class labels of the data; after the dimensionality reduction, I will take the class labels into account and apply a fixed classifier to the resulting data, in order to compare how well my sampling method performs against the SRS and K-means methods.

## Project 2 : Sparse LLE

### By: Manda Winlaw

One drawback of using Locally Linear Embedding (LLE) for dimensionality reduction is that it does not perform well if there are outliers or noise in the data. One possible approach to overcoming this limitation is to require sparsity in the vector of reconstruction weights. The reconstruction weights for each data point $\ \overline{X}_i$ are chosen to minimize the reconstruction errors measured by the following cost function,

$E(W) = \sum_i |\overline{X}_i - \sum_{j=1}^{K} W_{ij}\overline{X}_j|^2$

where the $\ \overline{X}_j, j = 1, \ldots, K$ are the $\ K$ nearest neighbors of $\ \overline{X}_i$ and the weights, $\ W_{ij}$, must satisfy $\ \sum_{j} W_{ij} = 1$.

From this cost function we can see that if some of the weights are zero due to a sparsity constraint then outliers may be eliminated from the reconstruction.
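For reference, the unpenalized sum-to-one weight problem has a well-known closed form via the local Gram matrix; this is the standard LLE weight step, not the sparse variant proposed here. A minimal sketch, with a small ridge term added for numerical stability (a common practical choice, assumed here rather than taken from the proposal):

```python
import numpy as np

def lle_weights(Xi, neighbors, reg=1e-3):
    """Reconstruction weights for one point Xi from its K nearest neighbors.

    neighbors : (K, d) array of the K nearest neighbors of Xi.
    Solves min |Xi - sum_j W_j neighbors_j|^2 subject to sum_j W_j = 1
    via the local Gram matrix.
    """
    K = neighbors.shape[0]
    Z = neighbors - Xi                       # shift so Xi is at the origin
    G = Z @ Z.T                              # local Gram matrix
    G += reg * np.trace(G) / K * np.eye(K)   # ridge term for stability
    w = np.linalg.solve(G, np.ones(K))
    return w / w.sum()                       # enforce the sum-to-one constraint
```

With an outlier among the neighbors, nothing in this closed form forces its weight to zero, which is the motivation for the sparsity constraint below.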

To impose the sparsity constraint we introduce an additional constraint into the optimization problem for each data point to get,

$\min_{\overline{W}_i} |\overline{X}_i - \sum_j W_{ij}\overline{X}_j|^2$ such that $\; \sum_j W_{ij} = 1, \; P(\overline{W}_i) \leq c_1,$

where $\ c_1$ is a constant and $P(\overline{W}_i)$ is a convex penalty function of $\overline{W}_i$ which can take different forms.

Of particular interest will be $P(\overline{W}_i) = \left\| \overline{W}_i \right\|^2_2$, in which case the optimization problem is equivalent to ridge regression, and the lasso penalty function, $P(\overline{W}_i) = \left\| \overline{W}_i \right\|_1$. As noted in <ref>R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society (Series B), 58:267–288, 1996.</ref>, the lasso penalty function is more appropriate for controlling sparsity than the $L_2$-norm penalty function, since the $L_2$-norm penalty does not actually give sparse solutions; rather, the solutions are scaled to non-sparse low values. However, the $L_2$-norm penalty yields a closed-form solution for the weights, whereas the lasso penalty does not, so a numerical optimization algorithm must be used, increasing the computational complexity of the LLE algorithm. My goal is to examine both of these penalty functions as a means of improving the performance of the LLE algorithm on noisy data, and to compare their performance in terms of accuracy and computational complexity.
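The contrast between the two penalties can be illustrated with the textbook orthonormal-design case, where both penalized regression problems have closed-form solutions: ridge rescales the least-squares coefficients uniformly and never produces exact zeros, while the lasso soft-thresholds them, sending small coefficients exactly to zero. This is an illustrative sketch of that standard fact, not the LLE weight problem itself:

```python
import numpy as np

def ridge_shrink(beta_ols, lam):
    # Ridge with an orthonormal design: uniform rescaling,
    # coefficients shrink but are never exactly zero.
    return beta_ols / (1.0 + lam)

def lasso_shrink(beta_ols, lam):
    # Lasso with an orthonormal design: soft-thresholding,
    # coefficients below the threshold become exactly zero.
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)

beta = np.array([3.0, 0.5, -0.2, 1.5])  # least-squares coefficients
print(ridge_shrink(beta, 1.0))  # all entries shrunk, none zero
print(lasso_shrink(beta, 1.0))  # the two small entries are zeroed out
```

This is why the lasso penalty can drop outlier neighbors from the reconstruction entirely, whereas the $L_2$ penalty merely down-weights them.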