Proposal for STAT946 projects Fall 2010


Project 1 : Sampling Landmarks For Landmark MDS Using A Distances-From-The-Center Approach

By: Yongpeng Sun


Intuition: When there is a very large number of data points <math>\,n</math>, and only a very small portion of them, totalling <math>\,k</math>, is to be sampled as landmarks for landmark MDS, there are two common approaches (both sketched in code after the list):


1. Use simple random sampling (SRS) to sample the <math>\,k</math> landmarks.


2. Use K-means to find <math>\,k</math> clusters in the data, and select the <math>\,k</math> points that are closest to the centers of these clusters as the <math>\,k</math> landmarks.
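For reference, a minimal sketch of these two baseline selections, assuming the data are stored as rows of a NumPy array; the function names and the use of SciPy's kmeans2 routine are my own choices, not part of the proposal:

<pre>
import numpy as np
from scipy.cluster.vq import kmeans2

def srs_landmarks(X, k):
    """Baseline 1: choose k landmark indices by simple random sampling (SRS)."""
    return np.random.choice(len(X), size=k, replace=False)

def kmeans_landmarks(X, k):
    """Baseline 2: run K-means, then take the data point nearest to each centre."""
    centroids, _ = kmeans2(X, k, minit='points')
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)  # n x k
    return np.argmin(dists, axis=0)
</pre>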


Neither of these approaches, however, can guarantee excellent results. With SRS, the sampled landmarks sometimes form a small cluster within the dataset that does not represent the entire dataset well. With K-means, the set of resulting landmarks is not stable between runs, because the final <math>\,k</math> centers depend on the initial <math>\,k</math> centers. Thus, another approach is needed that meets the following criteria:


1. Guarantee that the <math>\,k</math> selected landmarks represent the dataset well.


2. Guarantee that the <math>\,k</math> selected landmarks are stable between different runs.


To formulate this approach, I observed the following: if, in each step, I start from the center of the original training dataset and take the closest point to the center, the furthest point from the center, and the point whose distance from the center is closest to half of the distance between the center and the furthest point, and then remove these three points from the training dataset before moving on to the next step, the set of landmarks obtained in this way meets the two criteria above.


If there are clusters in the training dataset, then this approach should also sample more landmarks from the clusters than from sparse areas of the training dataset, for the following two reasons:


1. The nearest point from the original center in one step and the next-nearest point from the original center in the next step are more likely to be from the same cluster than not.


2. The furthest point from the original center in one step and the next-furthest point from the original center in the next step are more likely to be from the same cluster than not.


The procedure is described as follows (a code sketch is given after the steps):


Step 1: Find the average of the <math>\,n</math> training data points.


Step 2: Find the training data point that is closest to the average from Step 1, and take it as a landmark. In the following steps, this first landmark is simply referred to as the center.


Step 3: Compute the <math>\,n \times 1</math> distance vector that contains the distances between the <math>\,n</math> training data points and the center.


Step 4:


Step 4a: If fewer than <math>\,k</math> landmarks have been obtained, then find the training data point nearest to the center, take it as a landmark, and remove it from the training dataset by removing its entry from the distance vector from Step 3.


Step 4b: If fewer than <math>\,k</math> landmarks have been obtained, then find the training data point furthest from the center, take it as a landmark, and remove it from the training dataset by removing its entry from the distance vector from Step 3. Temporarily record the distance between this landmark and the center.


Step 4c: If fewer than <math>\,k</math> landmarks have been obtained, then find the training data point whose distance from the center is closest to half of the distance recorded in Step 4b, take it as a landmark, and remove it from the training dataset by removing its entry from the distance vector from Step 3.


Step 4d: If <math>\,k</math> landmarks have been obtained, then stop. Otherwise, repeat Step 4.
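A minimal sketch of the steps above, assuming Euclidean distances and data stored as rows of a NumPy array; the function and variable names are illustrative only:

<pre>
import numpy as np

def center_based_landmarks(X, k):
    """Sketch of Steps 1-4: pick k landmark indices by alternating near / far /
    half-way points relative to the point closest to the data mean ("the center")."""
    n = len(X)
    mean = X.mean(axis=0)                                      # Step 1: average of the n points
    center_idx = np.argmin(np.linalg.norm(X - mean, axis=1))   # Step 2: first landmark
    landmarks = [center_idx]
    dist = np.linalg.norm(X - X[center_idx], axis=1)           # Step 3: distances to the center
    remaining = [i for i in range(n) if i != center_idx]

    while len(landmarks) < k and remaining:
        # Step 4a: nearest remaining point to the center
        i = min(remaining, key=lambda j: dist[j])
        landmarks.append(i); remaining.remove(i)
        if len(landmarks) >= k or not remaining:
            break
        # Step 4b: furthest remaining point from the center (record its distance)
        i = max(remaining, key=lambda j: dist[j])
        far = dist[i]
        landmarks.append(i); remaining.remove(i)
        if len(landmarks) >= k or not remaining:
            break
        # Step 4c: remaining point whose distance is closest to half of that in 4b
        i = min(remaining, key=lambda j: abs(dist[j] - far / 2.0))
        landmarks.append(i); remaining.remove(i)
        # Step 4d: loop back until k landmarks are obtained
    return np.array(landmarks)
</pre>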


I would like to compare the effectiveness of my approach against SRS and K-means for sampling landmarks by applying all three approaches to 27-dimensional labeled phoneme data from the TIMIT database ( http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93S1 ), using the sampled landmarks in landmark MDS to reduce the dimensionality of the phoneme data, and then classifying the phonemes. When sampling the landmarks and applying landmark MDS, I would ignore the class labels; after the dimensionality reduction, I would use the class labels with a fixed classifier on the reduced data to compare how well my sampling method performs against the SRS and K-means methods.



Project 2 : Sparse LLE

By: Manda Winlaw

One drawback of using Local Linear Embedding (LLE) for dimensionality reduction is that it does not perform well if there are outliers or noise in the data. To overcome this limitation, one possible approach is to require sparsity in the vector of reconstruction weights. The reconstruction weights for each data point <math>\overline{X}_i</math> are chosen to minimize the reconstruction errors measured by the following cost function,

<math>E(W) = \sum_i \left|\overline{X}_i - \sum_{j=1}^{K} W_{ij}\overline{X}_j\right|^2</math>

where the <math>\overline{X}_j,\ j = 1, \ldots, K</math> are the <math>\,K</math> nearest neighbors of <math>\overline{X}_i</math> and the weights, <math>\,W_{ij}</math>, must satisfy <math>\sum_{j} W_{ij} = 1</math>.

From this cost function, we can see that if some of the weights are zero due to a sparsity constraint, then outliers may be eliminated from the reconstruction.

To impose sparsity, we introduce an additional constraint into the optimization problem for each data point to get

<math>\min_{\overline{W}_i} \left|\overline{X}_i - \sum_j W_{ij}\overline{X}_j\right|^2</math> such that <math>\; \sum_j W_{ij} = 1, \; P(\overline{W}_i) \leq c_1,</math>

where <math>\,c_1</math> is a constant and <math>P(\overline{W}_i)</math> is a convex penalty function of <math>\overline{W}_i</math> which can take different forms.

Of particular interest will be <math>P(\overline{W}_i) = \left\| \overline{W}_i \right\|^2_2</math>, in which case the optimization problem is equivalent to ridge regression, and the lasso penalty function, <math>P(\overline{W}_i) = \left\| \overline{W}_i \right\|_1</math>. As noted in <ref>R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society (Series B), 58:267–288, 1996.</ref>, the lasso penalty function is more appropriate for controlling sparsity than the <math>L_2</math>-norm penalty function, since the <math>L_2</math>-norm penalty does not actually give sparse solutions; rather, the solutions are scaled to give non-sparse low values. However, using the <math>L_2</math>-norm penalty function results in a closed-form solution for the weights, whereas there is no closed-form solution for the lasso penalty function, which means a numerical optimization algorithm must be used, increasing the computational complexity of the LLE algorithm. My goal is to examine both of these penalty functions as a means of improving the performance of the LLE algorithm for noisy data, and to compare their performance in terms of accuracy and computational complexity.
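To illustrate the <math>L_2</math>-penalized case, here is a minimal sketch that solves for one point's weights in closed form; it folds the penalty into the local Gram matrix as a regularization term rather than enforcing the constraint <math>P(\overline{W}_i) \leq c_1</math> explicitly, and the function and variable names are my own:

<pre>
import numpy as np

def ridge_lle_weights(x_i, neighbors, reg=1e-3):
    """Reconstruction weights for one point under an L2 (ridge-style) penalty.

    x_i       : (d,) the data point being reconstructed
    neighbors : (K, d) its K nearest neighbors
    reg       : penalty strength (plays the role of the bound on the L2 norm)
    """
    Z = neighbors - x_i                        # shift neighbors so x_i is at the origin
    C = Z @ Z.T                                # K x K local Gram (covariance) matrix
    C += reg * np.trace(C) * np.eye(len(C))    # L2 penalty => regularized linear system
    w = np.linalg.solve(C, np.ones(len(C)))    # solve C w = 1 (Lagrange condition)
    return w / w.sum()                         # enforce sum_j W_ij = 1

# The lasso (L1) version has no closed form; one would instead minimize the same
# objective with an L1 constraint using a numerical solver.
</pre>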

References

<references />


Project 3 : Face Recognition Using LDA-Based Algorithms

By: LishaYu

The most successful solutions to the problem of feature selection for face representation appear to be appearance-based approaches, which operate directly on images or appearances of face objects and process the images as two-dimensional holistic patterns.

However, most traditional linear discriminant analysis (LDA)-based methods suffer from the disadvantage that their optimality criteria are not directly related to the classification ability of the obtained feature representation. Moreover, their classification accuracy is affected by the “small sample size” (SSS) problem that is often encountered in face recognition (FR) tasks.
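As a toy illustration of the SSS problem (my own example, not taken from the paper): when the image dimension exceeds the number of training samples, the within-class scatter matrix is rank-deficient, so classical LDA cannot invert it.

<pre>
import numpy as np

rng = np.random.default_rng(0)
d, n_per_class, n_classes = 100, 5, 3          # image dimension >> samples per class
X = rng.normal(size=(n_classes * n_per_class, d))
y = np.repeat(np.arange(n_classes), n_per_class)

# within-class scatter S_w = sum over classes of (X_c - mean_c)^T (X_c - mean_c)
S_w = np.zeros((d, d))
for c in range(n_classes):
    Xc = X[y == c] - X[y == c].mean(axis=0)
    S_w += Xc.T @ Xc

print(np.linalg.matrix_rank(S_w))  # at most n - n_classes = 12 << d = 100, so S_w is singular
</pre>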

This paper introduces a new algorithm, Direct Fractional-Step LDA (DF-LDA), that deals with both of these shortcomings in an efficient and cost-effective manner. The following subtopics are included:

1. Introduction of Direct Fractional-Step LDA (DF-LDA)
 - Where are the Optimal Discriminant Features?
 - Variation of D-LDA
 - Rotation and Reorientation of D-LDA Subspace
2. Experimental Results

The new algorithm is compared, in terms of classification accuracy, to other commonly used FR methods on two face databases. Results indicate that the performance of the proposed method is overall superior to those of traditional FR approaches, such as the Eigenfaces, Fisherfaces, and D-LDA methods.

If time allows, LDA in small sample size (SSS) scenarios, with application to face recognition, will also be included.

Project 4 : Supervised Independent Component Analysis Based on Hilbert-Schmidt Independence Criterion

By: Fatemeh Dorri

Independent component analysis (ICA) has received increasing attention recently and has become a popular method for estimating the independent features of a dataset. ICA is an unsupervised algorithm that seeks a set of linear basis vectors that are independent. From a statistical viewpoint, principal component analysis (PCA) removes second-order statistical dependence, whereas ICA aims for higher-order statistical independence. However, both ICA and PCA produce linear representations of the data and cannot handle nonlinear cases. ICA is also similar to projection pursuit in the sense that both search for directions with maximally non-Gaussian distributions, since such projections are more interesting and more useful for classification. The results of the ICA algorithm will be promising if the independence assumption is correct and the components do not have Gaussian distributions. ICA is basically grouped with the unsupervised algorithms that do not use class information when feature extraction is performed, so maximum separability of the features is not ensured. In the paper considered here, class separability is enhanced by maximizing the Mahalanobis distance between classes. It is also shown that the proposed SICA outperforms ICA for recognition.

We propose a supervised independent component analysis (supervised ICA) algorithm for finding the main independent components of a dataset based on its class labels. The algorithm takes advantage of the Hilbert-Schmidt independence criterion (HSIC), which guides FastICA toward independent components according to their dependence with the class labels. In the results section, we compare ICA and our proposed method on two datasets.
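For reference, a minimal sketch of the (biased) empirical HSIC estimator between one candidate component and the class labels; the kernel choices (Gaussian on the component, linear on one-hot labels) and the function name are my own assumptions rather than the paper's:

<pre>
import numpy as np

def hsic(s, y, sigma=1.0):
    """Biased empirical HSIC between a 1-D component s and labels y.

    HSIC = (1 / (n - 1)^2) * trace(K H L H), where K and L are kernel matrices
    on s and y, and H is the centering matrix.
    """
    n = len(s)
    K = np.exp(-(s[:, None] - s[None, :]) ** 2 / (2 * sigma ** 2))  # Gaussian kernel on s
    Y = (y[:, None] == np.unique(y)[None, :]).astype(float)         # one-hot labels
    L = Y @ Y.T                                                      # linear kernel on labels
    H = np.eye(n) - np.ones((n, n)) / n                              # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

# A supervised variant of ICA could, for example, rank or select the components
# returned by FastICA according to hsic(component, labels).
</pre>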

Project 5 : Measuring Scene Changes in Images

By: Ryan Case

I propose an empirical evaluation of several distance metric learning techniques applied to time-lapse images, in order to understand whether these methods are able to measure changes in particular explanatory variables---specifically, the changes in an image due to lighting intensity, shadows, and meteorological conditions. Ultimately, I seek to understand whether distance metric learning can be used to attribute changes in the performance of photovoltaic cells to particular changes in environmental conditions.

The full proposal can be found [http://www.case.ca/ryan/academic/stat946-proposal.pdf here].

Project 6: Dimensionality Reduction for Supervised Learning

By: Greg D'Cunha

I will work on the idea that Professor Ghodsi mentioned in class: using alternative measures of distance for MDS. Namely, I will look into the 'measures' used for SNE and t-SNE and see how they perform compared to standard MDS and to improved methods such as ISOMAP.

Project 7 : Local Tangent Space Alignment and t-Distributed Stochastic Neighbor Embedding Techniques and Applicability to the Molecular Conformation Problem

By: Laleh Ghoraie

The molecular conformation problem, in its simplest form, is defined as follows: given a matrix of pairwise distances between the atoms of a molecule, the goal is to find the correct structure of the protein molecule with respect to its atoms and their given distances <ref name="bis">Biswas, P., Toh K.-C., and Ye Y. (2008) A distributed SDP approach for large scale noisy anchor-free graph realization with applications to molecular conformation. SIAM J. on Sci. Comp., 30, 1251-1277.</ref>. Knowledge of the protein structure/conformation is crucial to biologists. It helps in understanding the function of proteins, which is vital in, for instance, studying how diseases attack the body and how appropriate drugs should be designed to cure them.

The final solution to the conformation problem should satisfy a few distance constraints. Distances between some known atoms are nearly fixed in nature, and for others we may obtain lower and upper bounds from special measurement tools. This problem is a version of the graph realization problem, in which the objective is to find the vertices of a graph given the matrix of edge lengths and the constraints on them.

Given the exact interatomic distances, the problem can be solved and a valid conformation can be obtained by eigenvalue decomposition of the Gram matrix (an inner-product matrix) computed from a linear transformation of the distance matrix. Difficulties arise when the distance matrix is incomplete or includes noisy data, which is a common case.
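For reference, a minimal sketch of this exact-distance case, which is essentially classical MDS: double-center the squared distances to obtain the Gram matrix, then use its leading eigenvectors as coordinates (the function name and the choice of three dimensions are illustrative):

<pre>
import numpy as np

def conformation_from_distances(D, dim=3):
    """Recover coordinates (up to rotation/translation) from exact pairwise distances D."""
    n = D.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * H @ (D ** 2) @ H                # Gram (inner-product) matrix
    w, V = np.linalg.eigh(B)                   # eigen-decomposition (ascending eigenvalues)
    idx = np.argsort(w)[::-1][:dim]            # keep the dim largest eigenvalues
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))
</pre>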

Several approaches have been applied to this problem. The family of approaches closest to dimensionality reduction is multidimensional scaling (MDS) <ref name="tros">Trosset M. W. (1998) Applications of multidimensional scaling to molecular conformation, Computing Science and Statistics, 29, 148-152.</ref>. This method obtains the molecule's structure by minimizing the difference between the interatomic distances of the estimated conformation and the given distances. Another, similar approach applies a semidefinite programming (SDP) relaxation of the conformation problem formulation <ref name="bis"/> <ref>Leung N. H., and Toh K.-C. (2009) An SDP-based divide-and-conquer algorithm for large-scale noisy anchor-free graph realization. SIAM J. on Sci. Comp., 31, 6, 4351-4372.</ref>.

In this project, we investigate the applicability of two well-known manifold learning techniques to the conformation problem. The first technique is local tangent space alignment (LTSA) <ref name="zhan">Zhang, Z. and Zha, H. (2002) Principal manifolds and nonlinear dimension reduction via local tangent space alignment. SIAM Scientific Computing, 26(1): 313-338.</ref>. LTSA applies PCA in the neighborhood of each node and uses this to estimate the tangent space at that node.

LTSA represents the local geometry of the data by means of the tangent space in the neighborhood of each point, and constructs a global coordinate system for the nonlinear manifold by aligning the local tangent spaces <ref name="zhan"/>. The algorithm proceeds in two steps: in the first step, local information at each point is extracted, and in the second step, a global alignment is computed <ref>Liu H., Luo X., and Yao Y. (2007) Two manifold learning techniques for sensor localization. Systems, Man and Cybernetics, 2007. ISIC. IEEE International Conference on, pp. 2114-2118.</ref>.

The second technique is t-distributed stochastic neighbor embedding (t-SNE) <ref name="tros"/>. This technique is based on stochastic neighbor embedding <ref>Hinton G. E., and Roweis S. T. (2002) Stochastic Neighbor Embedding. In Advances in Neural Information Processing Systems, volume 15, 833-840, Cambridge, MA, USA. The MIT Press.</ref>, which uses conditional probabilities to represent similarities between points based on their pairwise distances.

Each conditional probability indicates the probability that a point would pick another point as its neighbor. If the high-dimensional data are embedded correctly in the low dimension, the conditional probability of two points in the high dimension should equal the conditional probability of their corresponding (mapped) points in the low dimension. Therefore, the objective function of this method aims to minimize the difference between these two sets of probabilities.
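A minimal sketch of these high-dimensional conditional probabilities, using a single global bandwidth <math>\,\sigma</math> for simplicity (in SNE and t-SNE each point's bandwidth is actually tuned to a target perplexity; the function name is my own):

<pre>
import numpy as np

def conditional_probabilities(X, sigma=1.0):
    """p[i, j] = probability that point i picks point j as its neighbor (high-dimensional side)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    affinities = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(affinities, 0.0)                    # a point never picks itself
    return affinities / affinities.sum(axis=1, keepdims=True)
</pre>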

One important objective of dimensionality reduction is to retain, as much as possible, the local structure of the high-dimensional data in the low-dimensional embedding. t-SNE has proved to be one of the best techniques at achieving this goal. We will examine how important this aspect of t-SNE is for obtaining as precise a conformation as possible for a molecule in low dimension.

On the other hand, as mentioned before, the key idea behind techniques such as local linear embedding (LLE) and LTSA is also that information about the global structure of the data can be obtained from the interactions of local structures <ref name="zhan"/>. This property also makes LTSA an interesting option for the conformation problem.

References

<references/>