proposal Fall 2010

From statwiki
Revision as of 15:44, 21 October 2010 by Y24Sun (talk | contribs) (By: Yongpeng Sun)

Project 1: Classifying New Data Points by Minimizing Their Chances of Being Outliers

By: Yongpeng Sun


  • In LDA, we assign a new data point to the class whose center is closest to it. However, it is also desirable to take into account how much the new point would look like an outlier in each class.
  • We would like to assign a new data point to the class in which it looks least like an outlier. To this end, the new data point should not only be as close as possible to the center of its assigned class, but should also have the smallest possible orthogonal distance to the direction of maximum variation of that class.

Suppose there are two classes 0 and 1, and a new data point is given. To assign the new data point to a class, we can proceed using the following steps:

Step 1: Use PCA to project the given data onto a 2-dimensional space for better visualization.

Step 2: For each class, find its center and the direction of its maximum variation.

Step 3: For the new data point, find its distances to the two class centers, [math]\,d_{0,center}[/math] and [math]\,d_{1,center} [/math], and its orthogonal distances to the two directions of maximum variation, [math]\,d_{0,direction}[/math] and [math]\,d_{1,direction}[/math].
Then, assign the point to the class with the smallest [math]\,d_{i,center} + d_{i,direction}[/math], where [math]\,i\in\{0, 1\}[/math], so that it is least likely to be an outlier in the class to which it is assigned.
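
The steps above can be sketched in Python with NumPy. This is only an illustration under stated assumptions, not a finished implementation: distances are taken to be Euclidean, the "direction" of a class is taken to be the line through its center along its first principal direction, and the function names (fit_class, outlier_score, classify) and the toy data are hypothetical. The toy data are already 2-dimensional, so the PCA projection of Step 1 is skipped.

```python
import numpy as np

def fit_class(X):
    """Step 2: return the center and the unit direction of maximum variation."""
    center = X.mean(axis=0)
    # The first right-singular vector of the centered data is the
    # first principal direction (direction of maximum variation).
    _, _, vt = np.linalg.svd(X - center, full_matrices=False)
    return center, vt[0]

def outlier_score(x, center, direction):
    """Step 3: d_center + d_direction for one class (assumed Euclidean)."""
    diff = x - center
    d_center = np.linalg.norm(diff)
    # Subtract the component along the direction; the remainder is the
    # orthogonal offset from the line through the center.
    d_direction = np.linalg.norm(diff - np.dot(diff, direction) * direction)
    return d_center + d_direction

def classify(x, model):
    """Assign x to the class label with the smallest score."""
    return min(model, key=lambda label: outlier_score(x, *model[label]))

# Hypothetical toy data: class 0 spreads along the horizontal axis,
# class 1 spreads along the vertical axis.
X0 = np.array([[-2.0, 0.0], [-1.0, 0.1], [0.0, 0.0], [1.0, -0.1], [2.0, 0.0]])
X1 = np.array([[5.0, -2.0], [5.1, -1.0], [5.0, 0.0], [4.9, 1.0], [5.0, 2.0]])
model = {0: fit_class(X0), 1: fit_class(X1)}

new_point = np.array([2.5, 0.0])
print(classify(new_point, model))
```

In this toy example the new point is roughly equidistant from both centers, so the orthogonal distance to each class's main direction is what decides the assignment.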

This procedure generalizes easily to the [math]\,k[/math]-class case where [math]\,k \ge 2[/math], because only the center and the main direction of variation of each class are needed to assign a new data point to a class.
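
One possible way to write the rule for [math]\,k[/math] classes is given below. The notation is an assumption introduced here for illustration: [math]\,\mu_i[/math] denotes the center of class [math]\,i[/math], [math]\,v_i[/math] a unit vector along its direction of maximum variation, distances are Euclidean, and the direction is treated as a line through [math]\,\mu_i[/math].

[math]\,\hat{y}(x) = \arg\min_{i \in \{0, \dots, k-1\}} \left( \|x - \mu_i\| + \|(I - v_i v_i^T)(x - \mu_i)\| \right)[/math]

The first term is [math]\,d_{i,center}[/math] and the second is [math]\,d_{i,direction}[/math], the length of the part of [math]\,x - \mu_i[/math] that is orthogonal to [math]\,v_i[/math].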

I would like to evaluate the effectiveness of my algorithm compared to LDA, QDA, and other classifiers, using data sets from the UCI database.