proposal Fall 2010


Project 1 : Classifying New Data Points by Minimizing Their Chances of Being Outliers

By: Yongpeng Sun

Intuition:

In LDA, we assign a new data point to the class whose center is closest to it. At the same time, however, it is desirable to take into account how much the new point would look like an outlier in each class.

For a new data point, we would like to assign it to the class in which it looks least like an outlier. To this end, the new data point should not only be as close as possible to the center of its assigned class, but should also have the smallest possible orthogonal distance to the direction of maximum variation of that class.


Suppose there are two classes 0 and 1, and a new data point is given. To assign the new data point to a class, we can proceed using the following steps:

Step 1: Use PCA to project the given data onto a 2-dimensional space for easier visualization.


Step 2: For each class, find its center and the direction of its maximum variation.


Step 3: For the new data point, find its distances to the two centers, [math]\displaystyle{ \,d_{0,center} }[/math] and [math]\displaystyle{ \,d_{1,center} }[/math], and its orthogonal distances to the two directions of maximum variation, [math]\displaystyle{ \,d_{0,direction} }[/math] and [math]\displaystyle{ \,d_{1,direction} }[/math].
Then assign the point to the class with the smallest [math]\displaystyle{ \,d_{i,center} + d_{i,direction} }[/math], where [math]\displaystyle{ \,i\in\{0, 1\} }[/math], so that it is close to the center of that class while also being less likely to be an outlier in that class.
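
Steps 2 and 3 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not a finished implementation: the points are assumed to be 2-dimensional already (that is, Step 1 has been applied), the "direction" of a class is taken to be the line through its center along its first principal axis, and the function names `class_summary`, `outlier_score` and `classify`, as well as the toy data, are hypothetical.

```python
import math


def class_summary(points):
    """Return (center, unit direction of maximum variation) for 2-D points."""
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    # Entries of the 2x2 scatter matrix; dividing by n would not change the direction.
    sxx = sum((p[0] - cx) ** 2 for p in points)
    syy = sum((p[1] - cy) ** 2 for p in points)
    sxy = sum((p[0] - cx) * (p[1] - cy) for p in points)
    # Closed-form angle of the leading eigenvector of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (cx, cy), (math.cos(theta), math.sin(theta))


def outlier_score(x, center, direction):
    """Return d_center + d_direction for point x and one class."""
    dx, dy = x[0] - center[0], x[1] - center[1]
    d_center = math.hypot(dx, dy)
    # Assumption: the direction is a line through the class center, so the
    # orthogonal distance is the magnitude of the 2-D cross product.
    d_direction = abs(dx * direction[1] - dy * direction[0])
    return d_center + d_direction


def classify(x, classes):
    """Assign x to the class label with the smallest combined distance."""
    scores = {
        label: outlier_score(x, *class_summary(pts))
        for label, pts in classes.items()
    }
    return min(scores, key=scores.get)


# Hypothetical toy data: class 0 spreads horizontally, class 1 vertically.
classes = {
    0: [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)],
    1: [(6, -2), (6, -1), (6, 0), (6, 1), (6, 2)],
}

print(classify((4.5, 0.0), classes))  # prints 0
```

In this toy example the new point (4.5, 0) is nearer to the center of class 1 (distance 1.5 versus 2.5), but it lies on the axis of maximum variation of class 0 and 1.5 units off the axis of class 1, so the combined score is 2.5 for class 0 and 3.0 for class 1 and the point is assigned to class 0.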


I would like to evaluate the effectiveness of this algorithm against LDA, QDA and other classifiers, using data sets from the UCI Machine Learning Repository ( http://archive.ics.uci.edu/ml/ ).