proposal Fall 2010


Project 1 : Classifying New Data Points Using An Outlier Approach

By: Yongpeng Sun



Intuition:

In LDA, we assign a new data point to the class whose center is closest to it (in Mahalanobis distance under the shared covariance, assuming equal priors). However, it is also desirable to assign a new data point to the class in which it is least of an outlier. To this end, compared to every other class, a new data point should be closer to the center of its assigned class and, at the same time, closer to the lines on which the directions of variation of its assigned class lie.


Suppose there are two classes 0 and 1, and a new data point is given. To assign the new data point to a class, we can proceed using the following steps:

Step 1: Use PCA to project the training data onto a 2-dimensional space for easier visualization. The new data point is projected with the same mapping.

Step 2: For each class, find its center and the 2 lines, passing through that center, on which its 2 directions of variation lie.
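
Steps 1 and 2 could be sketched in Python with NumPy as follows. This is only an illustration under stated assumptions: the function names `pca_project_2d` and `class_center_and_directions` are hypothetical, the "directions of variation" of a class are taken to be the eigenvectors of its sample covariance matrix in the projected space, and the "amount of variation" along a direction is taken to be the corresponding eigenvalue.

```python
import numpy as np

def pca_project_2d(X):
    """Project the rows of X onto the top two principal components.

    Returns the projected data plus the mean and components, so that a
    new data point can later be projected with the same mapping.
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:2]
    return Xc @ components.T, mean, components

def class_center_and_directions(Z):
    """Center, unit directions of variation, and variances of one class.

    Assumption: directions are eigenvectors of the class covariance matrix,
    returned as columns with the largest-variance direction first.
    """
    center = Z.mean(axis=0)
    cov = np.cov(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # reorder to descending
    return center, eigvecs[:, order], eigvals[order]
```

A new point `x` would then be projected as `(x - mean) @ components.T` before being compared to the class centers and lines.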

Step 3: For the new data point, with regard to each of the two classes, find the sum of:

a: its distance to the center of that class
b: its distance to the line on which the direction of maximum variation of that class lies, weighted (multiplied) by the ratio of the amount of variation in the direction of the maximum variation of that class to the total variation of that class
c: its distance to the line on which the direction of the second largest variation of that class lies, weighted (multiplied) by the ratio of the amount of variation in the direction of the second largest variation of that class to the total variation of that class
Then, assign the point to the class having the smaller of these two sums, so that the point is less of an outlier in the class that it is assigned to as compared to the other class.
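
The sum in Step 3 can be written as a formula. The notation here is an assumption and does not appear in the proposal: for class k, let \mu_k be its center, v_{k1} and v_{k2} its unit directions of largest and second largest variation, and \lambda_{k1} \ge \lambda_{k2} the variances along them. Distances are assumed to be Euclidean, and d_{kj}(x) denotes the distance from x to the line through \mu_k in direction v_{kj}.

```latex
S_k(x) = \lVert x - \mu_k \rVert
       + \frac{\lambda_{k1}}{\lambda_{k1} + \lambda_{k2}}\, d_{k1}(x)
       + \frac{\lambda_{k2}}{\lambda_{k1} + \lambda_{k2}}\, d_{k2}(x),
\qquad
d_{kj}(x) = \bigl\lVert (x - \mu_k) - \bigl((x - \mu_k)^{\top} v_{kj}\bigr)\, v_{kj} \bigr\rVert
```

The three terms correspond to a, b and c, and the point is assigned to the class k with the smallest S_k(x). In 2 dimensions, with orthonormal directions, d_{k1}(x) is simply the absolute value of the component of x - \mu_k along v_{k2}, and vice versa.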


These 3 steps generalize easily to more than 2 classes: to assign a new data point, we only need to know, with regard to each class, the sum as described above, and we then assign the point to the class with the smallest sum.
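
A minimal sketch of the resulting classifier for any number of classes is given below. It is an illustration under assumptions, not a definitive implementation: the function names (`fit_class`, `distance_to_line`, `outlier_score`, `classify`) are hypothetical, distances are Euclidean, directions of variation are eigenvectors of each class covariance matrix, and the "amount of variation" is the corresponding eigenvalue. The data are assumed to be already projected to 2 dimensions.

```python
import numpy as np

def fit_class(Z):
    """Center, unit directions (columns, largest variance first), variances."""
    center = Z.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
    order = np.argsort(eigvals)[::-1]
    return center, eigvecs[:, order], eigvals[order]

def distance_to_line(x, center, direction):
    """Euclidean distance from x to the line through center along direction."""
    diff = x - center
    along = (diff @ direction) * direction   # component of diff along the line
    return np.linalg.norm(diff - along)

def outlier_score(x, center, directions, variances):
    """Distance to the center plus variance-weighted distances to each line."""
    weights = variances / variances.sum()    # share of the total variation
    score = np.linalg.norm(x - center)       # term a
    for j in range(directions.shape[1]):     # terms b and c in 2 dimensions
        score += weights[j] * distance_to_line(x, center, directions[:, j])
    return score

def classify(x, class_models):
    """Assign x to the class with the smallest outlier score."""
    scores = {label: outlier_score(x, *model)
              for label, model in class_models.items()}
    return min(scores, key=scores.get), scores

# Usage on synthetic data (made up for illustration only).
rng = np.random.default_rng(0)
class0 = rng.normal(size=(100, 2)) * np.array([2.0, 0.5])
class1 = rng.normal(size=(100, 2)) * np.array([0.5, 2.0]) + np.array([10.0, 10.0])
models = {0: fit_class(class0), 1: fit_class(class1)}
label, scores = classify(np.array([0.5, -0.2]), models)
```

Because `classify` takes a dictionary of per-class models, adding a third class only means adding one more entry to `models`.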


I would like to evaluate the effectiveness of this algorithm against LDA, QDA, and other classifiers, using data sets from the UCI Machine Learning Repository ( http://archive.ics.uci.edu/ml/ ).