proposal Fall 2010: Difference between revisions

</noinclude>
</noinclude>
Revision as of 10:33, 22 October 2010

Project 1 : Classifying New Data Points Using An Outlier Approach

By: Yongpeng Sun



Intuition:

In LDA, a new data point is assigned to the class whose center is nearest (in Mahalanobis distance, assuming equal priors). It is also desirable, however, to assign the point to the class in which it is least of an outlier. To this end, the point should be closer to the center of its assigned class than to the center of any other class, and also closer to the lines on which the directions of variation of its assigned class lie.


Suppose there are two classes 0 and 1, and a new data point is given. To assign the new data point to a class, we can proceed using the following steps:

Step 1: Use PCA to project the training data onto a 2-dimensional space for better visualization.

Step 2: For each class, find its center and the 2 lines on which its 2 directions of variation lie.

Step 3: For the new data point, with regard to each of the two classes, find the sum of:

a: its distance to the center of that class
b: its distance to the line on which the direction of maximum variation of that class lies, weighted (multiplied) by the ratio of the maximum variation of that class to its total variation
c: its distance to the line on which the direction of the second largest variation of that class lies, weighted (multiplied) by the ratio of the second largest variation of that class to its total variation

Then, assign the point to the class having the smaller of these two sums, so that the point is less of an outlier in its assigned class than in the other class.
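
Steps 2 and 3 can be sketched in Python as follows. This is only a minimal illustration under several assumptions that the proposal does not state: the data have already been projected to 2 dimensions (Step 1), distances are Euclidean, each line passes through the class center, and the "amount of variation" along a direction is the corresponding eigenvalue of the class's sample covariance matrix. The function names (class_summary, outlier_score, classify) are hypothetical.

```python
import math


def class_summary(points):
    """Step 2: center, two unit directions of variation, and their variances.

    Assumes `points` is a list of at least two (x, y) pairs that are
    already projected to 2-D.
    """
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    # Entries of the 2x2 sample covariance matrix.
    sxx = sum((p[0] - cx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - cy) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - cx) * (p[1] - cy) for p in points) / (n - 1)
    # Closed-form eigen-decomposition of a symmetric 2x2 matrix.
    half_trace = (sxx + syy) / 2
    root = math.hypot((sxx - syy) / 2, sxy)
    lam1, lam2 = half_trace + root, half_trace - root
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)  # angle of the main axis
    u1 = (math.cos(theta), math.sin(theta))       # direction of maximum variation
    u2 = (-math.sin(theta), math.cos(theta))      # direction of second largest variation
    return (cx, cy), (u1, u2), (lam1, lam2)


def outlier_score(x, summary):
    """Step 3: the sum a + b + c for one class (assumed Euclidean distances)."""
    (cx, cy), (u1, u2), (lam1, lam2) = summary
    dx, dy = x[0] - cx, x[1] - cy
    d_center = math.hypot(dx, dy)
    # Distance to a line through the center with unit direction u is the
    # magnitude of the 2-D cross product of (dx, dy) and u.
    d_line1 = abs(dx * u1[1] - dy * u1[0])
    d_line2 = abs(dx * u2[1] - dy * u2[0])
    total = lam1 + lam2
    if total == 0:  # degenerate class: all training points coincide
        return d_center
    return d_center + (lam1 / total) * d_line1 + (lam2 / total) * d_line2


def classify(x, classes):
    """Assign x to the class label with the smallest sum."""
    summaries = {label: class_summary(pts) for label, pts in classes.items()}
    return min(summaries, key=lambda label: outlier_score(x, summaries[label]))


# Made-up illustrative data for two classes.
classes = {
    0: [(-2, 0), (-1, 0.1), (0, -0.1), (1, 0.1), (2, -0.1)],
    1: [(10, 8), (10.1, 9), (9.9, 10), (10.1, 11), (9.9, 12)],
}
label = classify((0.5, 0.2), classes)
```

Because classify takes a dictionary of classes and picks the minimum score, the same sketch applies unchanged when there are more than two classes.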


These 3 steps generalize readily to the case of more than 2 classes because assigning a new data point requires only the sum described above for each class; the point is assigned to the class with the smallest sum.


I would like to evaluate the effectiveness of my algorithm against LDA, QDA, and other classifiers using data sets from the UCI database ( http://archive.ics.uci.edu/ml/ ).