proposal for STAT946 projects Fall 2010: Difference between revisions

From statwiki
Jump to navigation Jump to search
Line 6: Line 6:
Intuition:  
Intuition:  


When we have a large number of data points and we wish to sample a small portion of them as landmarks, using simple random sampling (SRS) does not guarantee good results. The reason is that, when the sample is very small, SRS sometimes results in a sample that is a small cluster within the data space that would not be representative of all of the data. We would like our landmarks to be as representative of the data set as possible. Thus, we can, using
When we have a large number of data points and we wish to sample a small portion of them as landmarks, using simple random sampling (SRS) does not guarantee good results. The reason is that, when the sample is very small, SRS sometimes results in a sample that is a small cluster within the data space that would not be representative of all of the data. We would like our landmarks to be as representative of the data set as possible. Thus, we can, starting from the center of the data space, alternatively obtain landmarks that are as near as possible to the center and as far away as possible from the center, effectively removing each newly-sampled landmark by shrinking the distances matrix. During these steps, the closest point from the center and the furthest point from the center from the remaining data space gradually converge towards each other. This is because, during the steps, the distance between the nearest point from the center and the center will gradually increase while at the same time the distance between the furthest point from the center and the center will gradually decrease as the nearest point from the center and the furthest point from the center are effectively removed from the data space.  
each landmark that we obtain, generate two new landmarks using that landmark; and these two newly-generated landmarks can be the nearest point and the furthest point from that landmark. The resulting landmarks obtained from this procedure should then be very representative of the entire data set.  
 
 
 
For this procedure, we require the following items:
 
- The distances matrix containing the pair-wise distances between the data points. This matrix is constructed only once, and it is updated a number of times during the procedure of obtaining the landmarks.
 
- a queue containing landmarks that we obtain and add to it. For each point at the front of the queue, we use it to find two new landmarks, and then we delete it from the queue.
 
- a list containing the landmarks, to which we add all of the landmarks that we obtain.  


Before starting the procedure, we require the distances matrix containing the pair-wise distances between the data points. This matrix is constructed only once, and it is updated two times during each step.




Line 34: Line 24:


<noinclude>
<noinclude>
( NOTE: STILL BEING EDITED )

Revision as of 13:19, 28 October 2010

Project 1 : Sampling Landmarks Using a Convergence Approach

By: Yongpeng Sun


Intuition:

When we have a large number of data points and we wish to sample a small portion of them as landmarks, using simple random sampling (SRS) does not guarantee good results. The reason is that, when the sample is very small, SRS sometimes results in a sample that is a small cluster within the data space that would not be representative of all of the data. We would like our landmarks to be as representative of the data set as possible. Thus, we can, starting from the center of the data space, alternatively obtain landmarks that are as near as possible to the center and as far away as possible from the center, effectively removing each newly-sampled landmark by shrinking the distances matrix. During these steps, the closest point from the center and the furthest point from the center from the remaining data space gradually converge towards each other. This is because, during the steps, the distance between the nearest point from the center and the center will gradually increase while at the same time the distance between the furthest point from the center and the center will gradually decrease as the nearest point from the center and the furthest point from the center are effectively removed from the data space.

Before starting the procedure, we require the distances matrix containing the pair-wise distances between the data points. This matrix is constructed only once, and it is updated two times during each step.


The procedure for obtaining the landmarks can be described by the following steps:

Step 1: Find the point closet to the center, and add this point to the to-do queue and the landmarks. Update the distances matrix by removing the row and the column regarding this point.
Step 2: For the front-most point in the to-do queue, find the closest point and the furthest point from it, and add these two points to the to-do queue and the landmarks. Update the distances matrix by removing the row and the column regarding each of these two points. Remove the front-most point from the to-do-queue.
Step 3: If the number of landmarks suffices, then stop. Otherwise, go back to step 2.


Using 27-dimensional phoneme data, I would like to compare the efficiency of my method of sampling landmarks against the most common approach of sampling landmarks, which is by using simple random sampling (SRS).



( NOTE: STILL BEING EDITED )