proposal Fall 2010


Project 1 : Classifying New Data Points Using An Outlier Approach

By: Yongpeng Sun


Intuition:

In LDA, we assign a new data point to the class whose center is nearest. At the same time, however, it is desirable to assign a new data point to the class in which it is less of an outlier than in any other class. To this end, a new data point should be closer to the center of its assigned class than to the center of any other class and, after suitable weighting, also closer to the directions of variation of its assigned class.


Suppose there are two classes 0 and 1 both having [math]\displaystyle{ \,d }[/math] dimensions, and a new data point is given. To assign the new data point to a class, we can proceed using the following steps:

Step 1: For each class, find its center and its [math]\displaystyle{ \,d }[/math] directions of variation.


Step 2: For the new data point, with regard to each of the two classes, sum up the point's distance to the center and the point's distance to each of the [math]\displaystyle{ \,d }[/math] directions of variation weighted (multiplied) by the ratio of the amount of variation in that direction to the total variation in that class.


Step 3: Assign the new point to the class having the smaller of these two sums.


These 3 steps can be easily generalized to the case where the number of classes is more than 2 because, to assign a new data point to a class, we only need to know, with regard to each class, the sum as described above.
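
The three steps can be sketched in Python as follows. This is only an illustration under stated assumptions, not the author's implementation: the "directions of variation" are taken to be the eigenvectors of the class covariance matrix, the "amount of variation" is the corresponding eigenvalue, and the "distance to a direction" is read as the perpendicular distance from the point to the line through the class center along that direction. The function names are hypothetical.

```python
import numpy as np

def class_stats(X):
    """Step 1: center, directions of variation, and their relative weights."""
    center = X.mean(axis=0)
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    eigvals, eigvecs = np.linalg.eigh(cov)   # columns of eigvecs are unit directions
    eigvals = np.clip(eigvals, 0.0, None)    # guard against tiny negative values
    weights = eigvals / eigvals.sum()        # share of total variation (assumes it is nonzero)
    return center, eigvecs, weights

def outlier_score(x, stats):
    """Step 2: distance to the center plus weighted distances to each direction."""
    center, eigvecs, weights = stats
    diff = x - center
    score = np.linalg.norm(diff)
    for j in range(eigvecs.shape[1]):
        v = eigvecs[:, j]
        # Assumed reading: perpendicular distance to the line center + t * v.
        residual = diff - np.dot(diff, v) * v
        score += weights[j] * np.linalg.norm(residual)
    return score

def classify(x, class_data):
    """Step 3: pick the class with the smallest sum; works for any number of classes."""
    scores = {label: outlier_score(x, class_stats(X)) for label, X in class_data.items()}
    return min(scores, key=scores.get)
```

Here `class_data` is a dictionary mapping each class label to an array with one row per training point, so the same call covers the two-class case and the generalization to more classes.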


I would like to evaluate the effectiveness of my idea / algorithm as compared to LDA and QDA and other classifiers using data sets in the UCI database ( http://archive.ics.uci.edu/ml/ ).

Project 2: Apply Hadoop Map-Reduce to a Classification Method

By: Maia Hariri, Trevor Sabourin, and Johann Setiawan

Develop map-reduce processes that can properly classify large distributed data sets.

Potential projects:

1. Use Hadoop Map-Reduce to implement the Support Vector Machine (Kernel) classification algorithm.
2. Use Hadoop Map-Reduce to implement the LDA classification algorithm on a novel problem (e.g. forensic identification of handwriting.)
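
As a rough illustration of why LDA fits the map-reduce pattern, the sketch below simulates the two phases in plain Python; it does not use any Hadoop API. It assumes each record is a `(label, features)` pair, and the function names are hypothetical. The map step emits per-record counts, sums, and outer products keyed by class, and the reduce step adds them up, which is enough to recover the class means, priors, and pooled covariance that LDA needs.

```python
import numpy as np

def map_record(record):
    """Map: emit (class label, (count, sum, sum of outer products)) for one record."""
    label, features = record
    x = np.asarray(features, dtype=float)
    return label, (1, x, np.outer(x, x))

def reduce_stats(a, b):
    """Reduce: combine two partial statistics for the same class by addition."""
    return a[0] + b[0], a[1] + b[1], a[2] + b[2]

def lda_statistics(records):
    """Simulate map and reduce locally, then derive the LDA parameters."""
    grouped = {}
    for record in records:
        label, value = map_record(record)
        grouped[label] = reduce_stats(grouped[label], value) if label in grouped else value
    total = sum(n for n, _, _ in grouped.values())
    d = len(next(iter(grouped.values()))[1])
    means, priors = {}, {}
    pooled = np.zeros((d, d))
    for label, (n, s, ss) in grouped.items():
        mu = s / n
        means[label] = mu
        priors[label] = n / total
        pooled += ss - n * np.outer(mu, mu)   # within-class scatter
    pooled /= (total - len(grouped))          # pooled covariance estimate
    return means, priors, pooled
```

Because the reduce step is a plain sum, partial results from different nodes can be combined in any order, which is the property a distributed implementation would rely on.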


Project 3 : Hierarchical Locally Linear Classification

By: Pouria Fewzee

Extending an intrinsically two-class classifier to the multi-class case can be challenging, as the common approaches either leave ambiguous regions in the feature space or are computationally inefficient. Linear classifiers and support vector machines are two well-known examples of intrinsically two-class classifiers, and the k-1 and k(k-1)/2 hyperplane schemes are the two most common approaches for extending them to multi-class tasks. The k-1 scheme suffers from leaving ambiguous regions in the feature space, and although the k(k-1)/2 scheme does not have this problem, it is not computationally efficient. Hierarchical classification is proposed as a solution. It not only improves the efficiency of the classifier, but the resulting tree could also give specialists new insight into the field.

Another purpose of this project is to build a general-purpose classifier that adapts to different patterns as much as required. To realize this goal, locally linear classification is proposed. Locality in the classifier design is achieved by combining fuzzy computation tools with binary decision trees.


Project 4 : Cluster Ensembles for High Dimensional Clustering

By: Chun Bai, Lisha Yu

Clustering for unsupervised data exploration and analysis has been investigated for decades in machine learning. Its performance is directly influenced by the dimensionality of the data. High-dimensional data pose two fundamental challenges for clustering algorithms. First, the data tend to be sparse in a high-dimensional space. Second, there often exist noisy features that may mislead the clustering algorithm.

The paper studies cluster ensembles for high dimensional data clustering. Three different approaches to constructing cluster ensembles are examined:

1. Random projection based approach
2. Combining PCA and random subsampling
3. Combining random projection with PCA

Moreover, four different consensus functions for combining the clusterings of the ensemble are examined:

1. Consensus Functions Using Graph Partitioning
-Instance-Based Graph Formulation (IBGF)
-Cluster-Based Graph Formulation (CBGF)
-Hybrid Bipartite Graph Formulation (HBGF)
2. Consensus Function Using Centroid-based Clustering (KMCF)
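
To make the idea of a consensus function more concrete, the sketch below builds a co-association matrix from several base clusterings: entry (i, j) is the fraction of clusterings that put instances i and j in the same cluster. This is assumed here to be the kind of instance similarity that an instance-based graph formulation would partition; it is an illustration only, not the procedure from the paper, and the function name is hypothetical.

```python
import numpy as np

def co_association(labelings):
    """Fraction of base clusterings in which each pair of instances shares a cluster.

    labelings: array-like of shape (num_clusterings, num_instances) with cluster labels.
    """
    L = np.asarray(labelings)
    same = L[:, :, None] == L[:, None, :]   # shape (num_clusterings, n, n)
    return same.mean(axis=0)
```

For example, with the two base clusterings `[0, 0, 1]` and `[0, 1, 1]`, instances 0 and 1 are grouped together in one of the two runs, so their co-association is 0.5.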

Using datasets from UCI, the paper shows that ensembles generated by random projection perform better than those generated by PCA, and further that this can be attributed to the capability of random projection to produce diverse base clusterings. It also shows that a recent consensus function based on bipartite graph partitioning achieves the best performance.

Project 5 : Texture Classification Using Compressive Sensing

By: Mohammad Rostami

Analysis of textures has many potential applications, such as remote sensing, biomedical image analysis, and surface inspection. One of the major tasks in this area is texture classification. A major problem in texture analysis is that natural textures are often not uniform, due to variations in scale, orientation, or other aspects of visual appearance, which makes textures harder to work with than natural images. We can generally model natural images as deterministic signals, while statistical models are more successful for texture synthesis. Although many successful methods have been proposed for texture classification [1] that have considerably alleviated the aforementioned problems, there is still a need to develop better algorithms.

Similar to other classification problems, we will have a dimension reduction step. This step is necessary to make the feature vector invariant to the above-mentioned variations. Many methods have been proposed in the literature to achieve this task. Most commonly, some kind of transform is used to map the signal to a lower-dimensional space, e.g. wavelets, FDA, and statistical methods. Recently, compressive sensing (CS) has also been used for this very purpose [2]. Compressive sensing is a technique for finding sparse solutions to underdetermined linear systems [3]. Consider a linear system with more unknowns than equations:

[math]\displaystyle{ \mathbf{Y}_{n \times 1} = \Phi_{n \times m}\,\mathbf{X}_{m \times 1}, \qquad n \lt m \qquad (1) }[/math]

This problem obviously does not have a unique solution, but it can be shown that if we assume sparsity as prior knowledge about the solution, we can obtain a unique answer. This means that we can specify a sparse signal uniquely in lower dimensions, which demonstrates that CS has the potential to be used for dimension reduction. In contrast to classic CS, where one is required to solve (1), here we would instead like to design [math]\displaystyle{ \Phi }[/math] so as to transform our data to a lower-dimensional space with the possibility of retrieving it uniquely. The main problem is that textures are not sparse signals, so before this step we need to find a method to transform textures into sparse signals. After this step, we can use existing methods for classification. In this project I plan to carry out an extensive overview of existing texture classification methods, compressive sensing, and the connection between the two. I will survey current methods that use CS as a tool for classification and compare them in various aspects. In addition, I will implement a novel idea for the classification task, similar to the method proposed in [2]. Principally, I hope to achieve better results by taking advantage of texture signal properties.
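
For reference, the recovery problem that classic CS solves for the linear system above is commonly written in the CS literature [3] as below. The proposal does not spell this out; the form shown assumes noiseless measurements and uses the same symbols as the linear system.

[math]\displaystyle{ \hat{\mathbf{X}} = \arg\min_{\mathbf{X}} \|\mathbf{X}\|_0 \quad \text{subject to} \quad \mathbf{Y} = \Phi\mathbf{X} }[/math]

Since minimizing the number of nonzero entries is combinatorial, it is usually replaced by the convex relaxation

[math]\displaystyle{ \hat{\mathbf{X}} = \arg\min_{\mathbf{X}} \|\mathbf{X}\|_1 \quad \text{subject to} \quad \mathbf{Y} = \Phi\mathbf{X} }[/math]

which recovers the sparse solution under suitable conditions on [math]\displaystyle{ \Phi }[/math].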

[1] T. Ojala, M. Pietikainen, T. Maenpaa, “Multiresolution gray-scale and rotation invariant texture classification with local binary patterns,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, pp. 971-987, 2002.

[2] L. Liu, P. Fieguth, “Texture classification using compressed sensing,” 2010 Canadian Conference on Computer and Robot Vision (CRV), 2010.

[3] D. Donoho, “Compressed sensing,” IEEE Transactions on Information Theory, vol. 52, pp. 1289-1306, 2006.

Project 6 : Observation Conditions to Localization Accuracy Association

By: Haitham Amar


Vehicle localization is a key issue that has recently attracted a significant amount of attention in a wide range of applications. Navigation, vehicle tracking, Emergency Calling (eCall), and Location Based Services (LBS) are examples of emerging applications that have a great demand for location information. The Global Positioning System (GPS) has been the de facto standard solution for the vehicle localization problem. Nevertheless, GPS-based localization is inaccurate and unreliable due to GPS' inherent positional errors, such as poor performance in vertical positioning and the prevalent horizontal movement, in addition to anomalies caused by line-of-sight occlusions and multipath issues in urban canyons.


It is well recognized in the literature that the performance of GPS receivers behaves stochastically and is influenced by the observation conditions. For example, localization accuracy is high in open-sky environments; in the presence of high-rise buildings, however, the accuracy is low and sometimes hard even to define. Moreover, the GPS satellite signals may vanish if the vehicle goes through an underpass or a tunnel. Many applications cannot tolerate the lack of consistent localization accuracy. Therefore, recent research has attempted to evaluate the localization performance of various positioning techniques as a first step toward improving it.


Since GPS technology is a crucial component of most vehicle localization techniques, this project will focus on classifying the performance of a GPS receiver while monitoring certain parameters that are sensitive to the observation conditions. In the literature, the sensitivity of two parameters has been investigated, namely the Signal to Noise Ratio (SNR) of the signal received from the GPS satellites and the Dilution of Precision (DOP) value. The SNR is sensitive to the local environment of the receiver (high-rise buildings, trees, open sky, etc.), whereas the DOP reflects the quality of the geometric arrangement of the GPS satellites used as reference points in the localization process. Nevertheless, when the SNR and DOP values are compared with the localization errors, in many cases it is not trivial to derive a mapping function or classifier that indicates the performance of the receiver.


Objectives of the project:

•Introducing more features similar to SNR and DOP, such as the number of satellites used in the localization process, the mean and the variance of the SNR, the change in the satellites’ constellation, the speed of the vehicle, etc. These features are expected to support the discriminant analysis.

•Constructing a rich learning database of GPS receiver measurements.

•Implementing different classification techniques to classify a number of performance margins for the GPS localizations. These classification techniques may not be limited to the ones we have been taught in the course.

•Studying the sensitivity of the classification techniques to the features that will be introduced.
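
As an illustration of the feature extraction described in the objectives, the sketch below turns one measurement epoch into a feature dictionary. The input layout (a list of per-satellite SNR values plus a DOP value and a vehicle speed) and all names are assumptions made for this example; the actual receiver output format is not specified here.

```python
import numpy as np

def epoch_features(snr_values, dop, speed):
    """Build a feature dictionary for one GPS measurement epoch.

    snr_values: per-satellite SNR readings for the satellites used in the fix
    dop: Dilution of Precision reported for the epoch
    speed: vehicle speed for the epoch
    """
    snr = np.asarray(snr_values, dtype=float)
    has_data = snr.size > 0
    return {
        "num_satellites": int(snr.size),
        "snr_mean": float(snr.mean()) if has_data else float("nan"),
        "snr_var": float(snr.var()) if has_data else float("nan"),  # population variance
        "dop": float(dop),
        "speed": float(speed),
    }
```

Stacking these dictionaries over many epochs, together with the measured localization error, would give the kind of labeled table on which the classification techniques could be compared.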


Challenges of the project:

•Sufficient GPS data need to be collected under different environmental conditions.

•Specifying the GPS performance margins that could be provided by the receiver in the different environment conditions.

•Despite our enthusiasm for this research, there is a risk that this new experimental work yields no major contribution relative to the time spent on the investigation.


Current status:

•A GPS receiver is already in our hands, which will allow us to collect as much data as we need.

•The communication with the hardware is already set up, and we were able to capture some data on campus under various environmental conditions.

•We are now working on extracting the features from the raw data collected by the GPS receiver.