From statwiki
Revision as of 18:53, 6 October 2011 by Cj2chow (talk | contribs) (Added Project 12)

Project 1 : Classification of Disease Status

By: Lai, ChunWei and Greg Pitt

For our classification project, we are proposing an application in the medical diagnosis field: For each patient or lab animal, there will be results from a large number of genetic and/or chemical tests. We should be able to predict the disease state of the patient/animal, based on the presence or absence of certain biomarkers and/or chemical markers.

Our project work will include the reduction of dimensionality and the development of one or more classification functions, with the objectives of minimizing the error rate and reducing the number of markers required to make good predictions. Our results could be used at the patient level, to help make accurate diagnoses, and at the population health level, to make epidemiological surveys of the prevalence of certain medical conditions. In both cases, the results should enable the healthcare system to make better decisions regarding the deployment of scarce healthcare resources.

Our methodology will be chosen soon, after we have seen a few more examples in class. If time permits, we will also attempt a novel classification procedure of our own design.

Currently we have access to a dataset from the SSC data mining section, and we hope to be able to get access to some similar, but larger, datasets before the end of the term.

The software tools that we use will probably include Matlab, Python, and R.

We would like to obtain publishable results if possible, but this is not a primary objective.

Project 2 : The Golden Retrieber

By Cameron Davidson-Pilon and Jenn Smith

Our goal for this project is to determine statistical results from the population of Twitter users that have a specific celebrity in their display picture. Our algorithm will scan through Twitter's display pictures and attempt to determine whether a display picture features Canada's most famous icon: Justin Bieber. We hope that most images contain his trademark swoosh hairstyle, as much of our classification will rely on such handsome features.

After we determine, with some probability of error, that a user has a Bieber Display Picture (BDP), we can then do a statistical analysis on the sample population's tweets, hashtags, etc.

Applications of this algorithm include analysing the Twitter behaviour of Bieber fans. It could also be used in an app for companies that want to target this demographic.

We will be using Matlab and Python.

Project 3 : Classifying Melanoma Images

By: Robert Amelard

The diagnosis of melanoma is currently very subjective. Two popular diagnostic methods are the 7-point checklist and the ABCD rubric. Both are based on very subjective criteria, such as the "irregularity" of a skin lesion. My project will attempt to classify an input image containing a skin lesion into the class "benign" or "malignant" based on features that are regarded as high risk in these rubrics. This will help doctors come to more justified diagnoses of patients.

Project 4 : Classifying Trademarks

By: Chen Wang; YuanHong Yu; Jia Zhou

Our group has decided to use statistical classification methods to distinguish various types of trademarks within an industry, and thus attempt to determine which color is most popular among manufacturers globally. We will first scan the selected trademarks; once all the desired trademarks have been obtained, we can carry out further statistical analysis. Our major goals are to help customers easily identify a specific industry just by looking at the color of a trademark, and to help new entrants to the market gain a better knowledge of their competitors. The software and tools we may use include R and Matlab.

Project 5 : Distributed Classification and Data Fusion in Wireless Sensor Networks

By : Mahmoud Faraj

Wireless sensor networks (WSNs) are an emerging technology consisting of a large number of battery-powered sensor nodes that are interconnected wirelessly and capable of monitoring environments, tracking targets, and performing many other critical applications. Designing and deploying such networks is challenging because of the imperfect nature of the communicating nodes (i.e., the sensors). Each sensor holds a limited amount of energy, constrained by its physical dimensions, and that energy is depleted rapidly by regular tasks (e.g. sensing, processing, receiving and transmitting information), which threatens to shorten the lifetime of the network. The death of some nodes also partitions the network, leaving other nodes unable to communicate and so unable to accomplish the ultimate goal of the remotely deployed network.

In our research work, we propose to use one of the techniques learned in the course to perform distributed classification of a moving target (e.g. vehicle, animal, or person). Each sensor node will be able to classify the moving target and then track it in the WSN field. In order to conserve power and extend the lifetime of the network, we also propose distributed (in-network) data fusion using a Distributed Kalman Filter, where the data are fused in the network instead of all being transmitted to the fusion center (sink). Each node processes the data from its own set of sensors and communicates with neighbouring nodes to improve the classification and location estimates of the moving target. Simulation results will be provided to demonstrate the advantages of distributed classification and data fusion and to show the improvement of the WSN as a whole.
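The proposal gives no equations for the fusion step, so the following is only a minimal sketch of the underlying idea, not the Distributed Kalman Filter itself. It assumes each node reports a scalar position estimate with a known variance and that the errors are independent; under those assumptions the estimates can be combined by inverse-variance weighting. The function name `fuse_estimates` is hypothetical.

```python
def fuse_estimates(estimates):
    """Fuse independent scalar estimates given as (mean, variance) pairs.

    Assumption (not from the proposal): errors are independent, so the
    information (1 / variance) of each estimate simply adds up.
    """
    # Total information contributed by all nodes.
    information = sum(1.0 / var for _, var in estimates)
    fused_var = 1.0 / information
    # Each mean is weighted by its own information.
    fused_mean = fused_var * sum(mean / var for mean, var in estimates)
    return fused_mean, fused_var


# Two neighbouring nodes report the target position with equal uncertainty.
mean, var = fuse_estimates([(10.0, 4.0), (12.0, 4.0)])
print(mean, var)
```

The fused variance is never larger than the smallest input variance, which is one way to see why exchanging estimates with neighbours can improve a node's location estimate without sending raw data to the sink.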

Project 6 : Skin Classification

By : Jeffrey Glaister

My goal for this project is to classify segments of an oversegmented image as skin or skin lesion (a two-class problem). The overall goal and application is to automatically segment the skin lesion in pictures of patients at risk of melanoma, unsupervised. A standard segmentation algorithm will be applied to the image to oversegment it. Then, the resulting segments will be classified as normal skin or not normal skin and merged. Of particular interest are texture and colour classification, since skin and lesions differ slightly in texture and colour. Other possible features include spatial location and initial segment size. Time permitting, novel texture and colour classifiers will be investigated.

I have access to skin lesion images from a public database, some of which have been manually contoured to test the algorithm.

I will be using Matlab.

Project 7 : System Combination for Multi-Class Classification

By : Samson Hu, Blair Rose, Mikhail Targonski

Our goal is to combine SVM with other classification methods. We hope to optimize performance and increase the diversity in the resulting ensemble. We will use this method for voice recognition as seen in Ferror 09 in JMLR.

The possible software and tools include, but are not limited to, Matlab and R.
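The proposal does not say how the SVM will be combined with the other classifiers. As one illustration only, the sketch below uses plurality voting over the labels predicted by each system; the voting rule and the names `majority_vote` and `combine` are assumptions, not the authors' method.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the most classifiers.

    Ties go to the label that appears first in `predictions`.
    """
    counts = Counter(predictions)
    return counts.most_common(1)[0][0]


def combine(per_classifier_predictions):
    """Combine label lists (one list per classifier) sample by sample."""
    # zip(*...) regroups the labels so each tuple holds all votes for one sample.
    return [majority_vote(votes) for votes in zip(*per_classifier_predictions)]


# Hypothetical outputs of three classifiers on three samples.
svm_labels = ["a", "b", "c"]
other_labels_1 = ["a", "c", "c"]
other_labels_2 = ["b", "b", "c"]
print(combine([svm_labels, other_labels_1, other_labels_2]))
```

Voting only helps when the combined systems make different mistakes, which is why diversity in the ensemble matters.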

Project 8 : Reproducing the Results of a 1985 Paper

By : Mohamed El Massad

For the CM 763 final project, I intend to reproduce the results of Keller's 1985 IEEE paper in which the Fuzzy K-Nearest Neighbour algorithm was first introduced. To develop the algorithm, the authors of the paper introduced the theory of fuzzy sets into the well-known K-nearest neighbour decision rule. Their motivation was to address a limitation of this rule: it gives every training sample equal weight in deciding the class memberships of the patterns to be classified, regardless of the degree to which these samples are representative of their own classes. The authors proposed three methods for assigning initial fuzzy memberships to the samples in the training data set, and presented experimental results and comparisons showing that their algorithm outperforms the crisp version in terms of error rate. They also show that their algorithm compares well against other, more sophisticated pattern classification procedures, including the Bayes classifier and the perceptron. Finally, they develop a fuzzy analog to the nearest prototype algorithm.

The authors of the paper used FORTRAN to implement their proposed algorithms, but I will probably use MATLAB to do that and maybe some C as well.
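For orientation, here is a minimal Python sketch of the fuzzy K-NN decision rule as it is commonly stated: the membership of a query point in each class is the average of its K nearest neighbours' memberships, weighted by inverse distance raised to the power 2/(m - 1). Python is used only for illustration; the function name `fuzzy_knn`, the default values of `k` and `m`, and the small `eps` guard are assumptions of this sketch rather than details taken from the paper.

```python
import math


def fuzzy_knn(train_points, train_memberships, query, k=3, m=2.0):
    """Return the class-membership vector of `query`.

    train_points: list of feature vectors (lists of floats).
    train_memberships: one membership vector per training point; a one-hot
        vector corresponds to a "crisp" initial label.
    m: fuzzifier (> 1); larger values weight all neighbours more evenly.
    """
    # Euclidean distance from the query to every training sample.
    dists = []
    for i, p in enumerate(train_points):
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(p, query)))
        dists.append((d, i))
    dists.sort()
    neighbours = dists[:k]

    eps = 1e-12  # guard against division by zero (assumption of this sketch)
    weights = [1.0 / (d ** (2.0 / (m - 1.0)) + eps) for d, _ in neighbours]
    total = sum(weights)

    n_classes = len(train_memberships[0])
    return [
        sum(w * train_memberships[i][c] for w, (_, i) in zip(weights, neighbours)) / total
        for c in range(n_classes)
    ]


points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
labels = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]  # crisp initial memberships
print(fuzzy_knn(points, labels, [0.2, 0.4], k=3))
```

Unlike the crisp rule, the output is a vector of memberships rather than a single label, so a query lying between two classes is reported as ambiguous instead of being forced into one of them.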

Project 9 : Domain adaptation

By : Nika Haghtalab

Classification methods usually perform well when the testing and training data are drawn from the same distribution. However, in many applications we use labeled training data from some fixed source but aim to apply our model to different testing targets. In this project, I will focus on domain adaptation, which addresses this problem in different settings. In particular, I will review the following papers:

J. Blitzer, R. McDonald, and F. Pereira. Domain adaptation with structural correspondence learning. In Proc. of EMNLP '06, 2006.

S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira. Analysis of Representations for Domain Adaptation. In Proc. of NIPS '06, 2006.

X. Glorot, A. Bordes, and Y. Bengio. Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proc. of ICML '11, 2011.

Project 10 : Classification of 3-dimensional objects

By : Kenneth Webster, Soo Min Kang, and Hang Su

Solvability of 3-dimensional physical systems. Can computers determine whether a jumbled Rubik's Cube can be solved?

Project 11: Learning in Robots

By: Guoting (Jane) Chang


One of the long term goals in robotics is for robots (such as humanoid robots) to become useful in human environments. In order for robots to be able to perform different services within society, they will need the ability to carry out new tasks and to adapt to changing environments. This in turn requires robots to have a capacity for learning. However, existing implementations of learning in robots tend to focus on specific tasks and are not easily extended to different tasks and environments [1].

Proposed Project

The purpose of the proposed work is to continue developing an initial framework for a learning system that is not task or environment specific. Such a generalized learning strategy should be achievable through hierarchical knowledge abstraction and appropriate knowledge representation. At the lowest level of the hierarchy, vision techniques will be used to extract features (such as colors, contours and position information) from raw input video data. On the next level of the hierarchy, the extracted features will be combined using clustering techniques such as self-organizing maps to perform object recognition. Furthermore, in order to learn to recognize motions shown in the videos, techniques such as incremental decision trees should be investigated for performing guided clustering (i.e., clustering based on some metric). At the higher levels of the hierarchy, the sequence of motions and objects involved in the video should be represented using connectionist models such as directed graphs.

The main focus of the proposed work for this project will be on the clustering of observed motions, as it is most closely related to the classification techniques that will be taught in class. An incremental decision tree is tentatively being considered for this, as the goal is to determine whether a newly observed motion belongs to a group of motions that has been seen before or whether it is a new motion and the knowledge representation should be updated to include it. Matlab or C/C++ code will most likely be used for this project.


[1] A. Barto, S. Singh, and N. Chentanez, "Intrinsically motivated learning of hierarchical collections of skills," in Third IEEE International Conference on Development and Learning. San Diego, California, USA: IEEE, 2004.

Project 12: Stock price forecasting

By: Zhe (Gigi) Wan, Chi-Yin (Johnny) Chow


Under the Efficient Market Hypothesis (EMH), stock prices are unpredictable because available information is already reflected in them: past prices under the weak form, and all publicly available information under the semi-strong form. However, an experienced trader can have a "feel" for, or a prediction of, future prices based on the price history or other factors.

In this project, we will apply component analysis techniques to identify or recognize any significant patterns and then employ the Support Vector Machine technique as the prediction model. After applying the model to the data, we plan to evaluate the accuracy of the prediction, and compare it with other state-of-the-art techniques.
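The proposal does not fix which component analysis technique will be used, and the cited works use kernel and nonlinear variants. As a simplified illustration of the first stage only, the sketch below finds the first principal component of ordinary linear PCA by power iteration on the sample covariance matrix. The SVM stage is omitted, and the function name `first_principal_component` and the toy data are assumptions.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return a unit vector along the direction of largest variance.

    rows: list of observations, each a list of floats of equal length.
    Uses power iteration, which assumes the start vector is not orthogonal
    to the leading eigenvector of the covariance matrix.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [
        [sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # no variance along v; nothing to extract
        v = [x / norm for x in w]
    return v


# Toy two-feature data lying on the line y = 2x (hypothetical, not real prices).
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
print(first_principal_component(data))
```

Projecting each observation onto the leading components gives a lower-dimensional input for the prediction model, which is the role component analysis plays in the pipeline described above.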


[1] Ince, H., Trafalis, T.B., "Kernel principal component analysis and support vector machines for stock price prediction", Neural Networks, 2004. Proceedings. 2004 IEEE International Joint Conference.

[2] Chi-Jie Lu, Jui-Yu Wu, Cheng-Ruei Fan, Chih-Chou Chiu, "Forecasting stock price using Nonlinear independent component analysis and support vector regression", Industrial Engineering and Engineering Management, 2009. IEEM 2009. IEEE International Conference.