f11Stat841proposal

Project 1 : Classification of Disease Status

By: Lai, ChunWei and Greg Pitt

For our classification project, we are proposing an application in the medical diagnosis field: For each patient or lab animal, there will be results from a large number of genetic and/or chemical tests. We should be able to predict the disease state of the patient/animal, based on the presence or absence of certain biomarkers and/or chemical markers.

Our project work will include the reduction of dimensionality and the development of one or more classification functions, with the objectives of minimizing the error rate and reducing the number of markers required to make good predictions. Our results could be used at the patient level, to help make accurate diagnoses, and at the population health level, to make epidemiological surveys of the prevalence of certain medical conditions. In both cases, the results should enable the healthcare system to make better decisions regarding the deployment of scarce healthcare resources.
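As a rough illustration of the two objectives above (fewer markers, low error rate), the sketch below ranks markers by a simple separation score and then classifies with a nearest-centroid rule on the top-ranked markers only. The scoring rule, the function names `rank_markers` and `nearest_centroid_predict`, and the toy data are all assumptions made for illustration; the proposal has not yet fixed a methodology.

```python
from statistics import mean, pstdev

def rank_markers(X, y):
    """Rank marker indices by |difference of class means| / (sum of class spreads)."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        spread = pstdev(pos) + pstdev(neg) + 1e-9  # small constant avoids division by zero
        scores.append(abs(mean(pos) - mean(neg)) / spread)
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)

def nearest_centroid_predict(X_train, y_train, x, keep):
    """Classify x using only the marker indices listed in `keep`."""
    centroids = {}
    for label in set(y_train):
        rows = [row for row, l in zip(X_train, y_train) if l == label]
        centroids[label] = [mean(r[j] for r in rows) for j in keep]

    def dist2(centroid):
        return sum((x[j] - cj) ** 2 for j, cj in zip(keep, centroid))

    return min(centroids, key=lambda label: dist2(centroids[label]))

# Hypothetical toy data: marker 0 separates the classes, marker 1 is noise.
X = [[0.1, 5.0], [0.2, 1.0], [0.9, 4.0], [1.0, 2.0]]
y = [0, 0, 1, 1]
order = rank_markers(X, y)
keep = order[:1]  # keep only the single best marker
prediction = nearest_centroid_predict(X, y, [0.85, 3.0], keep)
```

In practice the number of retained markers would be chosen by cross-validation rather than fixed in advance.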

Our methodology will be chosen soon, after we have seen a few more examples in class. If time permits, we will also attempt a novel classification procedure of our own design.

Currently we have access to a dataset from the SSC data mining section, and we hope to be able to get access to some similar, but larger, datasets before the end of the term.

The software tools that we use will probably include Matlab, Python, and R.

We would like to obtain publishable results if possible, but this is not a primary objective.

Project 2 : The Golden Retrieber

By Cameron Davidson-Pilon and Jenn Smith

Our goal for this project is to determine statistical results from the population of Twitter users that have a specific celebrity in their display picture. Our algorithm will scan through Twitter's display pictures, and attempt to determine whether a display picture features Canada's most famous icon: Justin Bieber. We hope that most images contain his trademark swoosh hairstyle, as much of our classification will rely on such handsome features.

After we determine, with some probability of error, that a user has a Bieber Display Picture (BDP), we can then do a statistical analysis on the sample population's tweets, hashtags, etc.

Applications of this algorithm include studying the Twitter behaviour of Bieber fans. It could also be used in an app for companies that want to target this demographic.

We will be using Matlab and Python.

Project 3 : Classifying Melanoma Images

By: Robert Amelard

Currently, the method of manually diagnosing melanoma is very subjective. Dermatologists essentially look at a skin lesion and determine from their experience if it looks malignant or benign. Some popular methods for diagnosis are the 7-point checklist and the ABCD rubric. They are both based on very subjective criteria, such as the "irregularity" of a skin lesion. My project will attempt to classify an input image containing a skin lesion into the class "melanoma" or "not melanoma" based on features that are regarded as high risk with regard to these rubrics. This will help doctors come to a more quantitative, objectively justifiable diagnosis of patients.

Project 4 : Classifying Trademarks

By: Chen Wang; YuanHong Yu; Jia Zhou

Our group decided to use statistical classification methods to distinguish various types of trademarks within an industry, and thus attempt to determine which color is most popular among manufacturers globally. We will first scan the selected trademarks; after obtaining all the desired trademarks, we can then do further statistical analysis. Our major goals are to help customers easily identify a specific industry just by looking at the color of the trademark, and to help new entrants to the market gain a better knowledge of their competitors. The possible software and tools we would like to use include R and Matlab.

Project 5 : Distributed Classification and Data Fusion in Wireless Sensor Networks

By : Mahmoud Faraj

Wireless sensor networks (WSNs) are a recently emerging technology consisting of a large number of battery-powered sensor nodes that are interconnected wirelessly and capable of monitoring environments, tracking targets, and performing many other critical applications. The design and deployment of this type of network are challenging tasks due to the imperfect nature of the communicating nodes (i.e., sensors) in the WSN. The dramatic depletion of a sensor's energy while performing its regular tasks (e.g. sensing, processing, receiving and transmitting information) is a major threat to the lifetime of the network, because the amount of energy in a sensor is limited by the sensor's physical dimensions. In addition, the death of some nodes partitions the network. As a result, some nodes become unable to communicate with others to accomplish the ultimate goal of the remotely deployed network.

In our research work, we propose to use one of the techniques learned in the course to perform distributed classification of a moving target (e.g. vehicle, animal, or person). Each sensor node will be able to classify the moving target and then track it in the WSN field. In order to conserve power and extend the lifetime of the network, we also propose distributed (in-network) data fusion using a Distributed Kalman Filter, where the data are fused in the network instead of having all the data transmitted to the fusion center (sink). Each node processes the data from its own set of sensors and communicates with neighbouring nodes to improve the classification and location estimates of the moving target. Simulation results will be provided to demonstrate the significant advantages of using distributed classification and data fusion, and to show the improvement of the WSN as a whole.
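To make the idea of in-network fusion concrete, here is a heavily simplified one-dimensional sketch. It is not the Distributed Kalman Filter itself: each node does a scalar Kalman measurement update on its own reading, then combines its estimate with those of its neighbours by inverse-variance weighting. The function names, the numbers, and the assumption that the nodes' errors are independent are all illustrative choices.

```python
def kalman_update(prior_mean, prior_var, measurement, meas_var):
    """Scalar Kalman measurement update for one node's own sensor reading."""
    gain = prior_var / (prior_var + meas_var)
    post_mean = prior_mean + gain * (measurement - prior_mean)
    post_var = (1.0 - gain) * prior_var
    return post_mean, post_var

def fuse_estimates(estimates):
    """Fuse (mean, variance) pairs by inverse-variance weighting.

    Assumes the errors of the contributing nodes are independent, which a
    real network sharing information may violate.
    """
    information = sum(1.0 / var for _, var in estimates)
    weighted = sum(m / var for m, var in estimates)
    return weighted / information, 1.0 / information

# Hypothetical target position along one axis (arbitrary units).
local = kalman_update(prior_mean=10.0, prior_var=4.0, measurement=12.0, meas_var=4.0)
neighbours = [(11.5, 2.0), (10.5, 2.0)]  # (mean, variance) received from two neighbours
fused_mean, fused_var = fuse_estimates([local] + neighbours)
```

Only the compact (mean, variance) summaries are exchanged between neighbours here, which is the energy argument for fusing in the network rather than sending raw data to the sink.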

Project 6 : Skin Classification

By : Jeffrey Glaister

My goal for this project is to classify segments of an oversegmented image as skin or skin lesion (a two-class problem). The overall goal and application is to automatically segment the skin lesion in pictures of patients at risk of melanoma, unsupervised. A standard segmentation algorithm will be applied to the image to oversegment it. Then, the resulting segments will be classified as normal skin or not normal skin and merged. Of particular interest are texture and colour classification, since skin and lesions differ slightly in texture and colour. Other possible features include spatial location and initial segment size. Time permitting, novel texture and colour classifiers will be investigated.

I have access to skin lesion images from a public database, some of which have been manually contoured to test the algorithm.

I will be using Matlab.

Project 7 : System Combination for Multi-Class Classification

By : Samson Hu, Blair Rose, Mikhail Targonski

Our goal is to combine SVM with other classification methods. We hope to optimize performance and increase the diversity of the resulting ensemble. We will use this method for voice recognition as seen in Ferror 09 in JMLR.
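One simple way to combine an SVM with other classifiers is majority voting over their predicted labels. The sketch below assumes each classifier has already produced one label per sample; the classifier names and outputs are hypothetical, and the proposal may well use a different combination rule.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier label lists into one label per sample."""
    combined = []
    for votes in zip(*predictions):
        # On ties, Counter.most_common keeps first-seen order, so list the
        # most trusted classifier (here the SVM) first.
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Hypothetical outputs of three classifiers on four samples, classes "a", "b", "c".
svm_out = ["a", "b", "c", "a"]
knn_out = ["a", "c", "c", "b"]
tree_out = ["b", "c", "c", "a"]
ensemble = majority_vote([svm_out, knn_out, tree_out])
```

Voting helps most when the member classifiers make different mistakes, which is why diversity in the ensemble matters.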

The possible software and tools include, but are not limited to, Matlab and R.

Project 8 : Reproducing the Results of a 1985 Paper

By : Mohamed El Massad

For the CM 763 final project, I intend to reproduce the results of Keller's 1985 IEEE paper in which the Fuzzy K-Nearest Neighbour algorithm was first introduced. To develop the algorithm, the authors of the paper introduced the theory of fuzzy sets into the well-known K-nearest neighbour decision rule. Their motivation was to address a limitation of that rule: it gives the training samples equal weight in deciding the class memberships of the patterns to be classified, regardless of the degree to which these samples are representative of their own classes. The authors proposed three methods for assigning initial fuzzy memberships to the samples in the training data set, and presented experimental results and comparisons to the crisp version of the algorithm showing that their proposed algorithm outperforms it in terms of error rate. They also show that their algorithm compares well against other more sophisticated pattern classification procedures, including the Bayes classifier and the perceptron. Finally, they develop a fuzzy analog to the nearest prototype algorithm.
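The core of the fuzzy K-nearest neighbour rule can be sketched in a few lines. Each of the k nearest training samples contributes its own class memberships, weighted by the inverse distance raised to the power 2/(m - 1), where m > 1 is the fuzzifier. The function name, the crisp initial memberships, and the toy data are assumptions for illustration; the paper's three membership initialisation schemes are not reproduced here.

```python
import math

def fuzzy_knn(train, memberships, x, k=3, m=2.0):
    """Return the class memberships of x under the fuzzy K-NN rule.

    train: list of feature vectors.
    memberships: list of dicts mapping class -> membership of each training sample.
    m: fuzzifier (> 1); m = 2 gives inverse squared-distance weights.
    """
    dists = sorted((math.dist(x, t), i) for i, t in enumerate(train))
    neighbours = dists[:k]
    exponent = 2.0 / (m - 1.0)
    # Guard against a zero distance when x coincides with a training sample.
    weights = [(1.0 / max(d, 1e-12) ** exponent, i) for d, i in neighbours]
    total = sum(w for w, _ in weights)
    classes = memberships[0].keys()
    return {c: sum(w * memberships[i][c] for w, i in weights) / total for c in classes}

# Hypothetical one-dimensional data with crisp initial memberships.
train = [[0.0], [1.0], [4.0]]
memberships = [{"A": 1.0, "B": 0.0}, {"A": 1.0, "B": 0.0}, {"A": 0.0, "B": 1.0}]
u = fuzzy_knn(train, memberships, [2.0], k=3, m=2.0)
predicted = max(u, key=u.get)
```

Unlike the crisp rule, the output is a membership vector, so a borderline sample can be recognised by memberships that are close to each other.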

The authors of the paper used FORTRAN to implement their proposed algorithms, but I will probably use MATLAB to do that and maybe some C as well.

Project 9 : Domain adaptation

By : Nika Haghtalab

Classification methods usually perform well when the testing and training data are drawn from the same distribution. However, in many applications we use labeled training data from some fixed source, but we aim to use our model on different testing targets. In this project, I will focus on domain adaptation, which addresses this problem in different settings. In particular, I will review the following papers:

J. Blitzer, R. McDonald, and F. Pereira. Domain adaptation with structural correspondence learning. In Proc. of EMNLP '06, 2006.

S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira. Analysis of Representations for Domain Adaptation. In Proc of NIPS'06, 2006

X. Glorot, A. Bordes, and Y. Bengio. Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proc. of ICML'11, 2011.

Project 10 : Classification of 3-dimensional objects

By : Kenneth Webster, Soo Min Kang, and Hang Su

Solvability of 3-dimensional physical systems. Can computers determine whether a jumbled Rubik's cube can be solved?

Project 11: Learning in Robots

By: Guoting (Jane) Chang

Background

One of the long term goals in robotics is for robots (such as humanoid robots) to become useful in human environments. In order for robots to be able to perform different services within society, they will need the ability to carry out new tasks and to adapt to changing environments. This in turn requires robots to have a capacity for learning. However, existing implementations of learning in robots tend to focus on specific tasks and are not easily extended to different tasks and environments [1].

Proposed Project

The purpose of the proposed work is to continue developing an initial framework for a learning system that is not task or environment specific. Such a generalized learning strategy should be achievable through hierarchical knowledge abstraction and appropriate knowledge representation. At the lowest level of the hierarchy, vision techniques will be used to extract features (such as colors, contours and position information) from raw input video data. On the next level of the hierarchy, the extracted features will be combined using clustering techniques such as self-organizing maps to perform object recognition. Furthermore, in order to learn to recognize motions shown in the videos, techniques such as incremental decision trees should be investigated for performing guided clustering (i.e., clustering based on some metric). At the higher levels of the hierarchy, the sequence of motions and objects involved in the video should be represented using connectionist models such as directed graphs.

The main focus of the proposed work for this project will be on the clustering of observed motions, as it is most closely related to the classification techniques that will be taught in class. An incremental decision tree is tentatively being considered for this, as the goal is to determine whether a newly observed motion belongs to a group of motions that has been seen before or whether it is a new motion and the knowledge representation should be updated to include it. Matlab or C/C++ code will most likely be used for this project.

Reference

[1] A. Barto, S. Singh, and N. Chentanez, "Intrinsically motivated learning of hierarchical collections of skills," in Third IEEE International Conference on Development and Learning. San Diego, California, USA: IEEE, 2004.

Project 12: Stock price forecasting

By: Zhe (Gigi) Wang, Chi-Yin (Johnny) Chow

Proposal

Under the Efficient Market Hypothesis (EMH), stock prices are completely unpredictable, and any information that is publicly available (the semi-strong form of EMH) is already reflected in the stock prices. However, an experienced trader can have a "feel" for, or a prediction of, future prices based on the history of prices or other factors.

In this project, we will apply component analysis techniques to identify or recognize any significant patterns and then employ the Support Vector Machine technique as the prediction model. After applying the model to the data, we plan to evaluate the accuracy of the prediction, and compare it with other state-of-the-art techniques.
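Before any component analysis or SVM can be applied, the price series has to be turned into feature vectors and targets. One common way, assumed here and not stated in the proposal, is a sliding window of lagged returns with the direction of the next return as the label. The function name `make_windows` and the prices are hypothetical.

```python
def make_windows(prices, window=3):
    """Turn a price series into (lagged returns, next-step direction) pairs."""
    returns = [(b - a) / a for a, b in zip(prices, prices[1:])]
    X, y = [], []
    for t in range(window, len(returns)):
        X.append(returns[t - window:t])       # the `window` most recent returns
        y.append(1 if returns[t] > 0 else 0)  # 1 = price went up on the next step
    return X, y

# Hypothetical closing prices.
prices = [100.0, 101.0, 100.0, 102.0, 103.0, 101.0]
X, y = make_windows(prices, window=3)
```

The rows of `X` would then be passed through the chosen component analysis step and the result fed to the SVM.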

References

[1] Ince, H., Trafalis, T.B., "Kernel principal component analysis and support vector machines for stock price prediction", Neural Networks, 2004. Proceedings. 2004 IEEE International Joint Conference.

[2] Chi-Jie Lu, Jui-Yu Wu, Cheng-Ruei Fan ; Chih-Chou Chiu, "Forecasting stock price using Nonlinear independent component analysis and support vector regression", Industrial Engineering and Engineering Management, 2009. IEEM 2009. IEEE International Conference.

Project 13: UFO Sightings

By Vishnu Pothugunta

There have been a lot of UFO sightings in the past decade. The goal is to use classification methods to predict where and when UFO sightings could happen. From past data about UFO sightings, we can also try to predict the shape of the UFO and the duration of the sighting.

Project 14: Identifying Accounting Fraud Using Statistical Learning

By Daniel Severn

Proposal

By constructing a data set of key financial ratios from financial statements, I hope to create a statistical classifier that can accurately identify companies that engage in accounting fraud. The following paper has a similar goal: http://www.waset.org/journals/ijims/v3/v3-2-13.pdf. I would like to use methods from class, and perhaps also the C4.5 method, since it provided a greatly superior classifier in the linked paper.

Relevance

This is of obvious relevance to securities commissions, but it is also useful to any investor. By adding such a classifier to their typical methods, investors can identify suspect companies and profit from the high probability of a stock decline when the fraud is uncovered. This in turn will reduce the stock price of companies that engage in heavy management and manipulation of their financial statements. That would reduce or remove the incentive to manage or manipulate financial statements, which benefits the entire financial system and thus the economy.

Challenges

This analysis requires a quality data set. Finding/creating such a data set may be challenging.

Project 15 : A survey on artificial neural networks (ANN)

By: Hojat Abdolanezhad & Carolyn Augusta

This survey will cover a brief history and the function of ANNs, an explanation of common terms used in the field (perceptron, back propagation, semantic net, etc.), and the general philosophy of neural networks. We will describe the types of ANNs researched in the past, leading into the present, and mention new trends in this area. An application of neural networks in classification will also be discussed.

A note: Artificial neural networks (ANNs) have been very useful for solving real-world problems. In economics, ANNs can be applied to predict profit, market trends, and price levels based on historical market data. In industry, engineers can apply ANNs to many nonlinear engineering problems, such as classification, prediction, and pattern recognition, where the tasks are very difficult to solve using conventional mathematical tools.

Useful papers:

H. White, “Learning in artificial neural networks: A statistical perspective,” Neural Comput., vol. 1, pp. 425–464, 1989.

E. Wan, “Neural network classification: A Bayesian interpretation,” IEEE Trans. Neural Networks, vol. 1, no. 4, pp. 303–305, 1990.

P.G. Zhang, “Neural Networks for Classification: A Survey,” IEEE Trans. Systems, Man, and Cybernetics, vol. 30, no. 4, pp. 451-462, 2000.

Christopher Bishop. Neural Networks for Pattern Recognition. Oxford University Press, London, UK, 1995.

Ripley, B. D. (1994a) Neural networks and related methods for classification (with discussion). Journal of the Royal Statistical Society series B 56, 409–456.

By Gobaan Raveendran & Daniel Nicoara

Our project will focus on crawling the internet for news articles from many different sources and then classifying them as coming from either left-wing or right-wing blogs. The project's focus will be on feature extraction, and on determining which features are important for classification.

For supervised data, we will either automatically assign classes based on domain and see if the article fits in the predicted domain, or we will use an external system such as a topic model.
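A minimal sketch of the two steps described above: extracting word-count features from an article, and assigning a provisional class from the article's domain. The tokenisation rule, the function names, and the `.example` domains are assumptions for illustration only.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split on non-letters, and count the tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def label_from_domain(url, domain_labels):
    """Assign a provisional class from the article's domain, or None if unknown."""
    host = re.sub(r"^https?://(www\.)?", "", url).split("/")[0]
    return domain_labels.get(host)

# Hypothetical mapping from source domain to class.
domain_labels = {"left.example": "left", "right.example": "right"}

features = bag_of_words("Taxes, taxes and more TAXES: the budget debate")
label = label_from_domain("http://www.left.example/story/1", domain_labels)
```

Word counts are only a starting point; deciding which features actually matter for the classification is the part the project proposes to study.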

Project 17: Classification of harp seals

By Zhikang Huang, Haoyang Fu, Mengfei Yang

Seals possess varied repertoires of underwater vocalisations. Geographic variation in call types has been reported for Weddell and bearded seal species, and the variations have been attributed to the isolation of breeding populations within these species.

Our project will focus on harp seals (Phoca groenlandica), and in particular the herds from Jan Mayen Island, the Gulf of St. Lawrence and the Front. Our goal is to classify harp seals using data obtained from underwater recordings of harp seals in these three herds. Nine hundred calls from each of the three herds will be used as our training set, and we will use three hundred calls as our test set to evaluate our predictive model.

We will use tree models or logistic regression models as our predictive models, and we will select the model with the smallest error rate. Our model will include the following seven variables:

ELEMDUR - this is the duration of a single element of a harp seal underwater vocalisation. It is measured in milliseconds.

INTERDUR - this is the time between elements in multiple element calls. It is measured in milliseconds. Note that not all calls have multiple elements so this variable is absent in single element calls. Where absent, a value of NA is recorded in the data.

NO_ELEM - this is the number of elements of the call. In harp seals all of the elements within a single call are similar and the spacing between them is constant.

STARTFREQ - this is the pitch at the start of the call or the highest pitch if the call has an extremely short duration (call shape 0 below).

ENDFRE - this is the pitch at the end of the call or the lowest pitch if the call has an extremely short duration (call shape 0).

WAVEFORM - this codes a series of waveform shapes (a plot of amplitude vs time) which lie more or less along a continuum. The waveform shapes are:

   frequency modulated sinusoidal          9
   slight frequency modulated and complex  8
   sinusoidal (pure tone)                  7
   complex (irregular waveform)            5
   amplitude pulses                        4
   burst pulses                            3
   knock (short burst pulse)               2
   click (very short duration)             1

CALLSHAP - this codes a series of call shapes as they would appear in a sonogram spectral analysis (a plot of frequency vs time).

HERD - this is the herd from which the recordings were obtained. The herds are coded as follows:

   Jan Mayen Island        1 
   Gulf of St. Lawrence    2 
   Front                   3

Reference

Terhune, J.M. (1994) Geographical variation of harp seal underwater vocalisations, Can. J. Zoology 72(5) 892-897.

Statistical Society of Canada, http://www.ssc.ca/en/education/archived-case-studies/seal-vocalisations.

Project 18 : Classifying Vehicle License Plates

By : Jun Kai Shan, Su Rong You

Our group will focus on using statistical methods to classify Canadian vehicle license plates. We will use MATLAB to find a way to classify the letters, digits and province on each license plate. We will take pictures of the license plates of different cars and use the pictures as our data. We will use R to do further statistical analysis.

Project 19 : Ice/No-Ice Classification

By : Steven Leigh

Modern satellites collect massive amounts of Earth imagery, which limits the usefulness of manual image interpretation. This project will attempt to tackle the problem of identifying ice and open water automatically from satellite imagery. Multimodal data will be considered, such as multi-polarization SAR data, optical data and thematic data, to name a few.

Project 20: A survey on Support Vector Machine

By Monsef Tahir

The support vector machine (SVM) is a training algorithm for learning classification and regression rules from data. For example, the SVM can be used to learn polynomial, radial basis function (RBF) and multi-layer perceptron (MLP) classifiers. SVMs are based on the structural risk minimisation principle, which is closely related to regularisation theory. This principle incorporates capacity control to prevent over-fitting and thus is a partial solution to the bias-variance trade-off dilemma.
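For reference, the standard soft-margin formulation and the three kernel families mentioned above can be written as follows. The notation (feature map \phi, penalty C, slack variables \xi_i, and the kernel parameters c, d, \gamma, \kappa, \theta) is a common textbook convention chosen here for illustration, not notation taken from this proposal.

```latex
\begin{aligned}
\min_{w,\,b,\,\xi}\quad & \tfrac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i \\
\text{subject to}\quad & y_i\bigl(w^{\top}\phi(x_i) + b\bigr) \ge 1 - \xi_i,
  \qquad \xi_i \ge 0, \quad i = 1,\dots,n
\end{aligned}

K_{\text{poly}}(x, z) = (x^{\top}z + c)^{d}, \qquad
K_{\text{RBF}}(x, z) = \exp\!\bigl(-\gamma\lVert x - z\rVert^{2}\bigr), \qquad
K_{\text{MLP}}(x, z) = \tanh\!\bigl(\kappa\, x^{\top}z + \theta\bigr)
```

The first term of the objective controls capacity (a small norm of w means a large margin), while C sets how heavily training errors are penalised.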

In this project, a survey will be made on SVM and a comparison with other tools in terms of classification and prediction will be performed as well.

Project 21: Good/Bad-day Classification

By Carl J. Wensater

The proposed project is to gather and parameterize data about daily activities, such as workouts, food intake, hours of sleep and amount of spare time, over the course of two weeks. This data will be used as training data for a two-class classifier that will try to distinguish good days from bad days. If the classification is successful, statistical analysis can be used to identify the crucial components of a good day.

Project 22: SVM-Based Classification of Peer-to-Peer Internet Traffic

By: Talieh Seyed Tabatabaei

In recent years, Peer-to-Peer (P2P) file-exchange applications have overtaken Web applications as the major contributor of traffic on the Internet. Recent estimates put the volume of P2P traffic at 70% of the total broadband traffic. P2P is often used for illegally sharing copyrighted music, video, games, and software. The legal ramifications of this traffic, combined with its aggressive use of network resources, have created a strong need for identification of network traffic by application type. This task, referred to as traffic classification, is a prerequisite to many network management and traffic engineering problems.

In this project, least squares support vector machines (LS-SVM) will be adopted to identify P2P traffic using flow-based statistical features.
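What distinguishes LS-SVM from the standard SVM is that training reduces to solving one linear system instead of a quadratic program. The sketch below builds that system with an RBF kernel and solves it by Gaussian elimination. The parameter names `reg` (regularisation) and `gamma` (kernel width), the helper names, and the two-feature toy flows are assumptions for illustration, not the features or settings of the actual project.

```python
import math

def rbf(a, b, gamma=1.0):
    """RBF kernel between two feature vectors."""
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(a, b)))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def lssvm_train(X, y, reg=10.0, gamma=1.0):
    """Solve [[0, y^T], [y, Omega + I/reg]] [b; alpha] = [0; 1] for b and alpha."""
    n = len(X)
    A = [[0.0] + [float(v) for v in y]]
    for i in range(n):
        row = [float(y[i])]
        for j in range(n):
            omega = y[i] * y[j] * rbf(X[i], X[j], gamma)
            row.append(omega + (1.0 / reg if i == j else 0.0))
        A.append(row)
    sol = solve(A, [0.0] + [1.0] * n)
    return sol[0], sol[1:]

def lssvm_predict(X, y, b, alpha, x, gamma=1.0):
    """Return +1 or -1 from the sign of the LS-SVM decision function."""
    score = sum(a * yi * rbf(xi, x, gamma) for a, yi, xi in zip(alpha, y, X)) + b
    return 1 if score >= 0 else -1

# Hypothetical scaled flow features; +1 = P2P, -1 = other traffic.
X = [[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [3.2, 2.9]]
y = [-1, -1, 1, 1]
b, alpha = lssvm_train(X, y)
pred_other = lssvm_predict(X, y, b, alpha, [0.1, 0.0])
pred_p2p = lssvm_predict(X, y, b, alpha, [3.1, 3.0])
```

A side effect of the least squares formulation is that every training flow receives a non-zero coefficient, so the solution is not sparse the way a standard SVM solution is.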

Project 23: A survey on a fast learning algorithm on deep belief nets

By: Seyed Seifi

Learning is difficult in densely connected, directed belief nets that have many hidden layers because it is difficult to infer the conditional distribution of the hidden activities when given a data vector. Variational methods use simple approximations to the true conditional distribution, but the approximations may be poor, especially at the deepest hidden layer, where the prior assumes independence. Also, variational learning still requires all of the parameters to be learned together and this makes the learning time scale poorly as the number of parameters increases.

I will survey the novel learning algorithm for deep belief nets proposed by Prof. Geoffrey E. Hinton.

http://www.cs.toronto.edu/~hinton/

http://www.cs.toronto.edu/~hinton/absps/ncfast.pdf

Project 24: Feature Extraction using SVM

By: Ad Tayal

Explore the idea of using support vector machines for joint feature extraction and classification.