statf09841Proposal: Difference between revisions

From statwiki
Jump to navigation Jump to search
m (Conversion script moved page Statf09841Proposal to statf09841Proposal: Converting page titles to lowercase)
 
(30 intermediate revisions by 10 users not shown)
Line 27: Line 27:
<noinclude>
<noinclude>


==Project 2 : ==
==Project 2 : A modifeid CART algorithm with soft nodes==
</noinclude>
</noinclude>
===By: Jiheng Wang===
===By: Jiheng Wang===
The tree growing algorithms are often claimed to emulate regression approaches in their ability to handle both continuous and discrete variables. However, the treatment of continuous variables remains somewhat unsatisfactory. For example, the search of the optimal question for a continuous variables is usually reduced to the search of a cut point among all the observed values.


We know that the classical CART algorithm for generating a decision tree is known as the recursive process: given the data represented at a node, either declares that the node to be a leaf or searches for another question to use to split the data into subsets.
We will develop a modified CART algorithm and compare it to the standard tree algorithm CART on some classical data sets, which are freely available from the Internet. A natural approach to tree growing is replacing hard nodes with soft ones. To be specific, when the decision is based on a continuous variable, we apply for a probabilistic decision function instead of a simple binary split. Basically, the logistic function is a good choice of the decision function with respect to its sigmoid shape. Our first aim is to develop a efficient algorithm for computing the probabilistic decision function at every soft node. In CART, the tree grows following a greedy search to maximizes the Information Gain. Here we still use it as our criterion with a little bit of generalization. The following work will compare the performance of hard nodes and soft nodes due to the fact that soft nodes are not guaranteed to yield a better solution. Thus a strategy between the soft nodes and hard nodes, or soft trees and and hard trees should be discussed.
==Project 3 : Identifying process faults of waste water treatment plant ==
</noinclude>
===By: Yao Yao, Min Chen, Jiaxi Liang, Zhenghui Wu===
===Objective=== 
To classify the operational state of the plant in order to predict faults through the state variables of the plant at each of the stages of the treatment process.
===Background Information===
Liquid waste treatment plant and system operators, also known as waste water treatment plant and system operators, remove harmful pollutants from domestic and industrial liquid waste so that it is safe to return to the environment. There are four stages in the water treatment process: plant input, primary settler, secondary settler and plant output.
Operators read, interpret, and adjust meters and gauges to make sure that plant equipment and processes are working properly. Operators control chemical-feeding devices, take samples of the water or waste water, perform chemical and biological laboratory analyses, and adjust the amounts of chemicals in the water. We use sensors to sample and measure water quality.
===Data Description===
This dataset comes from the daily measures of sensors in a urban waste water treatment plant. The data includes 527 data points and 38 variables, recording the water quality of each stage of the treatment process.
===Techniques===
*Principal Component Analysis(PCA)
*Locally Linear Embedding(LEE)
*Isomap
*Cluster Analysis and Conceptual Clustering
*Linear Discriminant Analysis(LDA/FLDA)
*Linear and Logistic Regression
*Neural Network(NN)
</noinclude>
==Project 4: Computer character recognition of CAPTCHA==
===By: Weibei Li, Sabrina Bernardi, Nick Murdoch, Joycelin Karel===
===Background===
Completely Automated Public Test to tell Computers and Humans Apart (CAPTCHA) is a challenge response test widely used over the internet to judge if a response is generated by a human being. In the area of network security, CAPTCHA has been intensively employed as a protection for servers against Deny Of Service (DOS) attacks, i.e. compromised computers, called Bots, could not pass this test, thus their flood requests will not be accepted by the server. Another typical use of CAPTCHA is in the setting of online games where cheats happen everyday. To discriminate against the group of users who are using cheating programs, CAPTCHA, a special test in which cheating programs have only negligible probability to pass, is given to each player in the game. Only those who could give correct responses would be permitted to stay in the game.  In addition, to defeat attackers who are interested in exhaustively searching the passwords of certain email accounts, service providers such as Gmail, Hotmail, and Yahoo!, always use CAPTCHA as one of the most signifcant security mechanisms.
The design of CATPCHA is an art. Roughly speaking, the information produced by a CAPTCHA system, regardless of its form (images of distorted letters or images of certain kind of puzzles), should satisfy the following requirements:
# Current computer programs are unable to solve them accurately in polynomial time.
# Most humans can easily solve then in a short period of time.
# Does not rely on the type of CAPTCHA being new to the attacker.
===Challenge===
It is clear that modern CAPTCHA are made in such a way that computers cannot decode it. Thus another hot research topic is how to crack a CAPTCHA system. Many hackers around the world have tried to use computers to decipher these images. To the best of our knowledge, the success rate remains very low. Given some examples, in February 2008 it was reported that spammers had achieved a success rate of 30% to 35% in responding to CAPTCHAs for Microsoft's Live Mail service and a success rate of 20% against Google's Gmail CAPTCHA.
A number of research projects have attempted (often with success) to beat visual CAPTCHAs by creating programs that contain the following functionality [http://en.wikipedia.org/wiki/CAPTCHA]:
* Pre-processing: Removal of background clutter and noise.
* Segmentation: Splitting the image into regions which each contain a single character.
* Classification: Identifying the character in each region.
===Our work===
# Define this CAPTCHA crack problem mathematically and explicitly give the definition of "a success".
# Investigate what other researchers have already done including their algorithms, success rates, and time cost and selectively implement excellent ones.
# We will develop or improve an algorithm that will attempt to identify the characters in a CAPTCHA and implement our algorithm.
# We will gather CAPTCHAs from a certain website as testing samples for our algorithm.  We will record our results and compare them to the success rates and times from other methods that we have found in step 2.
==Project 5 : Rare class classification with SVM==
===By: He He===
Support Vector Machine has been proposed with sound theoretical justifications to provide generalized classifier compared to other algorithms. SVM aims to find an optimal separating plane of two data sets. Based on its inner product form, it's easy to generalized SVM to address the non-linear-separable case by using the "kernel trick". Although SVM is relatively insensitive to unbalanced distribution, it may not generate satisfactory results when the classes distribution is too skewed. For example, in financial fraud data set, most samples are from companies with good records whereas the fraud case is relatively rare; in gene classification, the harmful (e.g. indicating cancer) ones are usually small sets compared to the normal ones.
In SVM algorithm, when solving the dual problem, one of the KKT condition generates a result that implicitly requires roughly balanced distribution, which is the reason why SVM fails at rare class classification. Our work aims to slack or modify the constraints so as to handle classification under unbalanced situation.
We will test the algorithm on some benchmark data sets and also compare the result with other algorithms addressing this problem, for example, ensemble SVM etc.
==Project 6: Application of clustering in  bioinformatics: How the structure of a molecule is related to its activity?  ==
</noinclude>
===By: Mohammad Derakhshani===
In biology, by studying the interaction of molecules we can determine whether an environment is suitable for growth of a particular bacteria, or a certain drug has impact on the a protein.
Purely conducting this research in a Biology lab without having any categorised pre-knowledge, takes a lot of time and resources.
Already there are huge data sets of the structures of molecules and their activities. If we can find a relation between the two, we can guide biology Labs in their studies.
The fact that how the structure of a molecule relates to its interactions with other molecule is an open problem and is studied in several research groups from different perspectives and with different constraints. The aim of this project is to use the data set produced by Brown's biology lab, and study how the interaction of a molecule is related to its different characteristics.
</noinclude>
==Project 7 : Using Support Vector Machines in Multi-class Problems with Many Classes==
</noinclude>
===By: Trevor Bekolay===
Support vector machines are used to solve the binary classification problem. Extending SVMs to multi-class problems is an on-going challenge being approached many different ways. Most attempts to handle multi-class problems involve decomposing the problem into multiple binary classification problems. Other approaches introduce novel ways to consider all classes at the same time. I aim to test these methods in a multi-class problem with many classes, with the goal being to identify which approaches are the most scalable in terms of the number of classes being considered.
Specifically, multi-class support vector machine-based algorithms fall into four main categories:
* One-against-all binary classifiers
* One-against-one binary classifiers
* Directed acyclic graph binary classifiers
* All-together classifiers
I will begin by summarizing and identifying key features of each approach, and then select a suitable implementation of each. I will test each implementation on a data set with many classes, and reflect on the results.
<noinclude>
==Project 8: Music Classification and Segmentation ==
</noinclude>
===By: Aurélien Quévenne===
===Objective=== 
Predict the style of a music track (audio track) from its features
===Background===
MIR or Music Information Retrieval corresponds to the science of retrieving information from music (audio track).
More than classifying music, MIR can also be used for automated music identification, music analysis & knowledge representation, music recommendation...<br>
My project will be focused on the differences between various music styles, mainly classical and pop/rock music.<br>
The k-nearest neighbor algorithm is a simple algorithm which classifies an object depending of the class of its neighbors.
Decision tree based algorithm or LDA/QDA based algorithm already exist in MIR, but k-nearest neighbors algorithm are not mainly used.
===Data Description===
My data will consist of audio tracks from various music styles.<br>
Concerning the formats, two formats will be considered : MP3 and WAV.<br>
The number of track and the number of features are to be determined.<br>
===Goals===     
* Research existing literature on music classification and segmentation
* Define the features which will described an audio track
* Define training data for the learning phase
* Define test data for the classification phase
==Project 9: Kernel methods and boosting in bioinformatics ==
</noinclude>
===By: Durgesh Saraph and Nirvan Singh===
===Objective=== 
To apply kernel methods and a modified boosting algorithm to cellular microarray data in classifying carcinogenic risk factors.
===Background===
Kernel methods such as Support Vector Machines (SVM) have become an active area of research in statistical learning.  SVMs have shown significantly better generalization performance than other learning algorithms in various application domains.  The kernel trick allows SVMs to also tackle nonlinear classification problems.  SVMs have been applied in various areas of bioinformatics, but particularly in classifying gene expression and cellular microarray data. 
Boosting methods are meta-learning algorithms for improving the accuracy of a given algorithm.  They allow the aggregation of an ensemble of weak classifiers into a strong classifier. 
===Goals===     
Our goals are as follows:
*Clearly define microarray-based drug discovery as a statistical learning problem
*Research existing literature on kernel methods and boosting algorithms
*Choose an appropriate algorithm and modify or further develop it it to better suit our application
*Test our algorithm on cellular microarray data
===Data Description===
Our data will consist of microarray observations from a cell biology lab.  Specifics of dataset size, variables, and dimensionality are yet to be determined.
===Techniques===
*Support Vector Machines (SVM)
*Kernel methods
*Boosting


==Project 3 : ==
</noinclude>
</noinclude>
===By: Yao Yao, Min Chen, Zhenghui Wu, Jiaxi Liang===

Latest revision as of 08:45, 30 August 2017

Use the following format for your proposal (maximum one page)

Project 1 : How to Make a Birdhouse

By: Maseeh Ghodsi, Soroush Ghodsi and Ali Ghodsi

Write your proposal here


Project 1 : Recognizing Cheaters in Multi-Player Online Game Environment

By: Mark Stuart, Mathieu Zerter, Iulia Pargaru

Multiplayer online games constitute a very large market in the entertainment industry that generates billions in revenue.<ref> S. F. Yeung, John C. S. Lui, Jianchuan Liu, Jeff Yan, Detecting Cheaters for Multiplayer Games: Theory, Design, and Implementation </ref> Multiplayer on-line games are games in which players use characters to perform specific actions and interact with other characters. The number of online game users is rapidly increasing. Computer play-programs are often used to automatically perform actions on behalf of a human player. This type of cheating gains the player unfair advantage, abusing resources, disrupting players’ gaming experience and even harming servers.<ref>Hyungil Kim, Sungwoo Hong, Juntae Kim, Detection of Auto Programs for MMORPGs</ref> Computer play-programs usually have a specific goal or a task that is repeated often. We suspect that sequences of events and actions created by play-programs are statistically different from the sequence of events generated by a human player. We will be using an on-line game called Tibia created by CIPSoft as a study case.

We have recruited volunteers who agreed to provide us with their gaming information. We are gathering and parsing packets sent by the user to the game server that contain detailed information about the actions performed by the user. The original data consist of: User ID, length of event, time of event, action type, action details, cheating (0 or 1). The sequences of events produced by human and the play-programs will be transformed into a set of features to reveal additional information such as periodicity of events, common sequential actions, rare events or actions not performed often, creating a measure for complexity of an action. Various algorithms will be applied to classify the data represented by the set available attributes. Some similar studies suggest that the following methods perform an effective classification of human vs. machine in on-line game environment:

  • Dynamic Bayesian Network
  • Isomap
  • Desicion Tree
  • Artificial Neural Network
  • Support Vector Machines
  • K nearest neighbours
  • Naive Bayesian

We intend to find a classification algorithm that detects in-game cheating in on-line game Tibia with reasonable accuracy.


Project 2 : A modifeid CART algorithm with soft nodes

By: Jiheng Wang

The tree growing algorithms are often claimed to emulate regression approaches in their ability to handle both continuous and discrete variables. However, the treatment of continuous variables remains somewhat unsatisfactory. For example, the search of the optimal question for a continuous variables is usually reduced to the search of a cut point among all the observed values.

We know that the classical CART algorithm for generating a decision tree is known as the recursive process: given the data represented at a node, either declares that the node to be a leaf or searches for another question to use to split the data into subsets.

We will develop a modified CART algorithm and compare it to the standard tree algorithm CART on some classical data sets, which are freely available from the Internet. A natural approach to tree growing is replacing hard nodes with soft ones. To be specific, when the decision is based on a continuous variable, we apply for a probabilistic decision function instead of a simple binary split. Basically, the logistic function is a good choice of the decision function with respect to its sigmoid shape. Our first aim is to develop a efficient algorithm for computing the probabilistic decision function at every soft node. In CART, the tree grows following a greedy search to maximizes the Information Gain. Here we still use it as our criterion with a little bit of generalization. The following work will compare the performance of hard nodes and soft nodes due to the fact that soft nodes are not guaranteed to yield a better solution. Thus a strategy between the soft nodes and hard nodes, or soft trees and and hard trees should be discussed.

Project 3 : Identifying process faults of waste water treatment plant

By: Yao Yao, Min Chen, Jiaxi Liang, Zhenghui Wu

Objective

To classify the operational state of the plant in order to predict faults through the state variables of the plant at each of the stages of the treatment process.

Background Information

Liquid waste treatment plant and system operators, also known as waste water treatment plant and system operators, remove harmful pollutants from domestic and industrial liquid waste so that it is safe to return to the environment. There are four stages in the water treatment process: plant input, primary settler, secondary settler and plant output. Operators read, interpret, and adjust meters and gauges to make sure that plant equipment and processes are working properly. Operators control chemical-feeding devices, take samples of the water or waste water, perform chemical and biological laboratory analyses, and adjust the amounts of chemicals in the water. We use sensors to sample and measure water quality.

Data Description

This dataset comes from the daily measures of sensors in a urban waste water treatment plant. The data includes 527 data points and 38 variables, recording the water quality of each stage of the treatment process.

Techniques

  • Principal Component Analysis(PCA)
  • Locally Linear Embedding(LEE)
  • Isomap
  • Cluster Analysis and Conceptual Clustering
  • Linear Discriminant Analysis(LDA/FLDA)
  • Linear and Logistic Regression
  • Neural Network(NN)


Project 4: Computer character recognition of CAPTCHA

By: Weibei Li, Sabrina Bernardi, Nick Murdoch, Joycelin Karel

Background

Completely Automated Public Test to tell Computers and Humans Apart (CAPTCHA) is a challenge response test widely used over the internet to judge if a response is generated by a human being. In the area of network security, CAPTCHA has been intensively employed as a protection for servers against Deny Of Service (DOS) attacks, i.e. compromised computers, called Bots, could not pass this test, thus their flood requests will not be accepted by the server. Another typical use of CAPTCHA is in the setting of online games where cheats happen everyday. To discriminate against the group of users who are using cheating programs, CAPTCHA, a special test in which cheating programs have only negligible probability to pass, is given to each player in the game. Only those who could give correct responses would be permitted to stay in the game. In addition, to defeat attackers who are interested in exhaustively searching the passwords of certain email accounts, service providers such as Gmail, Hotmail, and Yahoo!, always use CAPTCHA as one of the most signifcant security mechanisms.

The design of CATPCHA is an art. Roughly speaking, the information produced by a CAPTCHA system, regardless of its form (images of distorted letters or images of certain kind of puzzles), should satisfy the following requirements:

  1. Current computer programs are unable to solve them accurately in polynomial time.
  2. Most humans can easily solve then in a short period of time.
  3. Does not rely on the type of CAPTCHA being new to the attacker.

Challenge

It is clear that modern CAPTCHA are made in such a way that computers cannot decode it. Thus another hot research topic is how to crack a CAPTCHA system. Many hackers around the world have tried to use computers to decipher these images. To the best of our knowledge, the success rate remains very low. Given some examples, in February 2008 it was reported that spammers had achieved a success rate of 30% to 35% in responding to CAPTCHAs for Microsoft's Live Mail service and a success rate of 20% against Google's Gmail CAPTCHA.

A number of research projects have attempted (often with success) to beat visual CAPTCHAs by creating programs that contain the following functionality [1]:

  • Pre-processing: Removal of background clutter and noise.
  • Segmentation: Splitting the image into regions which each contain a single character.
  • Classification: Identifying the character in each region.

Our work

  1. Define this CAPTCHA crack problem mathematically and explicitly give the definition of "a success".
  2. Investigate what other researchers have already done including their algorithms, success rates, and time cost and selectively implement excellent ones.
  3. We will develop or improve an algorithm that will attempt to identify the characters in a CAPTCHA and implement our algorithm.
  4. We will gather CAPTCHAs from a certain website as testing samples for our algorithm. We will record our results and compare them to the success rates and times from other methods that we have found in step 2.

Project 5 : Rare class classification with SVM

By: He He

Support Vector Machine has been proposed with sound theoretical justifications to provide generalized classifier compared to other algorithms. SVM aims to find an optimal separating plane of two data sets. Based on its inner product form, it's easy to generalized SVM to address the non-linear-separable case by using the "kernel trick". Although SVM is relatively insensitive to unbalanced distribution, it may not generate satisfactory results when the classes distribution is too skewed. For example, in financial fraud data set, most samples are from companies with good records whereas the fraud case is relatively rare; in gene classification, the harmful (e.g. indicating cancer) ones are usually small sets compared to the normal ones.

In SVM algorithm, when solving the dual problem, one of the KKT condition generates a result that implicitly requires roughly balanced distribution, which is the reason why SVM fails at rare class classification. Our work aims to slack or modify the constraints so as to handle classification under unbalanced situation.

We will test the algorithm on some benchmark data sets and also compare the result with other algorithms addressing this problem, for example, ensemble SVM etc.

Project 6: Application of clustering in bioinformatics: How the structure of a molecule is related to its activity?

By: Mohammad Derakhshani

In biology, by studying the interaction of molecules we can determine whether an environment is suitable for growth of a particular bacteria, or a certain drug has impact on the a protein. Purely conducting this research in a Biology lab without having any categorised pre-knowledge, takes a lot of time and resources.

Already there are huge data sets of the structures of molecules and their activities. If we can find a relation between the two, we can guide biology Labs in their studies. The fact that how the structure of a molecule relates to its interactions with other molecule is an open problem and is studied in several research groups from different perspectives and with different constraints. The aim of this project is to use the data set produced by Brown's biology lab, and study how the interaction of a molecule is related to its different characteristics.



Project 7 : Using Support Vector Machines in Multi-class Problems with Many Classes

By: Trevor Bekolay

Support vector machines are used to solve the binary classification problem. Extending SVMs to multi-class problems is an on-going challenge being approached many different ways. Most attempts to handle multi-class problems involve decomposing the problem into multiple binary classification problems. Other approaches introduce novel ways to consider all classes at the same time. I aim to test these methods in a multi-class problem with many classes, with the goal being to identify which approaches are the most scalable in terms of the number of classes being considered.

Specifically, multi-class support vector machine-based algorithms fall into four main categories:

  • One-against-all binary classifiers
  • One-against-one binary classifiers
  • Directed acyclic graph binary classifiers
  • All-together classifiers

I will begin by summarizing and identifying key features of each approach, and then select a suitable implementation of each. I will test each implementation on a data set with many classes, and reflect on the results.


Project 8: Music Classification and Segmentation

By: Aurélien Quévenne

Objective

Predict the style of a music track (audio track) from its features

Background

MIR or Music Information Retrieval corresponds to the science of retrieving information from music (audio track). More than classifying music, MIR can also be used for automated music identification, music analysis & knowledge representation, music recommendation...
My project will be focused on the differences between various music styles, mainly classical and pop/rock music.
The k-nearest neighbor algorithm is a simple algorithm which classifies an object depending of the class of its neighbors. Decision tree based algorithm or LDA/QDA based algorithm already exist in MIR, but k-nearest neighbors algorithm are not mainly used.

Data Description

My data will consist of audio tracks from various music styles.
Concerning the formats, two formats will be considered : MP3 and WAV.
The number of track and the number of features are to be determined.

Goals

  • Research existing literature on music classification and segmentation
  • Define the features which will described an audio track
  • Define training data for the learning phase
  • Define test data for the classification phase


Project 9: Kernel methods and boosting in bioinformatics

By: Durgesh Saraph and Nirvan Singh

Objective

To apply kernel methods and a modified boosting algorithm to cellular microarray data in classifying carcinogenic risk factors.

Background

Kernel methods such as Support Vector Machines (SVM) have become an active area of research in statistical learning. SVMs have shown significantly better generalization performance than other learning algorithms in various application domains. The kernel trick allows SVMs to also tackle nonlinear classification problems. SVMs have been applied in various areas of bioinformatics, but particularly in classifying gene expression and cellular microarray data.

Boosting methods are meta-learning algorithms for improving the accuracy of a given algorithm. They allow the aggregation of an ensemble of weak classifiers into a strong classifier.

Goals

Our goals are as follows:

  • Clearly define microarray-based drug discovery as a statistical learning problem
  • Research existing literature on kernel methods and boosting algorithms
  • Choose an appropriate algorithm and modify or further develop it it to better suit our application
  • Test our algorithm on cellular microarray data

Data Description

Our data will consist of microarray observations from a cell biology lab. Specifics of dataset size, variables, and dimensionality are yet to be determined.

Techniques

  • Support Vector Machines (SVM)
  • Kernel methods
  • Boosting