statf09841Proposal: Difference between revisions

From statwiki
Jump to navigation Jump to search
Line 27: Line 27:
<noinclude>
<noinclude>


==Project 2 : ==
==Project 2 : A modifeid CART algorighm with soft nodes==
</noinclude>
</noinclude>
===By: Jiheng Wang===
===By: Jiheng Wang===
The tree growing algorithms are often claimed to emulate regression approaches in their ability to handle both continuous and discrete variables. However, the treatment of continuous variables remains somewhat unsatisfactory. For example, the search of the optimal question for a continuous variables is usually reduced to the search of a cut point among all the observed values.


We know that the classical CART algorithm for generating a decision tree is known as the recursive process: given the data represented at a node, either declares that the node to be a leaf or searches for another question to use to split the data into subsets.
We will develop a modified CART algorithm and compare it to the standard tree algorithm CART on some classical data sets, which are freely available from the Internet. A natural approach to tree growing is replacing hard nodes with soft ones. To be specific, when the decision is based on a continuous variable, we apply for a probabilistic decision function instead of a simple binary split. Basically, the logistic function is a good choice of the decision function with respect to its sigmoid shape. Our first aim is to develop a efficient algorithm for computing the probabilistic decision function at every soft node. In CART, the tree grows following a greedy search to maximizes the Information Gain. Here we still use it as our criterion with a little bit of generalization. The following work will compare the performance of hard nodes and soft nodes due to the fact that soft nodes are not guaranteed to yield a better solution. Thus a strategy between the soft nodes and hard nodes, or soft trees and and hard trees should be discussed.


==Project 3 : Identifying process faults of waste water treatment plant ==
==Project 3 : Identifying process faults of waste water treatment plant ==

Revision as of 20:48, 29 October 2009

Use the following format for your proposal (maximum one page)

Project 1 : How to Make a Birdhouse

By: Maseeh Ghodsi, Soroush Ghodsi and Ali Ghodsi

Write your proposal here


Project 1 : Recognizing Cheaters in Multi-Player Online Game Environment

By: Mark Stuart, Mathieu Zerter, Iulia Pargaru

Multiplayer online games constitute a very large market in the entertainment industry that generates billions in revenue.<ref> S. F. Yeung, John C. S. Lui, Jianchuan Liu, Jeff Yan, Detecting Cheaters for Multiplayer Games: Theory, Design, and Implementation </ref> Multiplayer on-line games are games in which players use characters to perform specific actions and interact with other characters. The number of online game users is rapidly increasing. Computer play-programs are often used to automatically perform actions on behalf of a human player. This type of cheating gains the player unfair advantage, abusing resources, disrupting players’ gaming experience and even harming servers.<ref>Hyungil Kim, Sungwoo Hong, Juntae Kim, Detection of Auto Programs for MMORPGs</ref> Computer play-programs usually have a specific goal or a task that is repeated often. We suspect that sequences of events and actions created by play-programs are statistically different from the sequence of events generated by a human player. We will be using an on-line game called Tibia created by CIPSoft as a study case.

We have recruited volunteers who agreed to provide us with their gaming information. We are gathering and parsing packets sent by the user to the game server that contain detailed information about the actions performed by the user. The original data consist of: User ID, length of event, time of event, action type, action details, cheating (0 or 1). The sequences of events produced by human and the play-programs will be transformed into a set of features to reveal additional information such as periodicity of events, common sequential actions, rare events or actions not performed often, creating a measure for complexity of an action. Various algorithms will be applied to classify the data represented by the set available attributes. Some similar studies suggest that the following methods perform an effective classification of human vs. machine in on-line game environment:

  • Dynamic Bayesian Network
  • Isomap
  • Desicion Tree
  • Artificial Neural Network
  • Support Vector Machines
  • K nearest neighbours
  • Naive Bayesian

We intend to find a classification algorithm that detects in-game cheating in on-line game Tibia with reasonable accuracy.


Project 2 : A modifeid CART algorighm with soft nodes

By: Jiheng Wang

The tree growing algorithms are often claimed to emulate regression approaches in their ability to handle both continuous and discrete variables. However, the treatment of continuous variables remains somewhat unsatisfactory. For example, the search of the optimal question for a continuous variables is usually reduced to the search of a cut point among all the observed values.

We know that the classical CART algorithm for generating a decision tree is known as the recursive process: given the data represented at a node, either declares that the node to be a leaf or searches for another question to use to split the data into subsets.

We will develop a modified CART algorithm and compare it to the standard tree algorithm CART on some classical data sets, which are freely available from the Internet. A natural approach to tree growing is replacing hard nodes with soft ones. To be specific, when the decision is based on a continuous variable, we apply for a probabilistic decision function instead of a simple binary split. Basically, the logistic function is a good choice of the decision function with respect to its sigmoid shape. Our first aim is to develop a efficient algorithm for computing the probabilistic decision function at every soft node. In CART, the tree grows following a greedy search to maximizes the Information Gain. Here we still use it as our criterion with a little bit of generalization. The following work will compare the performance of hard nodes and soft nodes due to the fact that soft nodes are not guaranteed to yield a better solution. Thus a strategy between the soft nodes and hard nodes, or soft trees and and hard trees should be discussed.

Project 3 : Identifying process faults of waste water treatment plant

By: Yao Yao, Min Chen, Jiaxi Liang, Zhenghui Wu

Objective

To classify the operational state of the plant in order to predict faults through the state variables of the plant at each of the stages of the treatment process.

Background Information

Liquid waste treatment plant and system operators, also known as waste water treatment plant and system operators, remove harmful pollutants from domestic and industrial liquid waste so that it is safe to return to the environment. There are four stages in the water treatment process: plant input, primary settler, secondary settler and plant output. Operators read, interpret, and adjust meters and gauges to make sure that plant equipment and processes are working properly. Operators control chemical-feeding devices, take samples of the water or waste water, perform chemical and biological laboratory analyses, and adjust the amounts of chemicals in the water. We use sensors to sample and measure water quality.

Data Description

This dataset comes from the daily measures of sensors in a urban waste water treatment plant. The data includes 527 data points and 38 variables, recording the water quality of each stage of the treatment process.

Techniques

  • Principal Component Analysis(PCA)
  • Locally Linear Embedding(LEE)
  • Isomap
  • Cluster Analysis and Conceptual Clustering
  • Linear Discriminant Analysis(LDA/FLDA)
  • Linear and Logistic Regression
  • Neural Network(NN)