statf09841Proposal
Revision as of 23:22, 29 October 2009
Use the following format for your proposal (maximum one page)
Project 1 : How to Make a Birdhouse
By: Maseeh Ghodsi, Soroush Ghodsi and Ali Ghodsi
Write your proposal here
Project 1 : Recognizing Cheaters in Multi-Player Online Game Environment
By: Mark Stuart, Mathieu Zerter, Iulia Pargaru
Multiplayer online games constitute a very large market in the entertainment industry that generates billions in revenue.<ref> S. F. Yeung, John C. S. Lui, Jianchuan Liu, Jeff Yan, Detecting Cheaters for Multiplayer Games: Theory, Design, and Implementation </ref> In these games, players use characters to perform specific actions and interact with other characters, and the number of users is rapidly increasing. Computer play-programs are often used to automatically perform actions on behalf of a human player. This type of cheating gives the player an unfair advantage, abuses resources, disrupts other players’ gaming experience and can even harm servers.<ref>Hyungil Kim, Sungwoo Hong, Juntae Kim, Detection of Auto Programs for MMORPGs</ref> Play-programs usually have a specific goal or a task that is repeated often. We therefore suspect that the sequences of events and actions created by play-programs are statistically different from those generated by a human player. We will use Tibia, an online game created by CIPSoft, as a case study.
We have recruited volunteers who agreed to provide us with their gaming information. We are gathering and parsing the packets sent by each user to the game server, which contain detailed information about the actions the user performs. The original data consist of: user ID, length of event, time of event, action type, action details, and a cheating label (0 or 1). The sequences of events produced by human players and by play-programs will be transformed into a set of features that reveal additional information, such as the periodicity of events, common sequences of actions, rare events or actions, and a measure of the complexity of an action. Various algorithms will then be applied to classify the data represented by this set of attributes. Similar studies suggest that the following methods can effectively separate human from machine players in an online game environment:
- Dynamic Bayesian Network
- Isomap
- Decision Tree
- Artificial Neural Network
- Support Vector Machines
- K nearest neighbours
- Naive Bayesian
We intend to find a classification algorithm that detects in-game cheating in the online game Tibia with reasonable accuracy.
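As an illustration of the feature step described above, the following minimal sketch turns a sequence of event times and action types into two candidate features: the regularity of the gaps between events and the counts of consecutive action pairs. The function names, the choice of features, and the sample data are assumptions made for this example; they are not the feature set the project will necessarily use.

```python
from collections import Counter
from statistics import mean, pstdev


def interval_cv(timestamps):
    """Coefficient of variation of the gaps between consecutive events.

    Values near 0 mean the events are almost perfectly periodic, which is
    what one might expect from a play-program (an assumption to be tested).
    """
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if len(gaps) < 2:
        raise ValueError("need at least three events")
    avg = mean(gaps)
    if avg == 0:
        raise ValueError("all events share the same timestamp")
    return pstdev(gaps) / avg


def action_bigram_counts(actions):
    """Count how often each pair of consecutive action types occurs."""
    return Counter(zip(actions, actions[1:]))


# Hypothetical sequences: a perfectly regular one and an irregular one.
bot_times = [0.0, 2.0, 4.0, 6.0, 8.0]
human_times = [0.0, 1.0, 4.0, 5.0, 9.0]
bot_cv = interval_cv(bot_times)
human_cv = interval_cv(human_times)
pairs = action_bigram_counts(["attack", "loot", "attack", "loot"])
```

Features of this kind would form the columns of the table passed to any of the classifiers listed above, with the cheating label as the target.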
Project 2 : A modified CART algorithm with soft nodes
By: Jiheng Wang
Tree-growing algorithms are often claimed to emulate regression approaches in their ability to handle both continuous and discrete variables. However, the treatment of continuous variables remains somewhat unsatisfactory. For example, the search for the optimal question on a continuous variable is usually reduced to the search for a cut point among the observed values.
The classical CART algorithm generates a decision tree by a recursive process: given the data at a node, it either declares the node to be a leaf or searches for another question with which to split the data into subsets.
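The cut-point search for a single continuous variable can be sketched as follows. This is a minimal illustration, assuming an entropy-based information gain and the rule "x <= cut goes left"; the function names and the toy data are hypothetical and do not come from any particular CART implementation.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_cut(values, labels):
    """Try each observed value as a cut point and return (cut, gain).

    Samples with x <= cut go to the left child, the rest to the right.
    Returns (None, 0.0) if no cut improves on the parent node.
    """
    parent = entropy(labels)
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, 0.0)
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # cannot separate equal values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        gain = parent - (i / n) * entropy(left) - ((n - i) / n) * entropy(right)
        if gain > best[1]:
            best = (lo, gain)
    return best


cut, gain = best_cut([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1])
```

On this toy data the two classes are separated perfectly by the cut at 2.0, so the gain equals the full parent entropy of 1 bit.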
We will develop a modified CART algorithm and compare it with standard CART on some classical data sets that are freely available on the Internet. A natural modification of tree growing is to replace hard nodes with soft ones. Specifically, when the decision is based on a continuous variable, we apply a probabilistic decision function instead of a simple binary split. The logistic function is a good choice of decision function because of its sigmoid shape. Our first aim is to develop an efficient algorithm for computing the probabilistic decision function at every soft node. In CART, the tree is grown by a greedy search that maximizes the information gain; we keep this criterion, slightly generalized. We will then compare the performance of hard nodes and soft nodes, since soft nodes are not guaranteed to yield a better solution. A strategy for choosing between soft and hard nodes, or between soft and hard trees, should therefore also be discussed.
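One way to picture a soft node is sketched below: each sample is sent to the right child with a probability given by a logistic function of its value, and the information gain is computed from these fractional memberships. The parametrization by a cut point and a scale, and this particular generalization of the gain, are assumptions made for illustration only; the proposal does not fix either choice.

```python
import math


def gate(x, cut, scale):
    """Logistic probability that a sample with value x goes to the right child."""
    z = (x - cut) / scale
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # this branch avoids overflow for very negative z
    return ez / (1.0 + ez)


def weighted_entropy(class_weights):
    """Entropy (in bits) of a class distribution given as fractional counts."""
    total = sum(class_weights.values())
    if total == 0:
        return 0.0
    h = 0.0
    for w in class_weights.values():
        if w > 0:
            h -= (w / total) * math.log2(w / total)
    return h


def soft_gain(values, labels, cut, scale):
    """Information gain when every sample is shared between both children."""
    parent, left, right = {}, {}, {}
    for x, y in zip(values, labels):
        p = gate(x, cut, scale)
        parent[y] = parent.get(y, 0.0) + 1.0
        right[y] = right.get(y, 0.0) + p
        left[y] = left.get(y, 0.0) + (1.0 - p)
    n = len(values)
    w_left, w_right = sum(left.values()), sum(right.values())
    return (weighted_entropy(parent)
            - (w_left / n) * weighted_entropy(left)
            - (w_right / n) * weighted_entropy(right))


xs, ys = [1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1]
sharp = soft_gain(xs, ys, cut=2.5, scale=0.01)   # behaves almost like a hard split
smooth = soft_gain(xs, ys, cut=2.5, scale=10.0)  # nearly uninformative gate
```

As the scale shrinks the gate approaches a step function and the soft gain approaches the gain of the hard split, which is one way to compare the two kinds of node on the same footing.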
Project 3 : Identifying process faults of waste water treatment plant
By: Yao Yao, Min Chen, Jiaxi Liang, Zhenghui Wu
Objective
To classify the operational state of the plant, using the state variables measured at each stage of the treatment process, in order to predict faults.
Background Information
Liquid waste treatment plant and system operators, also known as waste water treatment plant and system operators, remove harmful pollutants from domestic and industrial liquid waste so that it is safe to return to the environment. There are four stages in the water treatment process: plant input, primary settler, secondary settler and plant output. Operators read, interpret, and adjust meters and gauges to make sure that plant equipment and processes are working properly. Operators control chemical-feeding devices, take samples of the water or waste water, perform chemical and biological laboratory analyses, and adjust the amounts of chemicals in the water. We use sensors to sample and measure water quality.
Data Description
This dataset comes from the daily sensor measurements in an urban waste water treatment plant. It contains 527 data points and 38 variables, which record the water quality at each stage of the treatment process.
Techniques
- Principal Component Analysis (PCA)
- Locally Linear Embedding (LLE)
- Isomap
- Cluster Analysis and Conceptual Clustering
- Linear Discriminant Analysis (LDA/FLDA)
- Linear and Logistic Regression
- Neural Network (NN)
Project 4: Computer character recognition of CAPTCHA
By: Weibei Li, Sabrina Bernardi, Nick Murdoch, Joycelin
Background
A Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA) is a challenge-response test widely used on the Internet to judge whether a response is generated by a human being. In network security, CAPTCHAs are used heavily to protect servers against Denial of Service (DoS) attacks: compromised computers, called bots, cannot pass the test, so their flood of requests is not accepted by the server. Another typical use is in online games, where cheating happens every day. To screen out users who run cheating programs, each player is given a CAPTCHA, which such programs have only a negligible probability of passing, and only players who respond correctly are permitted to stay in the game. In addition, to defeat attackers who try to exhaustively search the passwords of email accounts, service providers such as Gmail, Hotmail, and Yahoo! use CAPTCHAs as one of their most significant security mechanisms.
The design of a CAPTCHA is an art. Roughly speaking, the challenges produced by a CAPTCHA system, regardless of their form (images of distorted letters or images of certain kinds of puzzles), should satisfy the following requirements:
- Current computer programs are unable to solve them accurately in polynomial time.
- Most humans can easily solve them in a short period of time.
- Their security does not rely on the type of CAPTCHA being new to the attacker.
Challenge
Modern CAPTCHAs are designed so that computers cannot decode them, and how to crack a CAPTCHA system has therefore become a hot research topic. Many hackers around the world have tried to use computers to decipher these images. To the best of our knowledge, the success rate remains low. For example, in February 2008 it was reported that spammers had achieved a success rate of 30% to 35% against the CAPTCHAs of Microsoft's Live Mail service and a success rate of 20% against Google's Gmail CAPTCHA.
A number of research projects have attempted (often with success) to beat visual CAPTCHAs by creating programs that contain the following functionality [1]:
- Pre-processing: Removal of background clutter and noise.
- Segmentation: Splitting the image into regions which each contain a single character.
- Classification: Identifying the character in each region.
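The three stages above can be sketched on a toy example. This is only a minimal illustration under strong assumptions: the image is a small grid of grey levels, characters never touch, and each character matches a stored template of the same size. The function names and the two templates are hypothetical; real CAPTCHAs are distorted precisely so that such simple steps fail.

```python
def binarize(image, threshold=128):
    """Pre-processing: map grey levels to 1 (ink) or 0 (background)."""
    return [[1 if px < threshold else 0 for px in row] for row in image]


def segment(binary):
    """Segmentation: split the image at columns that contain no ink."""
    width = len(binary[0])
    ink_cols = [any(row[c] for row in binary) for c in range(width)]
    spans, start = [], None
    for c, has_ink in enumerate(ink_cols):
        if has_ink and start is None:
            start = c
        elif not has_ink and start is not None:
            spans.append((start, c))
            start = None
    if start is not None:
        spans.append((start, width))
    return [[row[a:b] for row in binary] for a, b in spans]


def classify(glyph, templates):
    """Classification: pick the template with the fewest mismatched pixels."""
    def distance(a, b):
        if len(a) != len(b) or len(a[0]) != len(b[0]):
            return float("inf")  # different size: treated as no match
        return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return min(templates, key=lambda ch: distance(glyph, templates[ch]))


def solve(image, templates):
    """Run the three stages and return the recognized string."""
    return "".join(classify(g, templates) for g in segment(binarize(image)))


# Toy templates and a 3x4 image showing "L", a blank column, then "I".
TEMPLATES = {"I": [[1], [1], [1]], "L": [[1, 0], [1, 0], [1, 1]]}
IMAGE = [
    [0, 255, 255, 0],
    [0, 255, 255, 0],
    [0, 0, 255, 0],
]
answer = solve(IMAGE, TEMPLATES)
```

In practice each stage would be replaced by something much stronger, for example noise filtering in the first stage and a trained classifier in the third.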
Our work
- Define the CAPTCHA cracking problem mathematically and state explicitly what counts as "a success".
- Investigate what other researchers have already done, including their algorithms, success rates, and time costs, and selectively implement the best ones.
- Develop or improve an algorithm that attempts to identify the characters in a CAPTCHA, and implement it.
- Gather CAPTCHAs from a chosen website as test samples for our algorithm, record our results, and compare them with the success rates and running times of the other methods found in our review of existing work.
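One possible way to make "a success" concrete, offered here as an assumption rather than as the definition the project will adopt, is to count an attempt as successful only when every character of the answer is correct, and to report the fraction of successful attempts over the test set. The function names and sample strings below are hypothetical.

```python
def is_success(predicted, truth):
    """An attempt succeeds only if the whole string matches exactly."""
    return predicted == truth


def success_rate(predictions, truths):
    """Fraction of CAPTCHAs solved completely."""
    if len(predictions) != len(truths) or not truths:
        raise ValueError("need two non-empty lists of equal length")
    solved = sum(is_success(p, t) for p, t in zip(predictions, truths))
    return solved / len(truths)


# Hypothetical results on three test CAPTCHAs: the second has one wrong character.
rate = success_rate(["ab3d", "xk9q", "m2np"], ["ab3d", "xk9p", "m2np"])
```

Under this definition a single wrong character makes the whole attempt fail, which matches how a website would treat the response; a per-character accuracy could be reported alongside it as a diagnostic.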