F21-STAT 441/841 CM 763-Proposal — statwiki (wiki.math.uwaterloo.ca), revision by Gtompkin, 2020-11-24T15:58:55Z
<hr />
<div>Use this format (Don’t remove Project 0)<br />
<br />
Project # 0 Group members:<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Title: Making a String Telephone<br />
<br />
Description: We use paper cups to make a string phone and talk with friends while learning about sound waves with this science project. (Explain your project in one or two paragraphs).<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 1 Group members:'''<br />
<br />
Song, Quinn<br />
<br />
Loh, William<br />
<br />
Bai, Junyue<br />
<br />
Choi, Phoebe<br />
<br />
'''Title:''' APTOS 2019 Blindness Detection<br />
<br />
'''Description:'''<br />
<br />
Our team chose the APTOS 2019 Blindness Detection Challenge from Kaggle. The goal of this challenge is to build a machine learning model that detects diabetic retinopathy by screening retina images.<br />
<br />
Millions of people suffer from diabetic retinopathy, the leading cause of blindness among working-aged adults. It is caused by damage to the blood vessels of the light-sensitive tissue at the back of the eye (retina). In rural areas where medical screening is difficult to conduct, it is challenging to detect the disease efficiently. Aravind Eye Hospital hopes to utilize machine learning techniques to gain the ability to automatically screen images for disease and provide information on how severe the condition may be.<br />
<br />
Our team plans to solve this problem by applying our knowledge in image processing and classification.<br />
<br />
<br />
----<br />
<br />
'''Project # 2 Group members:'''<br />
<br />
Li, Dylan<br />
<br />
Li, Mingdao<br />
<br />
Lu, Leonie<br />
<br />
Sharman, Bharat<br />
<br />
'''Title:''' Risk prediction in life insurance industry using supervised learning algorithms<br />
<br />
'''Description:'''<br />
<br />
In this project, we aim to replicate and possibly improve upon the work of Jayabalan et al. in their paper “Risk prediction in life insurance industry using supervised learning algorithms”. We will be using the Prudential Life Insurance Data Set that the authors used and have shared with us. We will pre-process the data to replace missing values, perform feature selection using CFS and feature reduction using PCA, and then use the processed data to perform classification via four algorithms: Neural Networks, Random Tree, REPTree, and Multiple Linear Regression. We will compare the performance of these algorithms using the MAE and RMSE metrics and produce visualizations that can explain the results clearly even to a non-quantitative audience. <br />
<br />
Our goal in this project is to learn to apply the algorithms we studied in class to an industry dataset and to produce results that aid better, data-driven decision making.<br />
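As a concrete sketch of the comparison step, the two error metrics named above can be computed as follows (a minimal stdlib-only illustration with made-up numbers; the actual models and data handling are not shown):<br />

```python
import math

def mae(y_true, y_pred):
    # Mean Absolute Error: average magnitude of the prediction errors.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    # Root Mean Squared Error: penalizes large errors more heavily than MAE.
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

y_true = [3.0, 5.0, 2.0, 7.0]   # hypothetical risk labels
y_pred = [2.5, 5.0, 4.0, 8.0]   # hypothetical model outputs
print(mae(y_true, y_pred))      # 0.875
print(rmse(y_true, y_pred))     # ≈ 1.1456
```

Reporting both is useful because a large gap between RMSE and MAE signals a few large errors rather than uniformly mediocre predictions.<br />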
<br />
----<br />
<br />
'''Project # 3 Group members:'''<br />
<br />
Parco, Russel<br />
<br />
Sun, Scholar<br />
<br />
Yao, Jacky<br />
<br />
Zhang, Daniel<br />
<br />
'''Title:''' Lyft Motion Prediction for Autonomous Vehicles<br />
<br />
'''Description:''' <br />
<br />
Our team has decided to participate in the Lyft Motion Prediction for Autonomous Vehicles Kaggle competition. The aim of this competition is to build a model which, given a set of objects on the road (pedestrians, other cars, etc.), predicts the future movement of these objects.<br />
<br />
Autonomous vehicles (AVs) are expected to dramatically redefine the future of transportation. However, there are still significant engineering challenges to be solved before one can fully realize the benefits of self-driving cars. One such challenge is building models that reliably predict the movement of traffic agents around the AV, such as cars, cyclists, and pedestrians.<br />
<br />
Our aim is to apply classification techniques learned in class to optimally predict how these objects move.<br />
<br />
----<br />
<br />
'''Project # 4 Group members:'''<br />
<br />
Chow, Jonathan<br />
<br />
Dharani, Nyle<br />
<br />
Nasirov, Ildar<br />
<br />
'''Title:''' Classification with Abstinence<br />
<br />
'''Description:''' <br />
<br />
We seek to implement the algorithm described in [https://papers.nips.cc/paper/9247-deep-gamblers-learning-to-abstain-with-portfolio-theory.pdf Deep Gamblers: Learning to Abstain with Portfolio Theory]. The paper describes augmenting classification problems to include the option of abstaining from making a prediction when confidence is low.<br />
<br />
Medical imaging diagnostics is a field in which classification could assist professionals and improve life expectancy for patients through increased accuracy. However, there are also severe consequences to incorrect predictions. As such, we also hope to apply the algorithm implemented to the classification of medical images, specifically instances of normal and pneumonia [https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia? chest x-rays]. <br />
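As we read the Deep Gamblers paper, abstention is obtained by adding one extra "abstain" output and training with a gambler's loss: betting on the true class pays a fixed payoff, while abstaining returns the stake. A stdlib-only sketch of that loss under our reading (not the authors' code; `payoff` is the paper's odds hyperparameter):<br />

```python
import math

def gamblers_loss(probs, label, payoff=2.5):
    # probs: m class probabilities plus one abstention probability (last entry).
    # Betting `payoff` * p_true plus the abstained mass enters the log, so
    # confident wrong answers are punished hardest while uncertain inputs
    # can route mass to abstention and pay little.
    abstain = probs[-1]
    return -math.log(payoff * probs[label] + abstain)

# Uncertain input that mostly abstains vs. a confident wrong prediction:
print(gamblers_loss([0.3, 0.3, 0.4], 0, payoff=2.0))  # ≈ 0.0
print(gamblers_loss([0.1, 0.7, 0.2], 0, payoff=2.0))  # ≈ 0.916
```

At prediction time, a sample would be rejected whenever the abstention output exceeds a chosen threshold, which is where the medical-imaging use case benefits.<br />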
<br />
----<br />
<br />
'''Project # 5 Group members:'''<br />
<br />
Jones, Hayden<br />
<br />
Leung, Michael<br />
<br />
Haque, Bushra<br />
<br />
Mustatea, Cristian<br />
<br />
'''Title:''' Combine Convolution with Recurrent Networks for Text Classification<br />
<br />
'''Description:''' <br />
<br />
Our team chose to reproduce the paper [https://arxiv.org/pdf/2006.15795.pdf Combine Convolution with Recurrent Networks for Text Classification] on arXiv. The goal of this paper is to combine CNN and RNN architectures in a way that blends the outputs of the two architectures more flexibly than simple concatenation, through the use of a “neural tensor layer”, for the purpose of improving text classification. In particular, the paper claims that this novel architecture excels at the following types of text classification: sentiment analysis, news categorization, and topical classification. Our team plans to recreate the paper by working in two pairs: one pair will implement the CNN pipeline and the other the RNN pipeline. We will work with TensorFlow 2 and Google Colab, reproducing the paper’s experimental results by training on the same 6 publicly available datasets used in the paper.<br />
<br />
----<br />
<br />
'''Project # 6 Group members:'''<br />
<br />
Chin, Ruixian<br />
<br />
Ong, Jason<br />
<br />
Chiew, Wen Cheen<br />
<br />
Tan, Yan Kai<br />
<br />
'''Title:''' Mechanisms of Action (MoA) Prediction<br />
<br />
'''Description:''' <br />
<br />
Our team chose to participate in the Kaggle research challenge "Mechanisms of Action (MoA) Prediction". The Broad Institute of MIT and Harvard, the Laboratory for Innovation Science at Harvard (LISH), and the NIH Common Fund's Library of Integrated Network-Based Cellular Signatures (LINCS) present this challenge with the goal of advancing drug development through improvements to MoA prediction algorithms.<br />
----<br />
<br />
'''Project # 7 Group members:'''<br />
<br />
Ren, Haotian <br />
<br />
Cheung, Ian Long Yat<br />
<br />
Hussain, Swaleh <br />
<br />
Zahid, Bin, Haris <br />
<br />
'''Title:''' Transaction Fraud Detection <br />
<br />
'''Description:''' <br />
<br />
Protecting people from fraudulent transactions is an important topic for all banks and internet security companies. This Kaggle project is based on a dataset from the IEEE Computational Intelligence Society (IEEE-CIS). Our objective is to build a more efficient model that recognizes fraudulent transactions with higher accuracy and at higher speed.<br />
----<br />
<br />
'''Project # 8 Group members:'''<br />
<br />
ZiJie, Jiang<br />
<br />
Yawen, Wang<br />
<br />
DanMeng, Cui<br />
<br />
MingKang, Jiang<br />
<br />
'''Title:''' Lyft Motion Prediction for Autonomous Vehicles <br />
<br />
'''Description:'''<br />
<br />
Our team chose to participate in the Kaggle challenge "Lyft Motion Prediction for Autonomous Vehicles". We will apply our data science skills to build motion prediction models for self-driving vehicles: the goal of the competition is to predict the trajectories of the traffic agents around the AV, such as cars, cyclists, and pedestrians.<br />
<br />
----------------------------------------------------------------------<br />
<br />
<br />
'''Project # 9 Group members:'''<br />
<br />
Banno, Dion <br />
<br />
Battista, Joseph<br />
<br />
Kahn, Solomon <br />
<br />
'''Title:''' Increasing Spotify user engagement through predictive personalization<br />
<br />
'''Description:''' <br />
<br />
Our project is an application of classification to the domain of predictive personalization. The goal of the project is to increase Spotify user engagement through data-driven methods. Given a set of users’ demographic data, listening preferences and behaviour, our goal is to build a recommendation system that suggests new songs to users. From a potential pool of songs to suggest, the final song recommendations will be driven by a classification algorithm that measures a given user’s propensity to like a song. We plan on leveraging the Spotify Web API to gather data about songs and collecting user data from consenting peers.<br />
<br />
<br />
-----------------------------------------------------------------------<br />
<br />
'''Project # 10 Group members:'''<br />
<br />
Qing, Guo <br />
<br />
Wang, Yuanxin<br />
<br />
James, Ni<br />
<br />
Xueguang, Ma<br />
<br />
'''Title:''' Mechanisms of Action (MoA) Prediction<br />
<br />
'''Description:''' <br />
<br />
Our team has decided to participate in the Mechanisms of Action (MoA) Prediction Kaggle competition. This is a challenge with the goal of advancing drug development through improvements to MoA prediction algorithms.<br />
Our team plans to develop an algorithm to predict a compound’s MoA given its cellular signature, and our goal is to learn the various algorithms taught in this course.<br />
<br />
<br />
-----------------------------------------------------------------------<br />
<br />
'''Project # 11 Group members:'''<br />
<br />
Yang, Jiwon <br />
<br />
Mahdi, Anas<br />
<br />
Thibault, Will<br />
<br />
Lau, Jan<br />
<br />
'''Title:''' Application of classification in human fatigue analysis<br />
<br />
'''Description:''' <br />
<br />
The goal of this project is to classify different levels of fatigue based on motion capture (Vicon) and force plate data. First, we plan to have 4 to 6 participants perform squats (with or without weights) and rate themselves on a fatigue scale, with each participant doing at least 50 to 100 reps. We will collect data with EMG, IMU, force plates, and Vicon. While the participants are squatting, we will ask them about their fatigue level and compare their feedback against the fatigue level recorded by EMG. The fatigue level will be on a scale of 1 to 10 (1 being not fatigued at all and 10 being unable to continue). Once the data is collected, we will classify the motion capture and force plate data into the different levels of fatigue.<br />
<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 12 Group members:'''<br />
<br />
Xiaolan Xu, <br />
<br />
Robin Wen, <br />
<br />
Yue Weng, <br />
<br />
Beizhen Chang<br />
<br />
'''Title:''' Identification (Classification) of Submillimetre Galaxies Based on Multiwavelength Data in Astronomy<br />
<br />
'''Description:''' <br />
<br />
Identifying the counterparts of submillimetre galaxies (SMGs) in multiwavelength images is important to the study of galaxy evolution in astronomy. However, obtaining a statistically significant sample of robust associations is very challenging because of the poor angular resolution of single-dish submm facilities: we cannot tell which galaxy is actually responsible for the submillimetre emission among a group of possible candidates. Recently, a labelled dataset was obtained from ALMA, a millimetre/submillimetre telescope array with sufficient resolution to pin down the exact source of submillimetre emission. However, applying such an array to a large fraction of the sky is not feasible, so it is of practical interest to develop algorithms that identify SMGs based on the other available data. With this newly labelled dataset from ALMA, it is possible to test and develop new algorithms and apply them to unlabelled data to detect submillimetre galaxies.<br />
<br />
In our work, we primarily build on the work of Liu et al. (https://arxiv.org/abs/1901.09594), who tested a set of standard classification algorithms on the dataset. We aim to first reproduce their work and test other classification algorithms from a more statistics-centered perspective. Next, we hope to extend their work in one or more of the following directions: (1) incorporating other relevant features to augment the dimensions of the available dataset for a better classification rate; (2) taking measurement error into account in the classification algorithms, possibly via a Bayesian approach (all features in astronomy datasets come from actual physical measurements, which come with error bars, but it is not clear how to incorporate this error into the classification task); (3) the possibility of combining some traditional astronomy approaches with algorithms from ML.<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 13 Group members:'''<br />
<br />
<br />
Zihui (Betty) Qin,<br />
<br />
Wenqi (Maggie) Zhao,<br />
<br />
Muyuan Yang,<br />
<br />
Amartya (Marty) Mukherjee,<br />
<br />
'''Title:''' Insider Trading Role Classification on United States Conventional Stock or Non-Derivative Transactions<br />
<br />
'''Description:'''<br />
<br />
Background (why we were interested in classifying based on insiders): <br />
The United States is one of the most frequently traded financial markets in the world. The dataset captures all insider activities as reported on SEC (U.S. Securities and Exchange Commission) forms 3, 4, 5, and 144. We believe that using variables (such as transaction date, security type, and transaction amount), we could predict the roles code for a new transaction. The reason for the chosen prediction is that the role of the insider gives investors signals of potential internal activities and private information. This is crucial for investors to detect important market signals from those insider trading activities, such that they could benefit from the market. <br />
<br />
Goal: To classify the role of an insider in a company based on the data of their trades.<br />
<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 14 Group members:'''<br />
<br />
Jung, Kyle<br />
<br />
Kim, Dae Hyun<br />
<br />
Lee, Stan<br />
<br />
Lim, Seokho<br />
<br />
'''Title:''' Mechanisms of Action (MoA) Prediction Competition<br />
<br />
'''Description:''' The main objective of this Kaggle competition is to help develop an algorithm that predicts a compound's MoA given its cellular signature, helping scientists advance the drug discovery process. Our execution plan is to apply concepts and algorithms learned in STAT 441 to this multi-label classification problem. Through the process, our team will learn the biological background necessary to complete and refine our classification approach. https://www.kaggle.com/c/lish-moa<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 15 Group Members:'''<br />
<br />
Li, Evan<br />
<br />
Abuaisha, Karam<br />
<br />
Vadivelu, Nicholas<br />
<br />
Pu, Jason<br />
<br />
'''Title:''' Predict Students Answering Ability Kaggle Competition<br />
<br />
'''Description:'''<br />
<br />
https://www.kaggle.com/c/riiid-test-answer-prediction<br />
We plan on tackling this Kaggle competition that revolves around classifying whether students are able to answer their next questions correctly. The data provided consists of the student’s historic performance, the performance of other students on the same question, metadata about the question itself, and more. The theme of the competition is to tailor education to a student’s ability as an AI tutor.<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 16 Group members:'''<br />
<br />
Hall, Matthew<br />
<br />
Chalaturnyk, Johnathan<br />
<br />
'''Title:''' Predicting CO and NOx emissions from gas turbines: novel data and a benchmark PEMS<br />
<br />
'''Description:'''<br />
<br />
Predictive emission monitoring systems (PEMS) are used in conjunction with measurement instruments to predict the amount of emissions produced by gas turbine engines. The implementation of such a system relies on the availability of proper measurements and ecological data points. We will attempt to adjust the novel PEMS implementation from this paper in the hope of improving the prediction of CO and NOx emission levels from the turbines. Using data points collected over the previous five years, we will apply a number of machine learning algorithms and discuss possible future research areas. Finally, we will compare our methods against the benchmark presented in the paper in order to measure the effectiveness of our solutions.<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 17 Group members:'''<br />
<br />
Yang, Junyi<br />
<br />
Wang, Jill Yu Chieh<br />
<br />
Wu, Yu Min<br />
<br />
Li, Calvin<br />
<br />
'''Title:''' Humpback Whale Identification<br />
<br />
'''Description:'''<br />
<br />
Our team will participate in the Kaggle challenge Humpback Whale Identification. The main objective is to build a multi-class classification model that identifies individual whales from images of their tails. There are over 3000 classes and 25361 training images in total. The challenge is that each class has only about 8 training images on average. <br />
<br />
------------------------------------------------------------------------<br />
'''Project # 18 Group members:''' <br />
<br />
Lian, Jinjiang <br />
<br />
Zhu, Yisheng <br />
<br />
Huang, Mingzhe <br />
<br />
Hou, Jiawen <br />
<br />
'''Title:''' Mechanisms of Action (MoA) Prediction <br />
<br />
'''Description:''' <br />
<br />
The final project of our team is the ongoing Kaggle competition Mechanisms of Action (MoA) Prediction. The goal is to improve MoA prediction algorithms to assist and advance drug development. MoA prediction helps scientists design more targeted drug molecules based on the biological mechanism of a disease, which would significantly shorten the drug development cycle. Here, MoA is determined by applying different drugs to human cells and analyzing the corresponding responses; the dataset provides simultaneous measurements of 100 types of human cells across 5000 drugs. <br />
<br />
To tackle this competition, after data cleaning and feature engineering, we are going to try a selection of ML algorithms such as logistic regression, tree-based methods, and SVM, and find the one that best completes the task. Depending on how we perform, we might also use techniques such as model ensembling or stacking.<br />
<br />
------------------------------------------------------------------------<br />
'''Project # 19 Group members:''' <br />
<br />
Fagan, Daniel <br />
<br />
Brooke, Cooper <br />
<br />
Perelman, Maya <br />
<br />
'''Title:''' Mechanisms of Action (MoA) Prediction (https://www.kaggle.com/c/lish-moa/overview/description)<br />
<br />
'''Description:''' <br />
<br />
For our final project, we will be competing in the Mechanisms of Action (MoA) Prediction research challenge on Kaggle. MoA refers to a description of the biological activity of a given molecule, and scientists have a specific interest in the MoA of molecules as it pertains to the advancement of drugs: under new frameworks, scientists are looking to develop molecules that can modulate protein targets associated with given diseases. Our task will be to analyze a dataset containing human cellular responses to more than 5,000 drugs and to classify these responses with one or more MoA labels.<br />
<br />
For this competition, we plan to use various classification algorithms taught in STAT 441 followed by model validation techniques to ultimately select the most accurate model based on the logarithmic loss function which was specified by Kaggle.<br />
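The selection criterion named above, the logarithmic loss, can be illustrated for one label column as follows (a stdlib-only sketch; the competition's multi-label score averages this over all MoA columns, and the `eps` clipping is our assumption mirroring common scoring practice):<br />

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    # Binary cross-entropy averaged over samples; probabilities are clipped
    # away from 0 and 1 so that log() never diverges.
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += t * math.log(p) + (1 - t) * math.log(1 - p)
    return -total / len(y_true)

print(log_loss([1, 0, 1], [0.9, 0.1, 0.8]))  # ≈ 0.1446
```

Because the loss grows without bound for confident wrong answers, model validation under log loss rewards well-calibrated probabilities, not just correct rankings.<br />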
<br />
------------------------------------------------------------------------<br />
'''Project # 20 Group members:''' <br />
Cheng, Leyan<br />
<br />
Dai, Mingyan<br />
<br />
Jiang, Daniel <br />
<br />
Huang, Jerry<br />
<br />
'''Title:''' Riiid! Answer Correctness Prediction<br />
<br />
'''Description:'''<br />
<br />
We will be competing in the Riiid! Kaggle Challenge. The goal of this challenge is to create algorithms for "Knowledge Tracing," the modeling of student knowledge over time, and to accurately predict how students will perform on future interactions.<br />
<br />
We plan on using the classification techniques and model validation techniques learned in the course in order to design an algorithm that can accurately predict the actions of students.<br />
<br />
------------------------------------------------------------------------<br />
'''Project # 21 Group members:''' <br />
<br />
Carson, Emilee<br />
<br />
Ellmen, Isaac<br />
<br />
Mohammadrezaei, Dorsa<br />
<br />
Budaraju, Sai Arvind<br />
<br />
<br />
'''Title:''' Classifying SARS-CoV-2 region of origin based on DNA/RNA sequence<br />
<br />
'''Description:'''<br />
<br />
Determining the location of origin for a viral sequence is an important tool for epidemiological tracking. Knowing where a virus comes from allows epidemiologists to track how a virus is spreading. There are significant efforts to track the spread of SARS-CoV-2. As an RNA virus, SARS-CoV-2 mutates frequently. Most of these mutations carry negligible changes to the function of the virus but act as “barcodes” for specific strains. As the virus spreads in a region, it picks up mutations which allow researchers to identify which sequences correspond to which regions.<br />
<br />
The standard method for classifying viruses based on location is to:<br />
<br />
- Perform a multiple sequence alignment (MSA)<br />
<br />
- Build a phylogenetic tree of the MSA<br />
<br />
- Empirically determine which regions have which sections of the tree<br />
<br />
Phylogenetic trees are an excellent tool for tracking evolutionary changes over time but we wonder if there are better methods for classifying the region of origin for a virus using machine learning techniques.<br />
<br />
Our plan is to perform PCA on the MSA which is available through GISAID. We will determine an appropriate encoding for sequence alignments to vectors and map the aligned sequences onto a much lower dimensional space. We will then use LDA or QDA to classify points based on region (continent). We will also examine if the same technique works well for classifying sequences based on state of origin for samples from the United States. We may try other classification techniques such as logistic regression or neural nets. Finally, we know that projecting data to a small number of principal components and then projecting back to the original space can reduce noise in certain datasets. In the case of mutations, this might correspond to removing insignificant mutations. It is possible that there are certain mutations which induce functional changes in the virus which would be of greater medical interest. Our hope is that we could detect these using PCA.<br />
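Since the encoding of aligned sequences to vectors is still to be determined, one hypothetical choice is a per-position one-hot encoding over the nucleotide alphabet plus the gap character, so an alignment of length L maps to a 5L-dimensional vector ready for PCA (a sketch of that option, not a decided design):<br />

```python
# Gaps ('-') are treated as a fifth symbol so all aligned rows share a length.
ALPHABET = "ACGT-"

def one_hot(aligned_seq):
    # Each aligned position becomes a one-hot block over ALPHABET;
    # concatenating the blocks gives the feature vector for PCA.
    vec = []
    for base in aligned_seq:
        block = [0] * len(ALPHABET)
        block[ALPHABET.index(base)] = 1
        vec.extend(block)
    return vec

print(one_hot("AC-"))  # [1,0,0,0,0, 0,1,0,0,0, 0,0,0,0,1]
```

Under this encoding, a point mutation changes exactly two coordinates, so principal components naturally pick out the co-occurring mutations that act as the regional "barcodes" described above.<br />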
<br />
------------------------------------------------------------------------<br />
'''Project # 22 Group members:''' <br />
<br />
Chang, Luwen<br />
<br />
Yu, Qingyang<br />
<br />
Kong, Tao <br />
<br />
Sun, Tianrong<br />
<br />
'''Title:''' Riiid! Answer Correctness Prediction<br />
<br />
'''Description:'''<br />
<br />
For the final project, we chose the featured Kaggle Competition named Riiid! Answer Correctness Prediction. The purpose of this challenge is to build a machine learning model to predict the students' interaction performance. (https://www.kaggle.com/c/riiid-test-answer-prediction)<br />
<br />
We plan to use classification and regression techniques learned in this course to build the model and use area under ROC curve to evaluate our model.<br />
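The evaluation metric mentioned above, area under the ROC curve, has a simple rank interpretation we can compute directly: it is the probability that a randomly chosen positive is scored above a randomly chosen negative (stdlib-only sketch with toy data; real submissions would use a library implementation):<br />

```python
def roc_auc(y_true, scores):
    # Count, over all positive/negative pairs, how often the positive is
    # ranked higher (ties count half) — this equals the area under the ROC curve.
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(roc_auc([1, 0, 1, 0], [0.8, 0.3, 0.4, 0.5]))  # 0.75
```

This pairwise view makes clear why AUC suits the task: it rewards correctly ranking which interactions a student is likelier to answer correctly, independent of any probability threshold.<br />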
<br />
------------------------------------------------------------------------<br />
'''Project # 23 Group members:''' <br />
<br />
Han, Jihoon<br />
<br />
Vera De Casey<br />
<br />
Jawad Solaiman<br />
<br />
'''Title:''' Lyft Motion Prediction for Autonomous Vehicles<br />
<br />
'''Description:'''<br />
<br />
We are planning to compete in the Lyft Motion Prediction for Autonomous Vehicles challenge on Kaggle. Our goal is to build a motion prediction model for a self-driving car by applying our machine learning knowledge to the provided training and testing data sets. The model will predict the motion of traffic agents around the car, such as cars, cyclists, and pedestrians. We are not sure whether we have to classify the agents into the three categories (cars, cyclists, pedestrians) ourselves; if so, we will initially start with a single-shot detector algorithm and improve upon it.<br />
<br />
------------------------------------------------------------------------<br />
'''Project # 24 Group members:''' <br />
<br />
Guanting Pan<br />
<br />
Haocheng Chang <br />
<br />
Zaiwei Zhang<br />
<br />
'''Title:''' Reproducing result in Accelerated Stochastic Power Iteration<br />
<br />
'''Description:'''<br />
<br />
For our final project, we will reproduce the stochastic PCA algorithm designed by De Sa, He, Mitliagkas, Ré, and Xu, which accelerates the iteration complexity of power iteration. By doing so, we aim to achieve a final rate of 𝒪(1/sqrt(Δ)) in our reproduction. We also hope to explore and discuss the potential of applying such an acceleration method to other non-convex optimization problems, as mentioned in the original paper, if there is additional time to do so. Link to the paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6557638/pdf/nihms-993807.pdf<br />
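For orientation, plain deterministic power iteration, the baseline whose iteration count depends on the eigengap Δ between the top two eigenvalues, can be sketched as follows (a stdlib illustration of the baseline only; the paper's contribution is an accelerated, stochastic variant of this loop):<br />

```python
import math

def power_iteration(A, iters=100):
    # Repeatedly apply A and renormalize; the iterate converges to the
    # leading eigenvector whenever the spectral gap Δ is positive, at a
    # rate governed by the ratio of the top two eigenvalues.
    n = len(A)
    v = [1.0] * n
    for _ in range(iters):
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

A = [[2.0, 0.0],
     [0.0, 1.0]]          # eigenvalues 2 and 1, so Δ is large here
print(power_iteration(A))  # ≈ [1.0, 0.0]
```

When Δ is small, this loop needs on the order of 1/Δ iterations, which is exactly the dependence the paper's momentum-based acceleration improves to 1/sqrt(Δ).<br />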
<br />
------------------------------------------------------------------------<br />
'''Project # 25 Group members:''' <br />
<br />
Haoran Dong<br />
<br />
Mushi Wang<br />
<br />
Siyuan Qiu<br />
<br />
Yan Yu<br />
<br />
'''Title:''' Lyft Motion Prediction for Autonomous Vehicles<br />
<br />
'''Description:'''<br />
<br />
We want to participate in the Kaggle challenge "Lyft Motion Prediction for Autonomous Vehicles". The goal is to build a motion prediction model for a self-driving car by applying machine learning to the datasets provided.<br />
<br />
------------------------------------------------------------------------<br />
'''Project # 26 Group members:''' <br />
<br />
Sangeeth Kalaichanthiran<br />
<br />
Evan Peters<br />
<br />
Cynthia Mou<br />
<br />
Yuxin Wang<br />
<br />
'''Title:''' Mechanisms of Action (MoA) Prediction<br />
<br />
'''Description:'''<br />
<br />
Our team chose the "Mechanisms of Action (MoA) Prediction" challenge on Kaggle. Mechanism of action, MoA for short, describes the biological response of human cells to a particular molecule (the drug). The goal is to develop an algorithm that can predict the biological response to a drug based on its similarities to other known drugs. <br />
<br />
Our team hopes to develop a superior algorithm by using our knowledge of supervised learning methods.<br />
<br />
------------------------------------------------------------------------<br />
'''Project # 27 Group members:''' <br />
<br />
Delaney Smith<br />
<br />
Mohammad Assem Mahmoud<br />
<br />
'''Title:''' Replicating "Electrocardiogram heartbeat classification based on a deep convolutional neural network and focal loss"<br />
<br />
'''Description:'''<br />
<br />
For our project, we intend to replicate and, hopefully, extend the work of Romdhane et al.’s 2020 paper “Electrocardiogram heartbeat classification based on a deep convolutional neural network and focal loss”. In this paper, the authors develop a deep convolutional neural network that exploits a novel loss function, focal loss, to classify heartbeats into five arrhythmia categories (N, S, V, Q and F) based on the AAMI standard. The network was trained and tested on two ECG datasets, MIT-BIH and INCART, and achieved a 98.41% overall accuracy, a 98.38% overall F1-score, a 98.37% overall precision and a 98.41% overall recall, which we intend to replicate. <br />
Interestingly, focal loss was implemented to prevent bias towards the larger class (normal heartbeats) without needing to augment the smaller class (diseased heartbeats); however, the authors did not determine which method actually performs better. We therefore hope to extend their work by answering this question in this project.<br />
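A minimal sketch of the focal loss for a single binary prediction (our own stdlib illustration with the usual focusing parameter gamma, not the authors' code):<br />

```python
import math

def focal_loss(p, y, gamma=2.0):
    # Focal loss: the (1 - p_t)^gamma factor scales down the loss of
    # well-classified examples, so rare classes (e.g. diseased heartbeats)
    # contribute relatively more to the gradient than under cross-entropy.
    p_t = p if y == 1 else 1 - p
    return -((1 - p_t) ** gamma) * math.log(p_t)

# A confident correct prediction is down-weighted far more than an
# uncertain one, which is what counteracts class imbalance:
print(focal_loss(0.9, 1))  # ≈ 0.00105
print(focal_loss(0.6, 1))  # ≈ 0.0817
```

Setting gamma = 0 recovers plain cross-entropy, which makes the comparison question raised above (focal loss versus data augmentation) easy to frame as an ablation.<br />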
------------------------------------------------------------------------<br />
'''Project # 28 Group members:''' <br />
<br />
Fang Yuqin<br />
<br />
Fu Rao<br />
<br />
Li Siqi<br />
<br />
Zhou Zeping<br />
<br />
'''Title:''' The Spectrum of the Fisher Information Matrix of a Single-Hidden-Layer Neural Network<br />
<br />
'''Description:'''<br />
Our group aims to investigate the single-hidden-layer neural network further, building on what we have learned in class. We will focus on data and weights that follow Gaussian distributions, so that we can provide expressions for the spectrum in the limit of infinite width. We believe the spectrum can be used to improve the efficiency of first-order optimization methods. <br />
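As background for the spectrum we are after, the Fisher information matrix of a network with parameters θ is the standard expectation of outer products of score vectors (a textbook definition, stated here for reference rather than taken from the paper):<br />

```latex
F(\theta) = \mathbb{E}_{x,y}\left[\nabla_\theta \log p(y \mid x;\theta)\,\nabla_\theta \log p(y \mid x;\theta)^{\top}\right]
```

The limiting spectral density of F(θ) as the width goes to infinity is what controls the conditioning, and hence the step sizes, available to first-order optimizers.<br />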
------------------------------------------------------------------------<br />
'''Project # 29 Group members:''' <br />
<br />
Rui Gong<br />
<br />
Xuetong Wang<br />
<br />
Xinqi Ling<br />
<br />
Di Ma<br />
<br />
'''Title:''' Riiid! Answer Correctness Prediction<br />
<br />
'''Description:'''<br />
<br />
We will take on the "Riiid! Answer Correctness Prediction" Kaggle competition. We will predict a student's performance on a particular question based on their historical performance, the performance of other students on the same question, and information about the question itself (such as its difficulty and length). https://www.kaggle.com/c/riiid-test-answer-prediction/overview<br />
------------------------------------------------------------------------<br />
'''Project # 30 Group members:''' <br />
<br />
Jiabao Dong<br />
<br />
Jiaxiang Liu<br />
<br />
Siyuan Xia<br />
<br />
Yipeng Du<br />
<br />
'''Title:''' Privacy-Preserving Classification of Personal Text Messages with Secure Multi-Party Computation<br />
<br />
'''Description:'''<br />
We aim to replicate the work demonstrated in [https://papers.nips.cc/paper/8632-privacy-preserving-classification-of-personal-text-messages-with-secure-multi-party-computation.pdf Privacy-Preserving Classification of Personal Text Messages with Secure Multi-Party Computation]. <br />
<br />
Personal text classification has many useful applications such as mental health care and security surveillance, but also raises concerns about personal privacy. The method proposed in this paper is based on Secure Multiparty Computation (SMC) and avoids (un)intentional privacy violations. The method then extracts features from texts and classifies with logistic regression and tree ensembles. This paper claims to have proposed the first privacy-preserving (PP) solution for text classification that is provably secure.<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 31 Group members:''' <br />
<br />
Tompkins, Grace<br />
<br />
Krikella, Tatiana<br />
<br />
'''Title:''' A comparison of machine learning algorithms and covariate balance measures for propensity score matching and weighting (2018) <br />
'''Description:'''<br />
We will be reproducing the results of "A comparison of machine learning algorithms and covariate balance measures for propensity score matching and weighting" by Cannas and Arpino (2018) and applying the results to a new dataset, Right Heart Catheterization (RHC) which includes data from the Study to Understand Prognoses and Preferences for Outcomes and Risks of Treatments (SUPPORT), for comparison. This paper uses simulated data and several machine learning algorithms to estimate causal effects in observational studies. The machine learning methods used include CART, Bagging, Boosting, Random Forest, Neural Networks, and Naive Bayes. There are also several variations of measures of covariate balancing used in the study. The importance of tuning the machine learning algorithms' hyperparameters is also investigated with respect to propensity score estimation. <br />
<br />
<br />
We will use R for analysis.<br />
<br />
Link to paper: [http://papers.nips.cc/paper/8520-adapting-neural-networks-for-the-estimation-of-treatment-effects]<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 32 Group members:''' <br />
<br />
Taohao Wang<br />
Zeren Shen<br />
Zihao Guo<br />
Rui Chen<br />
<br />
'''Title:''' Google Landmark Recognition 2020<br />
<br />
'''Description:'''<br />
Our team decided to attempt the "Google Landmark Recognition 2020" Kaggle competition, in which competitors are asked to build a model that detects any existing landmarks within provided test images. This competition is challenging in its own way: the data contains more than 81K classes, on which a traditional CNN would very likely fail (too many parameters to train, especially when taking convolutional layers into account). We would like to implement several algorithms/frameworks that can utilize a large amount of data with noisy labels, apply them to the provided dataset, and compare their performance (training time, number of parameters trained, multiple metrics for accuracy/loss evaluation, etc.) for our report.<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 33 Group members:''' <br />
<br />
Hansa Halim<br />
<br />
Sanjana Rajendra Naik<br />
<br />
Samka Marfua<br />
<br />
Shawrupa Proshasty<br />
<br />
'''Title:''' Superhuman AI for multiplayer poker (Brown and Sandholm 2019)<br />
<br />
'''Description:'''<br />
Our team aims to recreate the paper “Superhuman AI for multiplayer poker” by Noam Brown and Tuomas Sandholm. The paper describes the algorithm the authors used to train an AI to play poker, primarily using Monte Carlo CFR. Poker is a good example for training AI with incomplete information, and since it is a multiplayer game, training the AI is further complicated. The authors use abstraction, both information abstraction and action abstraction, to reduce the number of distinct actions the AI must consider.<br />
We aim to replicate this algorithm for at least two players to begin with.<br />
<br />
Link to paper: [https://www.cs.cmu.edu/~noamb/papers/19-Science-Superhuman.pdf Paper]</div>Gtompkinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Yktan&diff=45718User:Yktan2020-11-22T18:41:15Z<p>Gtompkin: /* Conclusion */</p>
<hr />
<div>== Presented by == <br />
Ruixian Chin, Yan Kai Tan, Jason Ong, Wen Cheen Chiew<br />
<br />
== Introduction ==<br />
<br />
Much of the success in training deep neural networks (DNNs) is thanks to the collection of large datasets with human-annotated labels. However, such labelling is time-consuming and expensive, especially for data that requires expertise, such as medical data. Moreover, some datasets will be noisy due to the biases introduced by different annotators.<br />
<br />
There are a few existing approaches to use datasets with noisy labels. In learning with noisy labels (LNL), most methods take a loss correction approach. Another approach to reduce annotation cost is semi-supervised learning (SSL), where the training data consists of labeled and unlabeled samples.<br />
<br />
This paper introduces DivideMix, which combines approaches from LNL and SSL. One unique thing about DivideMix is that it discards sample labels that are highly likely to be noisy and leverages these noisy samples as unlabeled data instead. This prevents the model from overfitting and improves generalization performance. Key contributions of this work are:<br />
1) Co-divide, which trains two networks simultaneously, with the aim of improving generalization and avoiding confirmation bias.<br />
2) During SSL phase, an improvement is made on an existing method (MixMatch) by combining it with another method (MixUp).<br />
3) Significant improvements to state-of-the-art results on multiple conditions are experimentally shown while using DivideMix. Extensive ablation study and qualitative results are also shown to examine the effect of different components.<br />
<br />
== Motivation ==<br />
<br />
While much has been achieved in training DNNs with noisy labels and SSL methods individually, not much progress has been made in exploring their underlying connections and building on top of the two approaches simultaneously. <br />
<br />
Existing LNL methods aim to correct the loss function by:<br />
<ol><br />
<li> Treating all samples equally and correcting loss explicitly or implicitly through relabeling of the noisy samples<br />
<li> Reweighting training samples or separating clean and noisy samples, which results in correction of the loss function<br />
</ol><br />
<br />
A few examples of LNL methods include:<br />
<ol><br />
<li> Estimating the noise transition matrix to correct the loss function<br />
<li> Leveraging DNNs to correct labels and using them to modify the loss<br />
<li> Reweighting samples so that noisy labels contribute less to the loss<br />
</ol><br />
<br />
However, these methods have their own downsides. For example, it is very challenging to correctly estimate the noise transition matrix in the first method; for the second method, DNNs tend to overfit to datasets with high noise ratio; for the third method, we need to be able to identify clean samples, which has also proven to be challenging.<br />
<br />
On the other hand, SSL methods mostly leverage unlabeled data using regularization to improve model performance. A recently proposed method, MixMatch incorporates the two classes of regularization – consistency regularization and entropy minimization as well as MixUp regularization. <br />
<br />
DivideMix partially adopts LNL in that it removes the labels that are highly likely to be noisy by using co-divide to avoid the confirmation bias problem. It then utilizes the noisy samples as unlabeled data and adopts an improved version of MixMatch (SSL) which accounts for the label noise during the label co-refinement and co-guessing phase. By incorporating SSL techniques into LNL and taking the best of both worlds, DivideMix aims to produce highly promising results in training DNNs by better addressing the confirmation bias problem, more accurately distinguishing and utilizing noisy samples and performing well under high levels of noise.<br />
<br />
== Model Architecture ==<br />
<br />
DivideMix leverages semi-supervised learning to achieve effective modelling. The sample is first split into a labelled set and an unlabeled set. This is achieved by fitting a Gaussian Mixture Model to the per-sample loss distribution. The unlabeled set is made up of data points whose labels were discarded as likely noisy. Then, to avoid confirmation bias, which is typical when a model is self-training, two models are trained simultaneously to filter errors for each other. This is done by dividing the data using one model and then training the other model on that division. This algorithm, known as co-divide, keeps the two networks from converging during training, which avoids the bias from occurring. Figure 1 describes the algorithm in graphical form.<br />
<br />
[[File:ModelArchitecture.PNG | center]]<br />
<br />
<div align="center">Figure 1: Model Architecture of DivideMix</div><br />
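The co-divide split described above can be sketched in a few lines. This is a minimal illustration, not the paper's released code: the function name <code>divide_by_loss</code>, the threshold name <code>tau</code>, and the toy loss values are all assumptions for demonstration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def divide_by_loss(losses, tau=0.5):
    """Fit a 2-component GMM to per-sample losses and flag as 'clean'
    the samples whose posterior under the low-mean component exceeds tau."""
    losses = np.asarray(losses, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(losses)
    # The component with the smaller mean is assumed to model clean samples.
    clean_component = int(np.argmin(gmm.means_.ravel()))
    posterior_clean = gmm.predict_proba(losses)[:, clean_component]
    return posterior_clean > tau, posterior_clean

# Toy usage: low-loss samples go to the labelled (clean) set,
# high-loss samples to the unlabeled (noisy) set.
losses = [0.1, 0.2, 0.15, 3.0, 2.8, 0.12, 3.2]
is_clean, w = divide_by_loss(losses)
```

The posterior `w` doubles as the per-sample clean-probability weight used later in co-refinement.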
<br />
For each epoch, the network divides the dataset into a labelled set consisting of clean data, and an unlabeled set consisting of noisy data, which is then used as training data for the other network, where training is done in mini-batches. For each batch of the labelled samples, co-refinement is performed by using the ground truth label <math> y_b </math>, the predicted label <math> p_b </math>, and the posterior is used as the weight, <math> w_b </math>. <br />
<br />
<center><math> \bar{y}_b = w_b y_b + (1-w_b) p_b </math></center> <br />
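The co-refinement step is just a convex combination of the ground-truth label and the prediction, weighted by the clean-probability. A minimal sketch (function name and toy values are illustrative, not from the paper's code):

```python
import numpy as np

def co_refine(y_b, p_b, w_b):
    """Blend the ground-truth label y_b with the network prediction p_b,
    weighted by the posterior clean-probability w_b."""
    return w_b * np.asarray(y_b) + (1.0 - w_b) * np.asarray(p_b)

y_b = np.array([1.0, 0.0, 0.0])        # one-hot ground-truth label
p_b = np.array([0.6, 0.3, 0.1])        # network prediction
y_bar = co_refine(y_b, p_b, w_b=0.8)   # -> [0.92, 0.06, 0.02]
```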
<br />
Then, a sharpening function is implemented on this weighted sum to produce the estimate, <math> \hat{y}_b </math>. Using all these predicted labels, the unlabeled samples will then be assigned a "co-guessed" label, which should produce a more accurate prediction. Having calculated all these labels, MixMatch is applied to the combined mini-batch of labeled, <math> \hat{X} </math> and unlabeled data, <math> \hat{U} </math>, where, for a pair of samples and their labels, one new sample and new label is produced. More specifically, for a pair of samples <math> (x_1,x_2) </math> and their labels <math> (p_1,p_2) </math>, the mixed sample <math> (x',p') </math> is:<br />
<br />
<center><br />
<math><br />
\begin{alignat}{2}<br />
<br />
\lambda &\sim Beta(\alpha, \alpha) \\<br />
\lambda ' &= max(\lambda, 1 - \lambda) \\<br />
x' &= \lambda ' x_1 + (1 - \lambda ' ) x_2 \\<br />
p' &= \lambda ' p_1 + (1 - \lambda ' ) p_2 \\<br />
<br />
\end{alignat}<br />
</math><br />
</center> <br />
<br />
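The MixUp equations above can be sketched directly. This is an illustrative snippet, assuming α = 4 as in the paper's CIFAR experiments; the function name and toy inputs are not from the original code.

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup(x1, p1, x2, p2, alpha=4.0):
    """Sample lambda ~ Beta(alpha, alpha), set lambda' = max(lambda, 1 - lambda),
    and form convex combinations of both inputs and labels."""
    lam = rng.beta(alpha, alpha)
    lam = max(lam, 1.0 - lam)  # lambda' >= 0.5 biases the mix toward (x1, p1)
    x_mixed = lam * x1 + (1.0 - lam) * x2
    p_mixed = lam * p1 + (1.0 - lam) * p2
    return x_mixed, p_mixed

x1, p1 = np.ones(4), np.array([1.0, 0.0])
x2, p2 = np.zeros(4), np.array([0.0, 1.0])
x_m, p_m = mixup(x1, p1, x2, p2)
```

Because lambda' is at least 0.5, the first sample of each pair always dominates the mix.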
MixMatch transforms <math> \hat{X} </math> and <math> \hat{U} </math> into <math> X' </math> and <math> U' </math>. Then, the loss on <math> X' </math>, <math> L_X </math> (Cross-entropy loss) and the loss on <math> U' </math>, <math> L_U </math> (Mean Squared Error) are calculated. A regularization term, <math> L_{reg} </math>, is introduced to regularize the model's average output across all samples in the mini-batch. Then, the total loss is calculated as:<br />
<br />
<center><math> L = L_X + \lambda_u L_U + \lambda_r L_{reg}, </math></center><br />
<br />
where <math> \lambda_r </math> is set to 1, and <math> \lambda_u </math> is used to control the unsupervised loss.<br />
<br />
Lastly, the stochastic gradient descent formula is updated with the calculated loss, <math> L </math>, and the estimated parameters, <math> \boldsymbol{ \theta } </math>.<br />
<br />
== Results ==<br />
'''Applications'''<br />
<br />
There are four datasets: CIFAR-10 and CIFAR-100 (Krizhevsky & Hinton, 2009) (both contain 50K training images and 10K test images of size 32 × 32), Clothing1M (Xiao et al., 2015), and WebVision (Li et al., 2017a).<br />
Two types of label noise are used in the experiments: symmetric and asymmetric.<br />
An 18-layer PreAct ResNet (He et al., 2016) is trained using SGD with a momentum of 0.9, a weight decay of 0.0005, and a batch size of 128. The network is trained for 300 epochs. The initial learning rate is set to 0.02 and reduced by a factor of 10 after 150 epochs. The warm-up period is 10 epochs for CIFAR-10 and 30 epochs for CIFAR-100. For all CIFAR experiments, the same hyperparameters M = 2, T = 0.5, and α = 4 are used. τ is set to 0.5, except at the 90% noise ratio, where it is set to 0.6.<br />
<br />
<br />
'''Comparison of State-of-the-Art Methods'''<br />
<br />
The effectiveness of DivideMix was shown by comparing the test accuracy with the most recent state-of-the-art methods: <br />
Meta-Learning (Li et al., 2019) proposes a gradient-based method to find model parameters that are more noise-tolerant; <br />
Joint-Optim (Tanaka et al., 2018) and P-correction (Yi & Wu, 2019) jointly optimize the sample labels and the network parameters;<br />
M-correction (Arazo et al., 2019) models sample loss with a BMM and applies MixUp.<br />
The following are the results on CIFAR-10 and CIFAR-100 with different levels of symmetric label noise ranging from 20% to 90%. Both the best test accuracy across all epochs and the averaged test accuracy over the last 10 epochs were recorded in the following table:<br />
<br />
<br />
[[File:divideMixtable1.PNG | center]]<br />
<br />
From Table 1, the authors note that none of these methods consistently outperforms the others across different datasets. M-correction excels at symmetric noise, whereas Meta-Learning performs better for asymmetric noise. DivideMix outperforms the state-of-the-art methods by a large margin across all noise ratios. The improvement is substantial (∼10% in accuracy) for the more challenging CIFAR-100 with high noise ratios.<br />
<br />
DivideMix was also compared with the state-of-the-art methods on the other two datasets, Clothing1M and WebVision. The results show that DivideMix consistently outperforms state-of-the-art methods across all datasets with different types of label noise. For WebVision, DivideMix achieves more than a 12% improvement in top-1 accuracy. <br />
<br />
<br />
'''Ablation Study'''<br />
<br />
The authors remove different components of DivideMix to provide insight into what makes it successful. The results in Table 5 are analyzed as follows.<br />
<br />
<br />
[[File:DivideMixtable5.PNG | center]]<br />
<br />
The authors find that both label refinement and input augmentation are beneficial for DivideMix.<br />
<br />
== Conclusion ==<br />
<br />
This paper has provided a new and effective algorithm for learning with noisy labels by leveraging SSL. The DivideMix method trains two networks simultaneously and utilizes co-guessing and co-refinement effectively, making it a robust approach to dealing with noise in datasets. DivideMix has also been tested on various datasets, with results consistently among the best when compared to other advanced methods.<br />
<br />
Future work on DivideMix includes adapting it to other applications such as Natural Language Processing, and further incorporating the ideas of SSL and LNL into the DivideMix architecture.<br />
<br />
== References ==<br />
Eric Arazo, Diego Ortego, Paul Albert, Noel E. O’Connor, and Kevin McGuinness. Unsupervised<br />
label noise modeling and loss correction. In ICML, pp. 312–321, 2019.<br />
<br />
David Berthelot, Nicholas Carlini, Ian J. Goodfellow, Nicolas Papernot, Avital Oliver, and Colin<br />
Raffel. Mixmatch: A holistic approach to semi-supervised learning. NeurIPS, 2019.<br />
<br />
Yifan Ding, Liqiang Wang, Deliang Fan, and Boqing Gong. A semi-supervised two-stage approach<br />
to learning from noisy labels. In WACV, pp. 1215–1224, 2018.</div>Gtompkinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=F21-STAT_441/841_CM_763-Proposal&diff=45355F21-STAT 441/841 CM 763-Proposal2020-11-19T15:52:22Z<p>Gtompkin: </p>
<hr />
<div>Use this format (Don’t remove Project 0)<br />
<br />
Project # 0 Group members:<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Title: Making a String Telephone<br />
<br />
Description: We use paper cups to make a string phone and talk with friends while learning about sound waves with this science project. (Explain your project in one or two paragraphs).<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 1 Group members:'''<br />
<br />
Song, Quinn<br />
<br />
Loh, William<br />
<br />
Bai, Junyue<br />
<br />
Choi, Phoebe<br />
<br />
'''Title:''' APTOS 2019 Blindness Detection<br />
<br />
'''Description:'''<br />
<br />
Our team chose the APTOS 2019 Blindness Detection Challenge from Kaggle. The goal of this challenge is to build a machine learning model that detects diabetic retinopathy by screening retina images.<br />
<br />
Millions of people suffer from diabetic retinopathy, the leading cause of blindness among working-aged adults. It is caused by damage to the blood vessels of the light-sensitive tissue at the back of the eye (retina). In rural areas where medical screening is difficult to conduct, it is challenging to detect the disease efficiently. Aravind Eye Hospital hopes to utilize machine learning techniques to gain the ability to automatically screen images for disease and provide information on how severe the condition may be.<br />
<br />
Our team plans to solve this problem by applying our knowledge in image processing and classification.<br />
<br />
<br />
----<br />
<br />
'''Project # 2 Group members:'''<br />
<br />
Li, Dylan<br />
<br />
Li, Mingdao<br />
<br />
Lu, Leonie<br />
<br />
Sharman,Bharat<br />
<br />
'''Title:''' Risk prediction in life insurance industry using supervised learning algorithms<br />
<br />
'''Description:'''<br />
<br />
In this project, we aim to replicate and possibly improve upon the work of Jayabalan et al. in their paper “Risk prediction in life insurance industry using supervised learning algorithms”. We will use the Prudential Life Insurance dataset that the authors used and have shared with us. We will pre-process the data to replace missing values, perform feature selection using CFS and feature reduction using PCA, and use the processed data to perform classification via four algorithms: Neural Networks, Random Tree, REPTree, and Multiple Linear Regression. We will compare the performance of these algorithms using the MAE and RMSE metrics and produce visualizations that explain the results clearly, even to a non-quantitative audience. <br />
<br />
Our goal in this project is to apply the algorithms we learned in class to an industry dataset and produce results that aid better, data-driven decision making.<br />
<br />
----<br />
<br />
'''Project # 3 Group members:'''<br />
<br />
Parco, Russel<br />
<br />
Sun, Scholar<br />
<br />
Yao, Jacky<br />
<br />
Zhang, Daniel<br />
<br />
'''Title:''' Lyft Motion Prediction for Autonomous Vehicles<br />
<br />
'''Description:''' <br />
<br />
Our team has decided to participate in the Lyft Motion Prediction for Autonomous Vehicles Kaggle competition. The aim of this competition is to build a model which given a set of objects on the road (pedestrians, other cars, etc), predict the future movement of these objects.<br />
<br />
Autonomous vehicles (AVs) are expected to dramatically redefine the future of transportation. However, there are still significant engineering challenges to be solved before one can fully realize the benefits of self-driving cars. One such challenge is building models that reliably predict the movement of traffic agents around the AV, such as cars, cyclists, and pedestrians.<br />
<br />
Our aim is to apply classification techniques learned in class to optimally predict how these objects move.<br />
<br />
----<br />
<br />
'''Project # 4 Group members:'''<br />
<br />
Chow, Jonathan<br />
<br />
Dharani, Nyle<br />
<br />
Nasirov, Ildar<br />
<br />
'''Title:''' Classification with Abstinence<br />
<br />
'''Description:''' <br />
<br />
We seek to implement the algorithm described in [https://papers.nips.cc/paper/9247-deep-gamblers-learning-to-abstain-with-portfolio-theory.pdf Deep Gamblers: Learning to Abstain with Portfolio Theory]. The paper describes augmenting classification problems to include the option of abstaining from making a prediction when confidence is low.<br />
<br />
Medical imaging diagnostics is a field in which classification could assist professionals and improve life expectancy for patients through increased accuracy. However, there are also severe consequences to incorrect predictions. As such, we also hope to apply the algorithm implemented to the classification of medical images, specifically instances of normal and pneumonia [https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia? chest x-rays]. <br />
<br />
----<br />
<br />
'''Project # 5 Group members:'''<br />
<br />
Jones, Hayden<br />
<br />
Leung, Michael<br />
<br />
Haque, Bushra<br />
<br />
Mustatea, Cristian<br />
<br />
'''Title:''' Combine Convolution with Recurrent Networks for Text Classification<br />
<br />
'''Description:''' <br />
<br />
Our team chose to reproduce the paper [https://arxiv.org/pdf/2006.15795.pdf Combine Convolution with Recurrent Networks for Text Classification] on arXiv. The goal of this paper is to combine CNN and RNN architectures in a way that merges the outputs of both architectures more flexibly than simple concatenation, through the use of a “neural tensor layer”, for the purpose of improving text classification. In particular, the paper claims that this novel architecture excels at the following types of text classification: sentiment analysis, news categorization, and topical classification. Our team plans to recreate this paper by working in two pairs: one pair to implement the CNN pipeline and the other to implement the RNN pipeline. We will work with TensorFlow 2 and Google Colab, reproducing the paper’s experimental results by training on the same 6 publicly available datasets found in the paper.<br />
<br />
----<br />
<br />
'''Project # 6 Group members:'''<br />
<br />
Chin, Ruixian<br />
<br />
Ong, Jason<br />
<br />
Chiew, Wen Cheen<br />
<br />
Tan, Yan Kai<br />
<br />
'''Title:''' Mechanisms of Action (MoA) Prediction<br />
<br />
'''Description:''' <br />
<br />
Our team chose to participate in the Kaggle research challenge "Mechanisms of Action (MoA) Prediction". This competition is a project within the Broad Institute of MIT and Harvard; together with the Laboratory for Innovation Science at Harvard (LISH) and the NIH Common Fund's Library of Integrated Network-Based Cellular Signatures (LINCS), it presents this challenge with the goal of advancing drug development through improvements to MoA prediction algorithms.<br />
----<br />
<br />
'''Project # 7 Group members:'''<br />
<br />
Ren, Haotian <br />
<br />
Cheung, Ian Long Yat<br />
<br />
Hussain, Swaleh <br />
<br />
Zahid, Bin, Haris <br />
<br />
'''Title:''' Transaction Fraud Detection <br />
<br />
'''Description:''' <br />
<br />
Protecting people from fraudulent transactions is an important topic for all banks and internet security companies. This Kaggle project is based on a dataset from the IEEE Computational Intelligence Society (IEEE-CIS). Our objective is to build a more efficient model that recognizes fraudulent transactions with higher accuracy and speed.<br />
----<br />
<br />
'''Project # 8 Group members:'''<br />
<br />
ZiJie, Jiang<br />
<br />
Yawen, Wang<br />
<br />
DanMeng, Cui<br />
<br />
MingKang, Jiang<br />
<br />
'''Title:''' Lyft Motion Prediction for Autonomous Vehicles <br />
<br />
'''Description:'''<br />
<br />
Our team chose to participate in the Kaggle challenge "Lyft Motion Prediction for Autonomous Vehicles". We will apply our data science skills to build motion prediction models for self-driving vehicles. The model will predict the movement of traffic agents around the AV, such as cars, cyclists, and pedestrians. The goal of this competition is to predict the trajectories of these other traffic participants.<br />
<br />
----------------------------------------------------------------------<br />
<br />
<br />
'''Project # 9 Group members:'''<br />
<br />
Banno, Dion <br />
<br />
Battista, Joseph<br />
<br />
Kahn, Solomon <br />
<br />
'''Title:''' Increasing Spotify user engagement through predictive personalization<br />
<br />
'''Description:''' <br />
<br />
Our project is an application of classification to the domain of predictive personalization. The goal of the project is to increase Spotify user engagement through data-driven methods. Given a set of users’ demographic data, listening preferences and behaviour, our goal is to build a recommendation system that suggests new songs to users. From a potential pool of songs to suggest, the final song recommendations will be driven by a classification algorithm that measures a given user’s propensity to like a song. We plan on leveraging the Spotify Web API to gather data about songs and collecting user data from consenting peers.<br />
<br />
<br />
-----------------------------------------------------------------------<br />
<br />
'''Project # 10 Group members:'''<br />
<br />
Qing, Guo <br />
<br />
Wang, Yuanxin<br />
<br />
James, Ni<br />
<br />
Xueguang, Ma<br />
<br />
'''Title:''' Mechanisms of Action (MoA) Prediction<br />
<br />
'''Description:''' <br />
<br />
Our team has decided to participate in the Mechanisms of Action (MoA) Prediction Kaggle competition. This is a challenge with the goal of advancing drug development through improvements to MoA prediction algorithms.<br />
Our team plans to develop an algorithm to predict a compound’s MoA given its cellular signature; our goal is to apply the various algorithms taught in this course.<br />
<br />
<br />
-----------------------------------------------------------------------<br />
<br />
'''Project # 11 Group members:'''<br />
<br />
Yang, Jiwon <br />
<br />
Mahdi, Anas<br />
<br />
Thibault, Will<br />
<br />
Lau, Jan<br />
<br />
'''Title:''' Application of classification in human fatigue analysis<br />
<br />
'''Description:''' <br />
<br />
The goal of this project is to classify different levels of fatigue based on motion capture (Vicon) and force plates data. First, we plan to obtain data from 4 to 6 participants performing squats or squats with weights and rate them on a fatigue scale, with each participant doing at least 50 to 100 reps. We will collect data with EMG, IMU, force plates, and Vicon. When the participants are squatting, we will ask them about their fatigue level, and compare their feedback against the fatigue level recorded on EMG. The fatigue level will be on a scale of 1 to 10 (1 being not fatigued at all and 10 being cannot continue anymore). Once data is collected, we will classify the motion capture and force plates data into the different levels of fatigue.<br />
<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 12 Group members:'''<br />
<br />
Xiaolan Xu, <br />
<br />
Robin Wen, <br />
<br />
Yue Weng, <br />
<br />
Beizhen Chang<br />
<br />
'''Title:''' Identification (Classification) of Submillimetre Galaxies Based on Multiwavelength Data in Astronomy<br />
<br />
'''Description:''' <br />
<br />
Identifying the counterparts of submillimetre galaxies (SMGs) in multiwavelength images is important to the study of galaxy evolution in astronomy. However, obtaining a statistically significant sample of robust associations is very challenging because of the poor angular resolution of single-dish submm facilities; that is, we cannot tell which galaxy is actually responsible for the submillimetre emission from a group of possible candidates. Recently, a labelled dataset was obtained from ALMA, a millimetre/submillimetre telescope array with sufficient resolution to pin down the exact source of the submillimetre emission. However, applying such an array to a large fraction of the sky is not feasible, so it is of practical interest to develop algorithms that identify SMGs based on the other available data. With this newly labelled dataset from ALMA, it is possible to test and develop new algorithms and apply them to unlabelled data to detect submillimetre galaxies.<br />
<br />
In our work, we primarily build on the work of Liu et al. (https://arxiv.org/abs/1901.09594), who applied a set of standard classification algorithms to the dataset. We aim to first reproduce their work and test other classification algorithms from a more statistics-centred perspective. Next, we hope to extend their work in one or more of the following directions: (1) incorporating other relevant features to augment the dimensions of the available dataset for a better classification rate; (2) taking measurement error into account in the classification algorithms, possibly from a Bayesian approach (all features in astronomy datasets come from actual physical measurements, which come with an error bar, but it is not clear how to incorporate this error into the classification task); (3) the possibility of combining some traditional astronomy approaches with ML algorithms.<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 13 Group members:'''<br />
<br />
<br />
Zihui (Betty) Qin,<br />
<br />
Wenqi (Maggie) Zhao,<br />
<br />
Muyuan Yang,<br />
<br />
Amartya (Marty) Mukherjee,<br />
<br />
'''Title:''' Insider Trading Roles Classification Prediction on United States conventional stock or non-derivative transaction<br />
<br />
'''Description:'''<br />
<br />
Background (why we were interested in classifying based on insiders): <br />
The United States has one of the most frequently traded financial markets in the world. The dataset captures all insider activities as reported on SEC (U.S. Securities and Exchange Commission) forms 3, 4, 5, and 144. We believe that using variables such as transaction date, security type, and transaction amount, we can predict the role code for a new transaction. The reason for this choice of prediction target is that the role of the insider gives investors signals of potential internal activities and private information. This is crucial for investors who want to detect important market signals from insider trading activities so that they can benefit from the market. <br />
<br />
Goal: To classify the role of an insider in a company based on the data of their trades.<br />
<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 14 Group members:'''<br />
<br />
Jung, Kyle<br />
<br />
Kim, Dae Hyun<br />
<br />
Lee, Stan<br />
<br />
Lim, Seokho<br />
<br />
'''Title:''' Mechanisms of Action (MoA) Prediction Competition<br />
<br />
'''Description:''' The main objective of this Kaggle competition is to help develop an algorithm that predicts a compound's MoA given its cellular signature, helping scientists advance the drug discovery process. Our plan is to apply concepts and algorithms learned in STAT 441 to this multi-label classification problem. Through the process, our team will learn the biological background necessary to complete and enhance our classification thought process. https://www.kaggle.com/c/lish-moa<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 15 Group Members:'''<br />
<br />
Li, Evan<br />
<br />
Abuaisha, Karam<br />
<br />
Vadivelu, Nicholas<br />
<br />
Pu, Jason<br />
<br />
'''Title:''' Predict Students Answering Ability Kaggle Competition<br />
<br />
'''Description:'''<br />
<br />
https://www.kaggle.com/c/riiid-test-answer-prediction<br />
We plan on tackling this Kaggle competition, which revolves around classifying whether students will answer their next questions correctly. The data provided consists of each student’s historical performance, the performance of other students on the same question, metadata about the question itself, and more. The theme of the competition is tailoring education to each student’s ability with an AI tutor.<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 16 Group members:'''<br />
<br />
Hall, Matthew<br />
<br />
Chalaturnyk, Johnathan<br />
<br />
'''Title:''' Predicting CO and NOx emissions from gas turbines: novel data and a benchmark PEMS<br />
<br />
'''Description:'''<br />
<br />
Predictive emission monitoring systems (PEMS) are used in conjunction with measurement instruments to predict the amount of emissions produced by gas turbine engines. Implementing such a system relies on the availability of proper measurements and ecological data points. We will attempt to adjust the novel PEMS implementation from this paper in the hopes of improving the prediction of CO and NOx emission levels from the turbines. Using data points collected over the previous five years, we will apply a number of machine learning algorithms and discuss possible future research directions. Finally, we will compare our methods against the benchmark presented in the paper in order to measure the effectiveness of our solutions.<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 17 Group members:'''<br />
<br />
Yang, Junyi<br />
<br />
Wang, Jill Yu Chieh<br />
<br />
Wu, Yu Min<br />
<br />
Li, Calvin<br />
<br />
'''Title:''' Humpback Whale Identification<br />
<br />
'''Description:'''<br />
<br />
Our team will participate in the Kaggle challenge Humpback Whale Identification. The main objective is to build a multi-class classification model that identifies individual whales based on images of their tails. There are over 3,000 classes and 25,361 training images in total. The challenge is that each class has, on average, only about 8 training images. <br />
<br />
------------------------------------------------------------------------<br />
'''Project # 18 Group members:''' <br />
<br />
Lian, Jinjiang <br />
<br />
Zhu, Yisheng <br />
<br />
Huang, Mingzhe <br />
<br />
Hou, Jiawen <br />
<br />
'''Title:''' Mechanisms of Action (MoA) Prediction <br />
<br />
'''Description:''' <br />
<br />
Our team's final project is the ongoing Kaggle competition Mechanism of Action (MoA) Prediction. The goal is to improve the MoA prediction algorithm to assist and advance drug development. MoA prediction helps scientists design more targeted drug molecules based on the biological mechanism of a disease, which would significantly shorten the drug development cycle. MoA is studied by applying different drugs to human cells and analyzing the corresponding reactions; the dataset provides simultaneous measurements of 100 types of human cells across 5000 drugs. <br />
<br />
To tackle this competition, after data cleaning and feature engineering, we are going to try a selection of ML algorithms such as logistic regression, tree-based methods, and SVMs, and identify the one that best completes the task. Depending on how we perform, we might also use other techniques such as model ensembling or stacking.<br />
<br />
------------------------------------------------------------------------<br />
'''Project # 19 Group members:''' <br />
<br />
Fagan, Daniel <br />
<br />
Brooke, Cooper <br />
<br />
Perelman, Maya <br />
<br />
'''Title:''' Mechanisms of Action (MoA) Prediction (https://www.kaggle.com/c/lish-moa/overview/description)<br />
<br />
'''Description:''' <br />
<br />
For our final project, we will be competing in the Mechanisms of Action (MoA) Prediction Research Challenge on Kaggle. MoA refers to a description of the biological activity of a given molecule, and scientists have a specific interest in the MoA of molecules as it pertains to the advancement of drugs. This is because, under new frameworks, scientists are looking to develop molecules that can modulate protein targets associated with given diseases. Our task will be to analyze a dataset containing human cellular responses to more than 5,000 drugs and to classify these responses with one or more MoAs.<br />
<br />
For this competition, we plan to use various classification algorithms taught in STAT 441 followed by model validation techniques to ultimately select the most accurate model based on the logarithmic loss function which was specified by Kaggle.<br />
<br />
------------------------------------------------------------------------<br />
'''Project # 20 Group members:''' <br />
Cheng, Leyan<br />
<br />
Dai, Mingyan<br />
<br />
Jiang, Daniel <br />
<br />
Huang, Jerry<br />
<br />
'''Title:''' Riiid! Answer Correctness Prediction<br />
<br />
'''Description:'''<br />
<br />
We will be competing in the Riiid! Kaggle Challenge. The goal of this challenge is to create algorithms for "knowledge tracing": modeling student knowledge over time in order to accurately predict how students will perform on future interactions.<br />
<br />
We plan on using the classification techniques and model validation techniques learned in the course in order to design an algorithm that can accurately predict the actions of students.<br />
<br />
------------------------------------------------------------------------<br />
'''Project # 21 Group members:''' <br />
<br />
Carson, Emilee<br />
<br />
Ellmen, Isaac<br />
<br />
Mohammadrezaei, Dorsa<br />
<br />
Budaraju, Sai Arvind<br />
<br />
<br />
'''Title:''' Classifying SARS-CoV-2 region of origin based on DNA/RNA sequence<br />
<br />
'''Description:'''<br />
<br />
Determining the location of origin for a viral sequence is an important tool for epidemiological tracking. Knowing where a virus comes from allows epidemiologists to track how a virus is spreading. There are significant efforts to track the spread of SARS-CoV-2. As an RNA virus, SARS-CoV-2 mutates frequently. Most of these mutations carry negligible changes to the function of the virus but act as “barcodes” for specific strains. As the virus spreads in a region, it picks up mutations which allow researchers to identify which sequences correspond to which regions.<br />
<br />
The standard method for classifying viruses based on location is to:<br />
<br />
- Perform a multiple sequence alignment (MSA)<br />
<br />
- Build a phylogenetic tree of the MSA<br />
<br />
- Empirically determine which regions have which sections of the tree<br />
<br />
Phylogenetic trees are an excellent tool for tracking evolutionary changes over time but we wonder if there are better methods for classifying the region of origin for a virus using machine learning techniques.<br />
<br />
Our plan is to perform PCA on the MSA which is available through GISAID. We will determine an appropriate encoding for sequence alignments to vectors and map the aligned sequences onto a much lower dimensional space. We will then use LDA or QDA to classify points based on region (continent). We will also examine if the same technique works well for classifying sequences based on state of origin for samples from the United States. We may try other classification techniques such as logistic regression or neural nets. Finally, we know that projecting data to a small number of principal components and then projecting back to the original space can reduce noise in certain datasets. In the case of mutations, this might correspond to removing insignificant mutations. It is possible that there are certain mutations which induce functional changes in the virus which would be of greater medical interest. Our hope is that we could detect these using PCA.<br />
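The pipeline described above can be sketched on synthetic stand-in data; the encoding, dimensions, and the nearest-class-mean classifier standing in for LDA/QDA are all placeholder assumptions, not part of the actual GISAID analysis:<br />

```python
import numpy as np

# Sketch of the planned pipeline on synthetic stand-in data: encode aligned
# sequences as vectors, project onto the top principal components via SVD,
# then classify region by nearest class mean in the reduced space (a simple
# stand-in for LDA/QDA; the data here are random placeholders).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 400))   # placeholder "encoded sequence" matrix
y = np.arange(60) % 3            # placeholder region labels (3 continents)

Xc = X - X.mean(axis=0)                     # centre before PCA
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
X_low = Xc @ Vt[:10].T                      # project onto top 10 components

means = np.stack([X_low[y == k].mean(axis=0) for k in range(3)])
pred = np.argmin(np.linalg.norm(X_low[:, None] - means[None], axis=2), axis=1)
```

On real data, the random matrix would be replaced by a one-hot (or similar) encoding of the MSA, and the nearest-mean step by a proper LDA or QDA fit.<br />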
<br />
------------------------------------------------------------------------<br />
'''Project # 22 Group members:''' <br />
<br />
Chang, Luwen<br />
<br />
Yu, Qingyang<br />
<br />
Kong, Tao <br />
<br />
Sun, Tianrong<br />
<br />
'''Title:''' Riiid! Answer Correctness Prediction<br />
<br />
'''Description:'''<br />
<br />
For the final project, we chose the featured Kaggle Competition named Riiid! Answer Correctness Prediction. The purpose of this challenge is to build a machine learning model to predict the students' interaction performance. (https://www.kaggle.com/c/riiid-test-answer-prediction)<br />
<br />
We plan to use classification and regression techniques learned in this course to build the model and use area under ROC curve to evaluate our model.<br />
<br />
------------------------------------------------------------------------<br />
'''Project # 23 Group members:''' <br />
<br />
Han, Jihoon<br />
<br />
Vera De Casey<br />
<br />
Jawad Solaiman<br />
<br />
'''Title:''' Lyft Motion Prediction for Autonomous Vehicles<br />
<br />
'''Description:'''<br />
<br />
We are planning to compete in the Lyft Motion Prediction for Autonomous Vehicles Challenge on Kaggle. Our goal is to build a motion prediction model for self-driving cars by applying our machine learning knowledge to the provided training and test data sets. The model will predict the motion of traffic agents around the car, such as cars, cyclists, and pedestrians. We are not sure whether we have to classify the agents into these three categories ourselves; if so, we will initially start with a single-shot detector algorithm and improve upon it.<br />
<br />
------------------------------------------------------------------------<br />
'''Project # 24 Group members:''' <br />
<br />
Guanting Pan<br />
<br />
Haocheng Chang <br />
<br />
Zaiwei Zhang<br />
<br />
'''Title:''' Reproducing result in Accelerated Stochastic Power Iteration<br />
<br />
'''Description:'''<br />
<br />
For our final project, we will reproduce the stochastic PCA algorithm designed by De Sa, He, Mitliagkas, Ré, and Xu, which accelerates the iteration complexity of power iteration. In doing so, we aim to achieve the final rate of 𝒪(1/sqrt(Δ)) reported in the paper. If time permits, we also hope to explore and discuss the potential of applying this acceleration method to other non-convex optimization problems, as mentioned in the original paper. Link to the paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6557638/pdf/nihms-993807.pdf<br />
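For reference, plain power iteration — the deterministic, non-accelerated baseline that the paper's stochastic method improves upon — can be sketched as follows (the matrix and iteration count are illustrative):<br />

```python
import numpy as np

# Plain power iteration for the top eigenvector of a symmetric matrix --
# the deterministic baseline that the accelerated stochastic variant
# (the subject of the paper) improves upon.
def power_iteration(A, iters=1000):
    v = np.random.default_rng(0).normal(size=A.shape[0])
    for _ in range(iters):
        v = A @ v                 # multiply by A ...
        v /= np.linalg.norm(v)    # ... and renormalize
    return v

A = np.array([[2.0, 0.0], [0.0, 1.0]])  # eigenvalues 2 and 1
v = power_iteration(A)                  # converges to +/- [1, 0]
```

The accelerated stochastic version replaces the exact matrix-vector product with stochastic estimates plus a momentum term, which is what yields the improved dependence on the eigengap Δ.<br />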
<br />
------------------------------------------------------------------------<br />
'''Project # 25 Group members:''' <br />
<br />
Haoran Dong<br />
<br />
Mushi Wang<br />
<br />
Siyuan Qiu<br />
<br />
Yan Yu<br />
<br />
'''Title:''' Lyft Motion Prediction for Autonomous Vehicles<br />
<br />
'''Description:'''<br />
<br />
We want to take part in the Kaggle challenge "Lyft Motion Prediction for Autonomous Vehicles". The goal is to build a motion prediction model for self-driving cars using machine learning on the datasets provided.<br />
<br />
------------------------------------------------------------------------<br />
'''Project # 26 Group members:''' <br />
<br />
Sangeeth Kalaichanthiran<br />
<br />
Evan Peters<br />
<br />
Cynthia Mou<br />
<br />
Yuxin Wang<br />
<br />
'''Title:''' Mechanisms of Action (MoA) Prediction<br />
<br />
'''Description:'''<br />
<br />
Our team chose the "Mechanisms of Action (MoA) Prediction" challenge on Kaggle. A mechanism of action (MoA) describes the biological response of human cells to a particular molecule (the drug). The goal is to develop an algorithm that can predict the biological response to a drug based on its similarities to other known drugs. <br />
<br />
Our team hopes to develop a superior algorithm by using our knowledge of supervised learning methods.<br />
<br />
------------------------------------------------------------------------<br />
'''Project # 27 Group members:''' <br />
<br />
Delaney Smith<br />
<br />
Mohammad Assem Mahmoud<br />
<br />
'''Title:''' Replicating "Electrocardiogram heartbeat classification based on a deep convolutional<br />
neural network and focal loss"<br />
<br />
'''Description:'''<br />
<br />
For our project, we intend to replicate and, hopefully, extend the work of Romdhane et al.'s 2020 paper "Electrocardiogram heartbeat classification based on a deep convolutional neural network and focal loss". In this paper, the authors develop a deep convolutional neural network that exploits a novel loss function, focal loss, to classify heartbeats into five arrhythmia categories (N, S, V, Q and F) based on the AAMI standard. The network was trained and tested on two ECG datasets, MIT-BIH and INCART, and achieved a 98.41% overall accuracy, a 98.38% overall F1-score, a 98.37% overall precision and a 98.41% overall recall, which we intend to replicate. <br />
Interestingly, focal loss was implemented to prevent bias towards the larger class (normal heartbeats) without needing to augment the smaller class (diseased heartbeats); however, the authors did not determine which of these two approaches actually performs better. We therefore hope to extend their work by answering this question in this project.<br />
------------------------------------------------------------------------<br />
'''Project # 28 Group members:''' <br />
<br />
Fang Yuqin<br />
<br />
Fu Rao<br />
<br />
Li Siqi<br />
<br />
Zhou Zeping<br />
<br />
'''Title:''' The Spectrum of the Fisher Information Matrix of a Single-Hidden-Layer Neural Network<br />
<br />
'''Description:'''<br />
Our group aims to dig deeper into single-hidden-layer neural networks, building on what we have learned in class. We will focus on data and weights that follow Gaussian distributions, so that we can provide expressions for the spectrum of the Fisher information matrix in the limit of infinite width. We believe the spectrum can be applied to improve the efficiency of first-order optimization methods. <br />
------------------------------------------------------------------------<br />
'''Project # 29 Group members:''' <br />
<br />
Rui Gong<br />
<br />
Xuetong Wang<br />
<br />
Xinqi Ling<br />
<br />
Di Ma<br />
<br />
'''Title:''' Riiid! Answer Correctness Prediction<br />
<br />
'''Description:'''<br />
<br />
We will take on the "Riiid! Answer Correctness Prediction" Kaggle competition. We will predict a student's performance on a particular question based on their historic performance, the performance of other students on the same question, and information about the question itself (such as its difficulty, length, etc.). https://www.kaggle.com/c/riiid-test-answer-prediction/overview<br />
------------------------------------------------------------------------<br />
'''Project # 30 Group members:''' <br />
<br />
Jiabao Dong<br />
<br />
Jiaxiang Liu<br />
<br />
Siyuan Xia<br />
<br />
Yipeng Du<br />
<br />
'''Title:''' Privacy-Preserving Classification of Personal Text Messages with Secure Multi-Party Computation<br />
<br />
'''Description:'''<br />
We aim to replicate the work demonstrated in [https://papers.nips.cc/paper/8632-privacy-preserving-classification-of-personal-text-messages-with-secure-multi-party-computation.pdf Privacy-Preserving Classification of Personal Text Messages with Secure Multi-Party Computation]. <br />
<br />
Personal text classification has many useful applications, such as mental health care and security surveillance, but it also raises concerns about personal privacy. The method proposed in this paper is based on Secure Multiparty Computation (SMC) and avoids (un)intentional privacy violations. It extracts features from texts and classifies them with logistic regression and tree ensembles. The paper claims to propose the first privacy-preserving (PP) solution for text classification that is provably secure.<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 31 Group members:''' <br />
<br />
Tompkins, Grace<br />
<br />
Krikella, Tatiana<br />
<br />
'''Title:''' A comparison of machine learning algorithms and covariate balance measures for propensity score matching and weighting (2018) <br />
'''Description:'''<br />
We will be reproducing the results of "A comparison of machine learning algorithms and covariate balance measures for propensity score matching and weighting" by Cannas and Arpino (2018). This paper uses simulated data and several machine learning algorithms to estimate causal effects in observational studies. The machine learning methods used include CART, Bagging, Boosting, Random Forest, Neural Networks, and Naive Bayes. There are also several variations of measures of covariate balancing used in the study. The importance of tuning the machine learning algorithms' hyperparameters is also investigated with respect to propensity score estimation. <br />
<br />
We will use R for analysis.<br />
<br />
Link to paper: [http://papers.nips.cc/paper/8520-adapting-neural-networks-for-the-estimation-of-treatment-effects]<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 32 Group members:''' <br />
<br />
Taohao Wang<br />
Zeren Shen<br />
Zihao Guo<br />
Rui Chen<br />
<br />
'''Title:''' Google Landmark Recognition 2020<br />
<br />
'''Description:'''<br />
Our team decided to try the "Google Landmark Recognition 2020" Kaggle competition, in which competitors are asked to build a model that detects any existing landmarks within the provided test images. This competition is challenging in its own way: the data contain more than 81K classes, on which a traditional CNN would very likely fail (too many parameters to train, especially when taking convolutional layers into account). We would like to implement several algorithms/frameworks that can utilize a large amount of data with noisy labels, apply them to the provided dataset, and compare their performance (training time, number of parameters trained, multiple metrics for accuracy/loss evaluation, etc.) in our report.<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 33 Group members:''' <br />
<br />
Hansa Halim<br />
<br />
Sanjana Rajendra Naik<br />
<br />
Samka Marfua<br />
<br />
Shawrupa Proshasty<br />
<br />
'''Title:''' Superhuman AI for multiplayer poker (Brown and Sandholm 2019)<br />
<br />
'''Description:'''<br />
Our team aims to recreate the paper "Superhuman AI for multiplayer poker" by Noam Brown and Tuomas Sandholm. The paper describes the algorithm the authors used to train an AI to play poker, primarily Monte Carlo counterfactual regret minimization (CFR). Poker is a great example for training AI with incomplete information, and since it is a multiplayer game, training the AI is further complicated. The authors use abstraction, both information abstraction and action abstraction, to reduce the number of distinct actions the AI must consider.<br />
We aim to replicate this algorithm, starting with at least 2 players.<br />
<br />
Link to paper: [https://www.cs.cmu.edu/~noamb/papers/19-Science-Superhuman.pdf Paper]</div>Gtompkinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841f11&diff=45329stat841f112020-11-18T16:30:40Z<p>Gtompkin: /* Case 2: Linearly Non-Separable Data (Soft Margin) */</p>
<hr />
<div>== [[stat841f14 | Data Visualization (Stat 442 / 842, CM 762 - Fall 2014) ]] ==<br />
== Archive ==<br />
==[[f11Stat841proposal| Proposal for Final Project]]==<br />
==[[f11Stat841presentation| Presentation Sign Up]]==<br />
<br />
==[[f11Stat841EditorSignUp| Editor Sign Up]]==<br />
<br />
= STAT 441/841 / CM 463/763 - Tuesday, 2011/09/20 =<br />
== Wiki Course Notes ==<br />
Students will need to contribute to the wiki for 20% of their grade.<br />
Access via wikicoursenote.com<br />
Go to editor sign-up, and use your UW userid for your account name, and use your UW email.<br />
<br />
primary (10%)<br />
Post a draft of lecture notes within 48 hours. <br />
You will need to do this 1 or 2 times, depending on class size.<br />
<br />
secondary (10%)<br />
Make improvements to the notes for at least 60% of the lectures.<br />
More than half of your contributions should be technical rather than editorial.<br />
There will be a spreadsheet where students can indicate what they've done and when.<br />
The instructor will conduct random spot checks to ensure that students have contributed what they claim.<br />
<br />
<br />
== Classification (Lecture: Sep. 20, 2011) ==<br />
<br />
===Introduction===<br />
''Machine learning'' (ML) methodology in general is an artificial intelligence approach to establish and train a model to recognize the pattern or underlying mapping of a system based on a set of training examples consisting of input and output patterns. Unlike in classical statistics where inference is made from small datasets, machine learning involves drawing inference from an overwhelming amount of data that could not be reasonably parsed by manpower.<br />
<br />
In machine learning, pattern recognition is the assignment of some sort of output value (or label) to a given input value (or instance), according to some specific algorithm. The approach of using examples to produce the output labels is known as ''learning methodology''. When the underlying function from inputs to outputs exists, it is referred to as the target function. The estimate of the target function which is learned or output by the learning algorithm is known as the solution of learning problem. In case of classification this function is referred to as the ''decision function''. <br />
<br />
In the broadest sense, any method that incorporates information from training samples in the design of a classifier employs learning. Learning tasks can be classified along different dimensions. One important dimension is the distinction between supervised and unsupervised learning. In supervised learning, a category label for each pattern in the training set is provided, and the trained system will then generalize to new data samples. In unsupervised learning, on the other hand, the training data has not been labeled; the system forms clusters or natural groupings of the input patterns based on some measure of similarity, which can then be used to determine the correct output value for new data instances. <br />
<br />
The first category is known as ''pattern classification'' and the second one as ''clustering''. Pattern classification is the main focus in this course. <br />
<br />
<br />
'''Classification problem formulation''': Suppose that we are given ''n'' observations. Each observation consists of a pair: a vector <math>\mathbf{x}_i \in \mathbb{R}^d, \quad i=1,...,n</math>, and the associated label <math>y_i</math>,<br />
where <math>\mathbf{x}_i = (x_{i1}, x_{i2}, ..., x_{id}) \in \mathcal{X} \subset \mathbb{R}^d</math> and <math>y_i</math> belongs to some finite set <math>\mathcal{Y}</math>.<br />
<br />
The classification task is now to find a function <math>f:\mathbf{x}_i\mapsto y</math> which maps the input data points to a target value (i.e. class label). The function <math>f(\mathbf{x},\theta)</math> is defined by a set of parameters <math>\mathbf{\theta}</math>, and the goal is to train the classifier so that, among all possible mappings with different parameters, the obtained decision boundary gives the minimum classification error.<br />
<br />
=== Definitions ===<br />
<br />
The '''true error rate''' for classifier <math>h</math> is the error with respect to the unknown underlying distribution when predicting a discrete random variable Y from a given input X.<br />
<br />
<math>L(h) = P(h(X) \neq Y )</math><br />
<br />
<br />
The '''empirical error rate''' is the error of our classification function <math>h(x)</math> on a given dataset with known outputs (e.g. training data, test data)<br />
<br />
<math>\hat{L}_n(h) = (1/n) \sum_{i=1}^{n} \mathbf{I}(h(X_i) \neq Y_i)</math><br />
where <math>h</math> is a classifier<br />
and <math>\mathbf{I}(\cdot)</math> is the indicator function, defined by <br />
<br />
<math>\mathbf{I}(x) = \begin{cases} <br />
1 & \text{if } x \text{ is true} \\<br />
0 & \text{if } x \text{ is false}<br />
\end{cases}</math><br />
<br />
So in this case,<br />
<math>\mathbf{I}(h(X_i)\neq Y_i) = \begin{cases}<br />
1 & \text{if } h(X_i)\neq Y_i \text{ (i.e. misclassification)} \\<br />
0 & \text{if } h(X_i)=Y_i \text{ (i.e. classified properly)}<br />
\end{cases}</math><br />
<br />
<br />
For example, suppose we have 100 new data points with known (true) labels<br />
<br />
<math>X_1 ... X_{100}</math><br />
<math>y_1 ... y_{100}</math><br />
<br />
To calculate the empirical error, we count how many times our function <math>h(X)</math> classifies incorrectly (does not match <math>y</math>) and divide by n=100.<br />
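As a minimal sketch (not part of the original notes), the empirical error rate can be computed directly; the classifier, data, and labels below are toy placeholders:<br />

```python
import numpy as np

def empirical_error(h, X, y):
    """Fraction of points where the classifier h disagrees with the true label y."""
    predictions = np.array([h(x) for x in X])
    return np.mean(predictions != y)

# Toy example: a threshold classifier on 1-d inputs.
X = np.array([0.2, 0.8, 0.5, 0.9])
y = np.array([0, 1, 1, 0])
h = lambda x: int(x > 0.5)
print(empirical_error(h, X, y))  # 2 of 4 points misclassified -> 0.5
```

With the 100 labelled points of the example above, the same function would return the fraction of the 100 predictions that disagree with the true labels.<br />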
<br />
=== Bayes Classifier ===<br />
The principle of the Bayes Classifier is to calculate the posterior probability of a given object from its prior probability via Bayes' Rule, and then assign the object to the class with the largest posterior probability<ref> http://www.wikicoursenote.com/wiki/Stat841#Bayes_Classifier </ref>.<br />
<br />
First recall Bayes' Rule, in the format<br />
<math>P(Y|X) = \frac{P(X|Y) P(Y)} {P(X)} </math> <br />
<br />
P(Y|X) : ''posterior'' , ''probability of <math>Y</math> given <math>X</math>''<br />
<br />
P(X|Y) : ''likelihood'', ''probability of <math>X</math> being generated by <math>Y</math>''<br />
<br />
P(Y) : ''prior'', ''probability of <math>Y</math> being selected''<br />
<br />
P(X) : ''marginal'', ''probability of obtaining <math>X</math>''<br />
<br />
<br />
We will start with the simplest case: <math>\mathcal{Y} = \{0,1\}</math><br />
<br />
<math> r(x) <br />
= P(Y=1|X=x) <br />
= \frac{P(X=x|Y=1) P(Y=1)} {P(X=x)}<br />
= \frac{P(X=x|Y=1) P(Y=1)} {P(X=x|Y=1) P(Y=1) + P(X=x|Y=0) P(Y=0)}</math><br />
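To make the formula for <math>r(x)</math> concrete, here is a small worked sketch. The class-conditional densities are assumed Gaussian, <math>X|Y=0 \sim N(0,1)</math> and <math>X|Y=1 \sim N(2,1)</math>, with priors <math>P(Y=1)=0.3</math> and <math>P(Y=0)=0.7</math>; these numbers are illustrative assumptions, not from the notes:<br />

```python
import math

# Assumed class-conditional densities: X|Y=0 ~ N(0,1), X|Y=1 ~ N(2,1),
# with priors P(Y=1)=0.3 and P(Y=0)=0.7 (illustrative values only).
def gauss_pdf(x, mu, sigma=1.0):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def r(x):
    num = gauss_pdf(x, 2.0) * 0.3        # P(X=x|Y=1) P(Y=1)
    den = num + gauss_pdf(x, 0.0) * 0.7  # + P(X=x|Y=0) P(Y=0)
    return num / den                     # posterior P(Y=1|X=x)

# Bayes classifier: predict 1 exactly when the posterior exceeds 1/2.
h_star = lambda x: int(r(x) > 0.5)
```

Points far to the left get posterior near 0 (label 0), points far to the right get posterior near 1 (label 1), and the decision boundary sits where the two weighted densities cross.<br />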
<br />
Bayes' rule can be approached by computing either one of the following:<br />
<br />
1) '''The posterior''': <math>\ P(Y=1|X=x) </math> and <math>\ P(Y=0|X=x) </math> <br />
<br />
2) '''The likelihood''': <math>\ P(X=x|Y=1) </math> and <math>\ P(X=x|Y=0) </math><br />
<br />
<br />
The former reflects a '''Bayesian''' approach. The Bayesian approach uses previous beliefs and observed data (e.g., the random variable <math>\ X </math>) to determine the probability distribution of the parameter of interest (e.g., the random variable <math>\ Y </math>). The probability, according to Bayesians, is a ''degree of belief'' in the parameter of interest taking on a particular value (e.g., <math>\ Y=1 </math>), given a particular observation (e.g., <math>\ X=x </math>). Historically, the difficulty in this approach lies with determining the posterior distribution. However, more recent methods such as '''Markov Chain Monte Carlo (MCMC)''' allow the Bayesian approach to be implemented <ref name="PCAustin">P. C. Austin, C. D. Naylor, and J. V. Tu, "A comparison of a Bayesian vs. a frequentist method for profiling hospital performance," ''Journal of Evaluation in Clinical Practice'', 2001</ref>.<br />
<br />
The latter reflects a '''Frequentist''' approach. The Frequentist approach assumes that the probability distribution (including the mean, variance, etc.) is fixed for the parameter of interest (e.g., the variable <math>\ Y </math>, which is ''not'' random). The observed data (e.g., the random variable <math>\ X </math>) is simply a ''sampling'' of a far larger population of possible observations. Thus, a certain repeatability or ''frequency'' is expected in the observed data. If it were possible to make an infinite number of observations, then the true probability distribution of the parameter of interest can be found. In general, frequentists use a technique called '''hypothesis testing''' to compare a ''null hypothesis'' (e.g. an assumption that the mean of the probability distribution is <math>\ \mu_0 </math>) to an alternative hypothesis (e.g. assuming that the mean of the probability distribution is larger than <math>\ \mu_0 </math>) <ref name="PCAustin"/>. For more information on hypothesis testing see <ref>R. Levy, "Frequency hypothesis testing, and contingency tables" class notes for LING251, Department of Linguistics, University of California, 2007. Available: [http://idiom.ucsd.edu/~rlevy/lign251/fall2007/lecture_8.pdf http://idiom.ucsd.edu/~rlevy/lign251/fall2007/lecture_8.pdf] </ref>. <br />
<br />
There was some class discussion on which approach should be used. Both the ease of computation and the validity of both approaches were discussed. A main point that was brought up in class is that Frequentists consider X to be a random variable, but they do not consider Y to be a random variable because it has to take on one of the values from a fixed set (in the above case it would be either 0 or 1 and there is only one ''correct'' label for a given value X=x). Thus, from a Frequentist's perspective it does not make sense to talk about the probability of Y. This is actually a grey area and sometimes ''Bayesians'' and ''Frequentists'' use each others' approaches. So using ''Bayes' rule'' doesn't necessarily mean you're a ''Bayesian''. Overall, the question remains unresolved.<br />
<br />
<br />
The '''Bayes Classifier''' uses <math>\ P(Y=1|X=x)</math><br />
<br />
<math> P(Y=1|X=x) = \frac{P(X=x|Y=1) P(Y=1)} {P(X=x|Y=1) P(Y=1) + P(X=x|Y=0) P(Y=0)}</math><br />
<br />
P(Y=1) : The Prior, probability of Y taking the value chosen<br />
<br />
denominator : Equivalent to P(X=x), for all values of Y, normalizes the probability <br />
<br />
<math>h(x) = <br />
\begin{cases} <br />
1 \ \ \hat{r}(x) > 1/2 \\<br />
0 \ \ otherwise<br />
\end{cases}<br />
</math><br />
<br />
The set <math>\mathcal{D}(h) = \{ x : P(Y=1|X=x) = P(Y=0|X=x) \} </math><br />
<br />
which defines a ''decision boundary''.<br />
<br />
<math>h^*(x) = <br />
\begin{cases}<br />
1 \ \ if \ \ P(Y=1|X=x) > P(Y=0|X=x) \\<br />
0 \ \ \ \ \ \ otherwise<br />
\end{cases}<br />
</math><br />
<br />
'''Theorem''': The Bayes Classifier is optimal, i.e., if <math>h</math> is any other classification rule, <br />
then <math>L(h^*) \le L(h)</math><br />
<br />
'''Proof''': Consider any classifier <math>h</math>. We can express the error rate as <br />
<br />
::<math> P( \{h(X) \ne Y \} ) = E_{X,Y} [ \mathbf{1}_{\{h(X) \ne Y \}} ] = E_X \left[ E_Y[ \mathbf{1}_{\{h(X) \ne Y \}}| X] \right] </math><br />
<br />
To minimize this last expression, it suffices to minimize the inner expectation. Expanding this expectation:<br />
<br />
::<math> E_Y[ \mathbf{1}_{\{h(X) \ne Y \}}| X] = \sum_{y \in Supp(Y)} P( Y = y | X) \mathbf{1}_{\{h(X) \ne y \} } </math><br />
which, in the two-class case, simplifies to<br />
<br />
::::<math> = P( Y = 0 | X) \mathbf{1}_{\{h(X) \ne 0 \} } + P( Y = 1 | X) \mathbf{1}_{\{h(X) \ne 1 \} } </math><br />
::::<math> = (1-r(X)) \mathbf{1}_{\{h(X) \ne 0 \} } + r(X) \mathbf{1}_{\{h(X) \ne 1 \} } </math><br />
<br />
where <math>r(x)</math> is defined as above. We should choose <math>h(X)</math> to equal the label that minimizes this sum: if <math>r(X)>1/2 </math>, then <math>1-r(X)<r(X)</math>, so we should let <math>h(X) = 1</math>, incurring the smaller cost <math>1-r(X)</math>. Thus the Bayes classifier is the optimal classifier. <br />
<br />
Why, then, do we need other classification methods? Because the densities of <math>X</math> are typically unknown; i.e., <math>f_k(x)</math> and/or <math>\pi_k</math> are unknown.<br />
<br />
<math>P(Y=k|X=x) = \frac{P(X=x|Y=k)P(Y=k)} {P(X=x)} = \frac{f_k(x) \pi_k} {\sum_k f_k(x) \pi_k}</math><br />
<br />
<math>f_k(x)</math> is referred to as the class conditional distribution (~likelihood).<br />
<br />
Therefore, we must rely on some data to estimate these quantities.<br />
<br />
=== Three Main Approaches ===<br />
<br />
'''1. Empirical Risk Minimization''':<br />
Choose a set of classifiers H (e.g., linear, neural network) and find <math>h^* \in H</math><br />
that minimizes (some estimate of) the true error, L(h).<br />
<br />
'''2. Regression''':<br />
Find an estimate (<math>\hat{r}</math>) of function <math>r</math> and define<br />
<math>h(x) = <br />
\begin{cases} <br />
1 \ \ \hat{r}(x) > 1/2 \\<br />
0 \ \ otherwise<br />
\end{cases}<br />
</math><br />
<br />
The <math> 1/2 </math> in the expression above is a threshold set for the regression prediction output. <br />
<br />
In general, ''regression'' refers to estimating a continuous, real-valued y. The problem here is more difficult because of the restricted range: y takes values in a discrete set of labels.<br />
<br />
'''3. Density Estimation''':<br />
Estimate <math>P(X=x|Y=0)</math> from <math>X_i</math>'s for which <math>Y_i = 0</math><br />
Estimate <math>P(X=x|Y=1)</math> from <math>X_i</math>'s for which <math>Y_i = 1</math><br />
and let <math>P(Y=y) = (1/n) \sum_{i=1}^{n} I(Y_i = y)</math><br />
<br />
Define <math>\hat{r}(x) = \hat{P}(Y=1|X=x)</math> and<br />
<math>h(x) = <br />
\begin{cases} <br />
1 \ \ \hat{r}(x) > 1/2 \\<br />
0 \ \ otherwise<br />
\end{cases}<br />
</math><br />
<br />
It is possible that there may not be enough data to use ''density estimation'', but the main problem lies with high dimensional spaces, as the estimation results may have a high error rate and sometimes estimation may be infeasible. The term ''curse of dimensionality'' was coined by Bellman <ref>R. E. Bellman, ''Dynamic Programming''. Princeton University Press,<br />
1957</ref> to describe this problem.<br />
<br />
As the dimension of the space increases, the number of data points required for learning increases exponentially.<br />
<br />
To learn more about methods for handling high-dimensional data see <ref> https://docs.google.com/viewer?url=http%3A%2F%2Fwww.bios.unc.edu%2F~dzeng%2FBIOS740%2Flecture_notes.pdf</ref><br />
<br />
The third approach is the simplest.<br />
<br />
=== Multi-Class Classification ===<br />
Generalize to the case where Y takes on <math>k>2</math> values.<br />
<br />
<br />
''Theorem'': For <math>Y \in \mathcal{Y} = \{1,2,..., k\} </math>, the optimal rule is<br />
<br />
<math>\ h^{*}(x) = argmax_k P(Y=k|X=x) </math> <br />
<br />
where <math>P(Y=k|X=x) = \frac{f_k(x) \pi_k} {\sum_r f_r(x) \pi_r}</math><br />
<br />
===Examples of Classification===<br />
<br />
* Face detection in images.<br />
* Medical diagnosis.<br />
* Detecting credit card fraud (fraudulent or legitimate).<br />
* Speech recognition.<br />
* Handwriting recognition.<br />
<br />
There are also some interesting reads on Bayes Classification:<br />
* http://esto.nasa.gov/conferences/estc2004/papers/b8p4.pdf (NASA)<br />
* http://www.cmla.ens-cachan.fr/fileadmin/Membres/vachier/Garcia6812.pdf (application to medical images)<br />
* http://www.springerlink.com/content/g221vh5m6744362r/ (Journal of Medical Systems)<br />
<br />
== LDA and QDA ==<br />
<br />
'''Discriminant function analysis''' finds features that best allow discrimination between two or more classes. The approach is similar to '''analysis of variance (ANOVA)''' in that discriminant function analysis looks at the mean values to determine if two or more classes are very different and should be separated. Once the discriminant functions (that separate two or more classes) have been determined, new data points can be classified (i.e. placed in one of the classes) based on the discriminant functions <ref> StatSoft, Inc. (2011). ''Electronic Statistics Textbook.'' [Online]. Available: [http://www.statsoft.com/textbook/discriminant-function-analysis/ http://www.statsoft.com/textbook/discriminant-function-analysis/.] </ref>. '''Linear discriminant analysis (LDA)''' and '''Quadratic discriminant analysis (QDA)''' are methods of discriminant analysis that are best applied to linearly and quadratically separable classes, respectively. '''Fisher discriminant analysis (FDA)''' is another method of discriminant analysis that is different from linear discriminant analysis, but the two terms are often used interchangeably.<br />
<br />
=== LDA ===<br />
<br />
The simplest method is to use approach 3 (above) and assume a parametric model for the densities; here, assume each class-conditional density is Gaussian.<br />
<br />
<math>\mathcal{Y} = \{ 0,1 \}</math> assumed (i.e., 2 labels)<br />
<br />
<math>h(x) = <br />
\begin{cases} <br />
1 \ \ P(Y=1|X=x) > P(Y=0|X=x) \\<br />
0 \ \ otherwise<br />
\end{cases}<br />
</math><br />
<br />
<math>P(Y=1|X=x) = \frac{f_1(x) \pi_1} {\sum_k f_k(x) \pi_k} \ \ </math> (the denominator equals <math>P(X=x)</math>)<br />
<br />
1) Assume Gaussian distributions<br />
<br />
<math>f_k(x) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} \text{exp}\big(-\frac{1}{2}(\mathbf{x-\mu_k})^T \Sigma_k^{-1}(\mathbf{x-\mu_k})\big)</math><br />
<br />
We must compare <br />
<math>\frac{f_1(x) \pi_1} {p(x)}</math> with <math>\frac{f_0(x) \pi_0} {p(x)}</math>.<br />
Since the denominator <math>p(x)</math> is common to both, it can be ignored, leaving a comparison of<br />
<math>f_1(x) \pi_1</math> with <math>f_0(x) \pi_0 </math>.<br />
<br />
To find the decision boundary, set <br />
<math>f_1(x) \pi_1 = f_0(x) \pi_0 </math><br />
<br />
<math> \frac{1}{(2\pi)^{d/2} |\Sigma_1|^{1/2}} exp(-\frac{1}{2}(\mathbf{x - \mu_1})^T \Sigma_1^{-1}(\mathbf{x-\mu_1}) )\pi_1 = \frac{1}{(2\pi)^{d/2} |\Sigma_0|^{1/2}} exp(-\frac{1}{2}(\mathbf{x -\mu_0})^T \Sigma_0^{-1}(\mathbf{x-\mu_0}) )\pi_0</math><br />
<br />
2) Assuming <math>\Sigma_1 = \Sigma_0</math>, we can write <math>\Sigma = \Sigma_0 = \Sigma_1</math>.<br />
<br />
<math> \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} exp(-\frac{1}{2}(\mathbf{x -\mu_1})^T \Sigma^{-1}(\mathbf{x-\mu_1}) )\pi_1 = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} exp(-\frac{1}{2}(\mathbf{x- \mu_0})^T \Sigma^{-1}(\mathbf{x-\mu_0}) )\pi_0</math><br />
<br />
3) Cancel <math>(2\pi)^{-d/2} |\Sigma|^{-1/2}</math> from both sides.<br />
<br />
<br />
<math> exp(-\frac{1}{2}(\mathbf{x - \mu_1})^T \Sigma^{-1}(\mathbf{x-\mu_1}) )\pi_1 = exp(-\frac{1}{2}(\mathbf{x - \mu_0})^T \Sigma^{-1}(\mathbf{x-\mu_0}) )\pi_0</math><br />
<br />
4) Take log of both sides.<br />
<br />
<math> -\frac{1}{2}(\mathbf{x - \mu_1})^T \Sigma^{-1}(\mathbf{x-\mu_1}) + \text{log}(\pi_1) = -\frac{1}{2}(\mathbf{x - \mu_0})^T \Sigma^{-1}(\mathbf{x-\mu_0}) + \text{log}(\pi_0)</math><br />
<br />
5) Subtract one side from both sides, leaving zero on one side.<br />
<br />
<br />
<math>-\frac{1}{2}(\mathbf{x - \mu_1})^T \Sigma^{-1} (\mathbf{x-\mu_1}) + \text{log}(\pi_1) - [-\frac{1}{2}(\mathbf{x - \mu_0})^T \Sigma^{-1} (\mathbf{x-\mu_0}) + \text{log}(\pi_0)] = 0 </math><br />
<br />
<br />
<math>\frac{1}{2}[-\mathbf{x}^T \Sigma^{-1}\mathbf{x} - \mathbf{\mu_1}^T \Sigma^{-1} \mathbf{\mu_1} + 2\mathbf{\mu_1}^T \Sigma^{-1} \mathbf{x}<br />
+ \mathbf{x}^T \Sigma^{-1}\mathbf{x} + \mathbf{\mu_0}^T \Sigma^{-1} \mathbf{\mu_0} - 2\mathbf{\mu_0}^T \Sigma^{-1} \mathbf{x} ]<br />
+ \text{log}(\frac{\pi_1}{\pi_0}) = 0 </math><br />
<br />
<br />
Cancelling out the terms quadratic in <math>\mathbf{x}</math> and rearranging results in <br />
<br />
<math>\frac{1}{2}[-\mathbf{\mu_1}^T \Sigma^{-1} \mathbf{\mu_1} + \mathbf{\mu_0}^T \Sigma^{-1} \mathbf{\mu_0}<br />
+ (2\mathbf{\mu_1}^T \Sigma^{-1} - 2\mathbf{\mu_0}^T \Sigma^{-1}) \mathbf{x}]<br />
+ \text{log}(\frac{\pi_1}{\pi_0}) = 0 </math><br />
<br />
<br />
We can see that the first pair of terms is constant, and the second pair is linear in x.<br />
Therefore, we end up with something of the form <br />
<math>ax + b = 0</math>.<br />
For more about LDA <ref>http://sites.stat.psu.edu/~jiali/course/stat597e/notes2/lda.pdf</ref><br />
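The boundary coefficients can be read off the final expression: collecting terms gives <math>a = \Sigma^{-1}(\mathbf{\mu_1}-\mathbf{\mu_0})</math> together with a scalar offset. A minimal numerical sketch (the means, covariance, and priors below are made-up illustration values):<br />

```python
import numpy as np

def lda_boundary(mu0, mu1, Sigma, pi0, pi1):
    """Return (a, b) such that the LDA decision boundary is a^T x + b = 0."""
    Sinv = np.linalg.inv(Sigma)
    a = Sinv @ (mu1 - mu0)
    b = -0.5 * (mu1 @ Sinv @ mu1 - mu0 @ Sinv @ mu0) + np.log(pi1 / pi0)
    return a, b

# Illustrative parameters (not from the notes)
mu0 = np.array([0.0, 0.0])
mu1 = np.array([2.0, 0.0])
Sigma = np.eye(2)
a, b = lda_boundary(mu0, mu1, Sigma, 0.5, 0.5)
# With identity covariance and equal priors, the boundary is the
# perpendicular bisector of the means: 2*x1 - 2 = 0, i.e. x1 = 1.
```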
<br />
== LDA and QDA Continued (Lecture: Sep. 22, 2011) == <br />
<br />
If we relax assumption 2 (i.e. <math>\Sigma_1 \neq \Sigma_0</math>), then we get a quadratic decision boundary that can be written as<br />
<math>\mathbf{x}^T a\mathbf{x}+\mathbf{b}^T\mathbf{x} + c = 0</math><br />
<br />
===Generalizing LDA and QDA===<br />
<br />
'''Theorem''': <br />
<br />
<br />
Suppose that <math>\,Y \in \{1,\dots,K\}</math> and that each class-conditional density <math>\,f_k(\mathbf{x}) = Pr(X=\mathbf{x}|Y=k)</math> is Gaussian. Then the Bayes classifier is<br />
:<math>\,h^*(\mathbf{x}) = \arg\max_{k} \delta_k(\mathbf{x})</math><br />
<br />
Where<br />
<br />
<math> \,\delta_k(\mathbf{x}) = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_k)^\top\Sigma_k^{-1}(\mathbf{x}-\boldsymbol{\mu}_k) + log (\pi_k) </math><br />
<br />
When the Gaussian covariances are equal (<math>\Sigma_k = \Sigma</math> for all <math>k</math>, i.e. LDA), this simplifies to<br />
<br />
<math> \,\delta_k(\mathbf{x}) = \mathbf{x}^\top\Sigma^{-1}\boldsymbol{\mu}_k - \frac{1}{2}\boldsymbol{\mu}_k^\top\Sigma^{-1}\boldsymbol{\mu}_k + log (\pi_k) </math><br />
<br />
(To compute this, we need to calculate the value of <math>\,\delta </math> for each class, and then take the one with the max. value).<br />
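The rule above can be sketched directly: compute <math>\,\delta_k</math> for each class and take the argmax. A minimal Python sketch, with made-up class parameters for illustration:<br />

```python
import numpy as np

def delta(x, mu, Sigma, pi):
    """delta_k(x) = -1/2 log|Sigma_k| - 1/2 (x-mu_k)^T Sigma_k^{-1} (x-mu_k) + log pi_k."""
    diff = x - mu
    return (-0.5 * np.log(np.linalg.det(Sigma))
            - 0.5 * diff @ np.linalg.inv(Sigma) @ diff
            + np.log(pi))

def classify(x, mus, Sigmas, pis):
    """Pick the class with the largest delta_k(x)."""
    scores = [delta(x, m, S, p) for m, S, p in zip(mus, Sigmas, pis)]
    return int(np.argmax(scores))

# Illustrative parameters: unequal covariances, so this is the QDA case.
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 2.0 * np.eye(2)]
pis = [0.5, 0.5]
print(classify(np.array([0.5, 0.2]), mus, Sigmas, pis))  # near mu_0 -> class 0
```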
<br />
===In practice===<br />
We estimate the prior as the proportion of the sample belonging to class k, i.e.<br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
the mean as the average of the data points in class k, i.e.<br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
and the covariance of each class as<br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
If we wish to use LDA, we must calculate a common covariance, so we take a weighted average of the class covariances, i.e.<br />
<br />
<math>\,\Sigma=\frac{\sum_{r=1}^{k}(n_r\Sigma_r)}{\sum_{r=1}^{k}n_r} </math><br />
<br />
Where: <math>\,n_r</math> is the number of data points in class <math>\,r</math>, <math>\,\Sigma_r</math> is the covariance of class <math>\,r</math>, <math>\,n</math> is the total number of data points, and <math>\,k</math> is the number of classes.<br />
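These plug-in estimates can be sketched in Python as follows; the labeled toy data set is randomly generated for illustration:<br />

```python
import numpy as np

def estimate_parameters(X, y):
    """X: n x d data matrix, y: length-n integer labels.
    Returns priors, class means, class covariances, and the pooled covariance."""
    n = len(y)
    classes = sorted(set(y))
    pis, mus, Sigmas, ns = [], [], [], []
    for k in classes:
        Xk = X[y == k]
        nk = len(Xk)
        mu = Xk.mean(axis=0)
        # MLE covariance (divide by n_k, as in the notes)
        Sigma = (Xk - mu).T @ (Xk - mu) / nk
        pis.append(nk / n); mus.append(mu); Sigmas.append(Sigma); ns.append(nk)
    # pooled covariance for LDA: weighted average over classes
    pooled = sum(nk * S for nk, S in zip(ns, Sigmas)) / sum(ns)
    return pis, mus, Sigmas, pooled

# Toy data: 50 points near (0,0) and 30 points near (3,3)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (30, 2))])
y = np.array([0] * 50 + [1] * 30)
pis, mus, Sigmas, pooled = estimate_parameters(X, y)
```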
<br />
===Computation===<br />
<br />
For QDA we need to calculate: <math> \,\delta_k(\mathbf{x}) = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_k)^\top\Sigma_k^{-1}(\mathbf{x}-\boldsymbol{\mu}_k) + log (\pi_k) </math><br />
<br />
Let's first consider the case where <math>\, \Sigma_k = I, \forall k </math>, i.e. each class distribution is spherical around its mean.<br />
<br />
====Case 1====<br />
When <math>\, \Sigma_k = I </math><br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_k)^\top I(\mathbf{x}-\boldsymbol{\mu}_k) + log (\pi_k) </math><br />
<br />
but <math>\ \log(|I|)=\log(1)=0 </math><br />
<br />
and <math>\, (\mathbf{x}-\boldsymbol{\mu}_k)^\top I(\mathbf{x}-\boldsymbol{\mu}_k) = (\mathbf{x}-\boldsymbol{\mu}_k)^\top(\mathbf{x}-\boldsymbol{\mu}_k) </math> is the [http://en.wikipedia.org/wiki/Euclidean_distance#Squared_Euclidean_Distance squared Euclidean distance] between two points <math>\,\mathbf{x}</math> and <math>\,\boldsymbol{\mu}_k</math><br />
<br />
Thus, under this assumption, a new point can be classified by its distance from the center of each class, adjusted by the log prior.<br />
<br />
Further, for a two-class problem with equal priors, the decision boundary is the perpendicular bisector of the segment joining the two class means.<br />
<br />
====Case 2==== <br />
When <math>\, \Sigma_k \neq I </math><br />
<br />
<br />
Using the [[Singular Value Decomposition(SVD) | Singular Value Decomposition (SVD)]] of <math>\, \Sigma_k</math><br />
we get <math> \, \Sigma_k = U_kS_kV_k^\top</math>. In particular, <math>\, U_k</math> is a collection of eigenvectors of <math>\, \Sigma_k\Sigma_k^*</math>, and <math>\, V_k</math> is a collection of eigenvectors of <math>\,\Sigma_k^*\Sigma_k</math>.<br />
Since <math>\, \Sigma_k</math> is a symmetric matrix<ref> http://en.wikipedia.org/wiki/Covariance_matrix#Properties </ref>, <math>\, \Sigma_k = \Sigma_k^*</math>, so we have <math> \, \Sigma_k = U_kS_kU_k^\top </math>.<br />
<br />
For <math>\,\delta_k</math>, the second term becomes what is also known as the Mahalanobis distance <ref>P. C. Mahalanobis, "On The Generalised Distance in Statistics," ''Proceedings of the National Institute of Sciences of India'', 1936</ref> :<br />
<br />
:<math>\begin{align}<br />
(\mathbf{x}-\boldsymbol{\mu}_k)^\top\Sigma_k^{-1}(\mathbf{x}-\boldsymbol{\mu}_k)&= (\mathbf{x}-\boldsymbol{\mu}_k)^\top U_kS_k^{-1}U_k^T(\mathbf{x}-\boldsymbol{\mu}_k)\\<br />
& = (U_k^\top \mathbf{x}-U_k^\top\boldsymbol{\mu}_k)^\top S_k^{-1}(U_k^\top \mathbf{x}-U_k^\top \boldsymbol{\mu}_k)\\<br />
& = (U_k^\top \mathbf{x}-U_k^\top\boldsymbol{\mu}_k)^\top S_k^{-\frac{1}{2}}S_k^{-\frac{1}{2}}(U_k^\top \mathbf{x}-U_k^\top\boldsymbol{\mu}_k) \\<br />
& = (S_k^{-\frac{1}{2}}U_k^\top \mathbf{x}-S_k^{-\frac{1}{2}}U_k^\top\boldsymbol{\mu}_k)^\top I(S_k^{-\frac{1}{2}}U_k^\top \mathbf{x}-S_k^{-\frac{1}{2}}U_k^\top \boldsymbol{\mu}_k) \\<br />
& = (S_k^{-\frac{1}{2}}U_k^\top \mathbf{x}-S_k^{-\frac{1}{2}}U_k^\top\boldsymbol{\mu}_k)^\top(S_k^{-\frac{1}{2}}U_k^\top \mathbf{x}-S_k^{-\frac{1}{2}}U_k^\top \boldsymbol{\mu}_k) \\<br />
\end{align}<br />
</math><br />
<br />
We can think of <math> \, S_k^{-\frac{1}{2}}U_k^\top </math> as a linear transformation that takes the points in class <math>\,k</math> and distributes them spherically around a point, as in case 1. Thus, when we are given a new point, we can apply the modified <math>\,\delta_k</math> values to calculate <math>\ h^*(x)</math>. After applying this transformation, the covariance becomes the identity matrix, so that<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}[(S_k^{-\frac{1}{2}}U_k^\top \mathbf{x}-S_k^{-\frac{1}{2}}U_k^\top\boldsymbol{\mu}_k)^\top(S_k^{-\frac{1}{2}}U_k^\top \mathbf{x}-S_k^{-\frac{1}{2}}U_k^\top \boldsymbol{\mu}_k)] + log (\pi_k) </math><br />
<br />
and,<br />
<br />
<math>\ \log(|I|)=\log(1)=0 </math><br />
<br />
For applying the above method with classes that have different covariance matrices (for example the covariance matrices <math>\ \Sigma_0 </math> and <math>\ \Sigma_1 </math> for the two class case), each of the covariance matrices has to be decomposed using SVD to find the according transformation. Then, each new data point has to be transformed using each transformation to compare its distance to the mean of each class (for example for the two class case, the new data point would have to be transformed by the class 1 transformation and then compared to <math>\ \mu_0 </math> and the new data point would also have to be transformed by the class 2 transformation and then compared to <math>\ \mu_1 </math>).<br />
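The chain of equalities above can be checked numerically: the Mahalanobis distance equals the squared Euclidean distance after the whitening transformation <math>S_k^{-\frac{1}{2}}U_k^\top</math>. A sketch with a made-up covariance matrix (for a symmetric positive-definite <math>\, \Sigma_k</math>, the eigendecomposition serves as the SVD, with <math>U_k = V_k</math>):<br />

```python
import numpy as np

# Illustrative covariance, mean, and query point (not from the notes)
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
mu = np.array([1.0, -1.0])
x = np.array([2.0, 0.5])

# Symmetric Sigma: eigendecomposition plays the role of the SVD.
eigvals, U = np.linalg.eigh(Sigma)
whiten = np.diag(eigvals ** -0.5) @ U.T      # S^{-1/2} U^T

direct = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)   # Mahalanobis distance
transformed = np.sum((whiten @ (x - mu)) ** 2)        # Euclidean after whitening
print(abs(direct - transformed) < 1e-10)  # True
```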
<br />
<br />
The difference between [[#Case 1 | Case 1]] and [[#Case 2 | Case 2]] (i.e. the difference between using the Euclidean and Mahalanobis distance) can be seen in the illustration below. <br />
<br />
[[File:EuclideanVsMahalonobisDistance2.PNG|frame|center|Illustration of Euclidean distance (a) and Mahalanobis distance (b) where the contours represent equidistant points from the center using each distance metric. Source: <ref>R. De Maesschalck, D. Jouan-Rimbaud and D. L. Massart, "Tutorial - The Mahalanobis distance," ''Chemometrics and Intelligent Laboratory Systems'', 2000 </ref>]]<br />
<br />
As can be seen from the illustration above, the Mahalanobis distance takes into account the distribution of the data points, whereas the Euclidean distance would treat the data as though it has a spherical distribution. Thus, the Mahalanobis distance applies for the more general classification in [[#Case 2 | Case 2]], whereas the Euclidean distance applies to the special case in [[#Case 1 | Case 1]] where the data distribution is assumed to be spherical.<br />
<br />
Generally, QDA provides a more flexible classifier than LDA because LDA assumes that the covariance matrix is identical for each class, while QDA does not. However, QDA still assumes a Gaussian class-conditional distribution; in practice the data need not be Gaussian, in which case other distributional models must be used instead.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate some parameters. Here is a comparison between the number of parameters needed to be estimated for LDA and QDA:<br />
<br />
LDA: Since we just need to compare one given class against each of the remaining <math>\,K-1</math> classes, there are <math>\,K-1</math> differences in total. For each of them, <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters, so QDA suffers far more from the curse of dimensionality.<br />
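These counts can be written as a quick check (pure arithmetic, following the formulas above):<br />

```python
def lda_params(d, K):
    # (K-1) boundaries, each with d coefficients plus one intercept
    return (K - 1) * (d + 1)

def qda_params(d, K):
    # (K-1) boundaries, each with a symmetric quadratic form (d(d+1)/2),
    # a linear term (d), and a constant (1): d(d+3)/2 + 1 in total
    return (K - 1) * (d * (d + 3) // 2 + 1)

# e.g. d = 10, K = 2: LDA needs 11 parameters, QDA needs 66.
print(lda_params(10, 2), qda_params(10, 2))
```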
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
<br />
== Trick: Using LDA to do QDA ==<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
In this approach, the feature vector is augmented with quadratic terms (i.e. new dimensions are introduced, onto which the original data are projected). We then apply LDA to the new, higher-dimensional data. <br />
<br />
The motivation behind this approach is to take advantage of the fact that fewer parameters have to be calculated in LDA , as explained in previous sections, and therefore have a more robust system in situations where we have fewer data points.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; the extra <math>\,\frac{d(d+1)}{2}</math> estimated parameters make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we have a quadratic function to estimate: <math>g(\mathbf{x}) = y = \mathbf{x}^T\mathbf{v}\mathbf{x} + \mathbf{w}^T\mathbf{x}</math>.<br />
<br />
Using this trick, we introduce two new vectors, <math>\,\hat{\mathbf{w}}</math> and <math>\,\hat{\mathbf{x}}</math> such that:<br />
<br />
<math>\hat{\mathbf{w}} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]^T</math><br />
<br />
and<br />
<br />
<math>\hat{\mathbf{x}} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]^T</math><br />
<br />
We can then apply LDA to estimate the new function: <math>\hat{g}(\mathbf{x},\mathbf{x}^2) = \hat{y} =\hat{\mathbf{w}}^T\hat{\mathbf{x}}</math>.<br />
<br />
Note that we can do this for any <math>\, x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix containing the original entries squared, to a cubic dimension with the entries cubed, or even with a different function altogether, such as a <math>\,sin(x)</math> dimension. Note that we are not applying QDA; instead we extend LDA to compute a non-linear boundary, which will in general differ from the QDA boundary. This algorithm is called nonlinear LDA.<br />
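The augmentation itself is a one-liner; a sketch in Python (the data values are arbitrary):<br />

```python
import numpy as np

def augment(X):
    """X: n x d -> n x 2d, columns [x_1, ..., x_d, x_1^2, ..., x_d^2]."""
    return np.hstack([X, X ** 2])

X = np.array([[1.0, 2.0], [3.0, -1.0]])
X_aug = augment(X)
# A linear rule w_hat^T x_hat in the augmented space evaluates
# w^T x + v^T (x^2), a quadratic function of the original x.
```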
<br />
== Principal Component Analysis (PCA) (Lecture: Sep. 27, 2011) ==<br />
<br />
'''Principal Component Analysis (PCA)''' is a method of dimensionality reduction/feature extraction that transforms the data from a D-dimensional space into a new coordinate system of dimension d, where d &le; D (in the worst case, d = D). The goal is to preserve as much of the variance in the original data as possible when switching coordinate systems. Given data on D variables, the hope is that the data points lie mainly in a linear subspace of dimension lower than D. In practice, the data will usually not lie precisely in some lower-dimensional subspace.<br />
<br />
<br />
The new variables that form a new coordinate system are called '''principal components''' (PCs). PCs are denoted by <math>\ \mathbf{u}_1, \mathbf{u}_2, ... , \mathbf{u}_D </math>. The principal components form a basis for the data. Since the PCs are orthogonal linear transformations of the original variables, there are at most D PCs. Normally, not all D PCs are used; rather, a subset of d PCs, <math>\ \mathbf{u}_1, \mathbf{u}_2, ... , \mathbf{u}_d </math>, is used to approximate the space spanned by the original data points <math>\ \mathbf{x}=[x_1, x_2, ... , x_D]^T </math>. We can choose d based on what percentage of the variance of the original data we would like to maintain. <br />
<br />
The first PC, <math>\ \mathbf{u}_1 </math>, is called the '''first principal component''' and has the maximum variance, thus accounting for the most significant variation in the data. The second PC, <math>\ \mathbf{u}_2 </math>, is called the '''second principal component''' and has the second-highest variance, and so on, down to <math>\ \mathbf{u}_D </math>, which has the minimum variance.<br />
<br />
Let <math>u_i = \mathbf{w}^T\mathbf{x_i}</math> be the projection of the data point <math>\mathbf{x_i}</math> on the direction of '''w''' if '''w''' is of length one.<br />
<br />
<br />
<math>\mathbf{u = (u_1,....,u_D)^T}\qquad</math> , <math>\quad\mathbf{w^Tw = 1 }</math><br />
<br />
<br />
<math>var(u) =\mathbf{w}^T X (\mathbf{w}^T X)^T = \mathbf{w}^T X X^T\mathbf{w} = \mathbf{w}^TS\mathbf{w} \quad </math> <br />
where <math>\quad S = X X^T </math> is the sample covariance matrix (assuming the columns of <math>X</math> have been centered).<br />
<br />
<br />
<br />
We would like to find the <math>\ \mathbf{w} </math> which gives us maximum variation:<br />
<br />
<math>\ \max (Var(\mathbf{w}^T \mathbf{x})) = \max (\mathbf{w}^T S \mathbf{w}) </math> <br />
<br />
<br />
Note: we require the constraint <math>\ \mathbf{w}^T \mathbf{w} = 1 </math> because if there is no constraint on the length of <math>\ \mathbf{w} </math> then there is no upper bound. With the constraint, the direction and not the length that maximizes the variance can be found. <br />
<br />
<br />
====Lagrange Multiplier====<br />
<br />
Before we proceed, we should review Lagrange multipliers.<br />
<br />
[[Image:LagrangeMultipliers2D.svg.png|right|thumb|200px|"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]]]<br />
<br />
<br />
Lagrange multipliers are used to find the maximum or minimum of a function <math>\displaystyle f(x,y)</math> subject to constraint <math>\displaystyle g(x,y)=0</math> <br />
<br />
we define a new constant <math> \lambda</math> called a [http://en.wikipedia.org/wiki/Lagrange_multipliers Lagrange Multiplier] and we form the Lagrangian,<br /><br /><br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br /><br /><br />
If <math>\displaystyle f(x^*,y^*)</math> is the max of <math>\displaystyle f(x,y)</math>, there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of <math>\displaystyle L</math> (partial derivatives are 0).<br />
<br>In addition <math>\displaystyle (x^*,y^*)</math> is a point in which functions <math>\displaystyle f</math> and <math>\displaystyle g</math> touch but do not cross. At this point, the tangents of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel or gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel, such that:<br />
<br /><br /><br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where,<br /> <math>\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow</math> the gradient of <math>\, f</math> <br />
<br><br />
<math>\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow</math> the gradient of <math>\, g </math> <br />
<br><br /><br />
<br />
====Example :====<br />
Suppose we want to maximize the function <math>\displaystyle f(x,y)=x-y</math> subject to the constraint <math>\displaystyle x^{2}+y^{2}=1</math>. We can apply the Lagrange multiplier method to find the maximum value for the function <math>\displaystyle f </math>; the Lagrangian is:<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math><br />
<br />
We want the partial derivatives equal to zero:<br />
<br />
<br /><br />
<math>\displaystyle \frac{\partial L}{\partial x}=1+2 \lambda x=0 </math> <br /><br />
<br /> <math>\displaystyle \frac{\partial L}{\partial y}=-1+2\lambda y=0</math><br />
<br> <br /><br />
<math>\displaystyle \frac{\partial L}{\partial \lambda}=x^2+y^2-1</math><br />
<br><br /><br />
<br />
Solving the system, we obtain two stationary points: <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math> and <math>\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)</math>. To determine which one is the maximum, we substitute each into <math>\displaystyle f(x,y)</math> and see which gives the larger value. In this case, the maximum is attained at <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>.<br />
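This stationary point can be verified numerically (a quick check, not part of the notes):<br />

```python
import math

# Candidate maximizer of f(x,y) = x - y subject to x^2 + y^2 = 1
x, y = math.sqrt(2) / 2, -math.sqrt(2) / 2
lam = -1 / (2 * x)          # from dL/dx = 1 + 2*lambda*x = 0

dL_dx = 1 + 2 * lam * x      # should vanish
dL_dy = -1 + 2 * lam * y     # should vanish
constraint = x ** 2 + y ** 2 - 1  # should vanish

print(dL_dx, dL_dy, constraint)  # all approximately 0
```

At this point <math>f(x,y) = x - y = \sqrt{2}</math>, the constrained maximum.<br />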
<br />
===Determining w :===<br />
<br />
Use the Lagrange multiplier conversion to obtain:<br />
<math>\displaystyle L(\mathbf{w}, \lambda) = \mathbf{w}^T S\mathbf{w} - \lambda (\mathbf{w}^T \mathbf{w} - 1)</math> where <math>\displaystyle \lambda </math> is a constant <br />
<br />
Take the derivative and set it to zero:<br />
<math>\displaystyle{\partial L \over{\partial \mathbf{w}}} = 0 </math><br />
<br />
<br />
To obtain: <br />
<math>\displaystyle 2S\mathbf{w} - 2 \lambda \mathbf{w} = 0</math><br />
<br />
<br />
Rearrange to obtain:<br />
<math>\displaystyle S\mathbf{w} = \lambda \mathbf{w}</math><br />
<br />
<br />
where <math>\displaystyle \mathbf{w}</math> is an eigenvector of <math>\displaystyle S </math> and <math>\ \lambda </math> is the corresponding eigenvalue, since <math>\displaystyle S\mathbf{w}= \lambda \mathbf{w} </math>. Using <math>\displaystyle \mathbf{w}^T \mathbf{w}=1</math>, we can then write<br />
<br />
<math>\displaystyle \mathbf{w}^T S\mathbf{w}= \mathbf{w}^T\lambda \mathbf{w}= \lambda \mathbf{w}^T \mathbf{w} =\lambda </math> <br />
<br />
Note that the PCs decompose the total variance in the data in the following way :<br />
<br />
<math> \sum_{i=1}^{D} Var(u_i) </math><br />
<br />
<math>= \sum_{i=1}^{D} (\lambda_i) </math> <br />
<br />
<math>\ = Tr(S) </math> (since <math>S</math> is a covariance matrix it is symmetric, and the sum of its eigenvalues equals its trace)<br />
<br />
<math>= \sum_{i=1}^{D} Var(x_i)</math><br />
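This decomposition can be verified numerically on a randomly generated sample (illustration only):<br />

```python
import numpy as np

# Toy data: 5 features, 200 samples; center each feature
rng = np.random.default_rng(1)
X = rng.normal(size=(5, 200))
X = X - X.mean(axis=1, keepdims=True)
S = X @ X.T / (X.shape[1] - 1)           # sample covariance matrix

eigvals, W = np.linalg.eigh(S)           # eigenvalues in ascending order
w = W[:, -1]                             # eigenvector of the largest eigenvalue

# w^T S w equals lambda_max, and the eigenvalues sum to Tr(S)
print(np.isclose(w @ S @ w, eigvals[-1]))
print(np.isclose(eigvals.sum(), np.trace(S)))
```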
<br />
== Principal Component Analysis (PCA) Continued (Lecture: Sep. 29, 2011) == <br />
As can be seen from the above expressions, <math>\ Var(\mathbf{w}^\top \mathbf{x}) = \mathbf{w}^\top S \mathbf{w}= \lambda </math> where <math>\lambda</math> is an eigenvalue of the sample covariance matrix <math>\ S </math> and <math>\ \mathbf{w}</math> is its corresponding eigenvector. So <math>\ Var(u_i) </math> is maximized if <math>\ \lambda_i </math> is the maximum eigenvalue of <math>\ S </math>, and the first principal component (PC) is the corresponding eigenvector. Each successive PC can be generated in the above manner by taking the eigenvectors of <math>\ S</math><ref>www.wikipedia.org/wiki/Eigenvalues_and_eigenvectors</ref> that correspond to the eigenvalues:<br />
<br />
<math>\ \lambda_1 \geq ... \geq \lambda_D </math> <br />
<br />
such that <br />
<br />
<math>\ Var(u_1) \geq ... \geq Var(u_D) </math><br />
<br />
=== Alternative Derivation ===<br />
Another way of looking at PCA is to consider it as a projection from a higher D-dimensional space to a lower d-dimensional subspace that minimizes the squared ''reconstruction error''. The squared reconstruction error is the difference between the original data set <math>\ X </math> and the new data set <math> \hat{X} </math> obtained by first projecting the original data into a lower d-dimensional subspace and then projecting it back into the original higher D-dimensional space. Since information is (normally) lost by compressing the original data into a lower d-dimensional subspace, the new data set will (normally) differ from the original data even though both lie in the higher D-dimensional space. The reconstruction error is computed as shown below.<br />
<br />
====Reconstruction Error====<br />
<br />
<math> e = \sum_{i=1}^{n} || x_i - \hat{x}_i ||^2 </math><br />
<br />
====Minimize Reconstruction Error====<br />
<br />
Suppose the data are centered, i.e. <math> \bar{x} = 0 </math>; otherwise, replace each <math> x_i </math> with <math> x_i - \bar{x} </math>.<br />
<br />
Let <math>\ f(y) = U_d y </math> where <math>\ U_d </math> is a D by d matrix with d orthogonal unit vectors as columns.<br />
<br />
Fit the model to the data and minimize the reconstruction error:<br />
<br />
<math>\ min_{U_d, y_i} \sum_{i=1}^n || x_i - U_d y_i ||^2 </math><br />
<br />
Differentiate with respect to <math>\ y_i </math>:<br />
<br />
<math> \frac{\partial e}{\partial y_i} = 0 </math><br />
<br />
we can rewrite the reconstruction error as: <math>\ e = \sum_{i=1}^n(x_i - U_d y_i)^T(x_i - U_d y_i) </math><br />
<br />
<math>\ \frac{\partial e}{\partial y_i} = -2U_d^T(x_i - U_d y_i) = 0 </math><br />
<br />
Since the columns of <math>\ U_d </math> are orthonormal, <math>\ U_d^T U_d = I </math>, and the condition becomes<br />
<br />
<math>\ U_d^T x_i - y_i = 0 </math> or equivalently,<br />
<br />
<math>\ y_i = U_d^T x_i </math><br />
<br />
Find the orthogonal matrix <math>\ U_d </math>:<br />
<br />
<math>\ min_{U_d} \sum_{i=1}^n || x_i - U_d U_d^T x_i||^2 </math><br />
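The result <math>\ y_i = U_d^T x_i </math> and the minimizing choice of <math>\ U_d </math> can be sketched numerically; the data below are randomly generated for illustration:<br />

```python
import numpy as np

# Toy data: D = 4 features, n = 50 points, centered
rng = np.random.default_rng(2)
X = rng.normal(size=(4, 50))
X = X - X.mean(axis=1, keepdims=True)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_d = U[:, :2]                             # keep d = 2 directions

Y = U_d.T @ X                              # y_i = U_d^T x_i
X_hat = U_d @ Y                            # projection back to D dimensions

err = np.sum((X - X_hat) ** 2)
# The SVD choice of U_d minimizes this error; the residual equals the sum
# of the squared discarded singular values.
```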
<br />
====PCA Implementation Using Singular Value Decomposition====<br />
<br />
A unique solution can be obtained by finding the [[Singular Value Decomposition(SVD) | Singular Value Decomposition (SVD)]] of <math>\ X </math>:<br />
<br />
<math>\ X = U S V^T </math><br />
<br />
For each rank d, <math>\ U_d </math> consists of the first d columns of <math>\ U </math>. Also, the covariance matrix can be expressed as follows <math>\ S = \frac{1}{n-1}\sum_{i=1}^n (x_i - \mu)(x_i - \mu)^T </math>.<br />
<br />
Simply put, by subtracting the mean of each of the data point features and then applying SVD, one can find the principal components:<br />
<br />
<math> \tilde{X} = X - \mu </math><br />
<br />
<math>\ \tilde{X} = U S V^T </math><br />
<br />
where <math>\ X </math> is a d by n matrix of data points, with the features of each data point forming a column of <math>\ X </math>. Also, <math>\ \mu </math> is a d by n matrix with identical columns, each equal to the mean of the <math>\ x_i</math>'s, i.e. <math>\mu_{:,j}=\frac{1}{n}\sum_{i=1}^n x_i </math>. Note that this arrangement of data points is a convention; in Matlab and in conventional statistics, the transpose of the matrices in the above formulae is used.<br />
<br />
As the <math>\ S </math> matrix from the SVD has the eigenvalues arranged from largest to smallest, the corresponding eigenvectors in the <math>\ U </math> matrix from the SVD will be such that the first column of <math>\ U </math> is the first principal component and the second column is the second principal component and so on.<br />
<br />
=== Examples ===<br />
<br />
Note that in the Matlab code in the examples below, the mean was not subtracted from the datapoints before performing SVD. This is what was shown in class. However, to properly perform PCA, the mean should be subtracted from the datapoints.<br />
<br />
==== Example 1 ====<br />
Consider a matrix of data points <math>\ X </math> with the dimensions 560 by 1965. 560 is the number of elements in each column; each column is a vector representation of a 20x28 grayscale pixel image of a face (see image below), and there are a total of 1965 different images of faces. Each image is corrupted by noise, but the noise can be removed by projecting the data back to the original space using as many dimensions as one likes (e.g., 2, 3, 4, or 5). The corresponding Matlab commands are shown below:<br />
[[File:FreyFaceExample.PNG|thumb|185px|An example of the face images used in [[#Example 1 | Example 1]] with noise removed. Source: <ref>S. Roweis (2011). ''Data for MATLAB.'' [Online]. Available: [http://cs.nyu.edu/~roweis/data.html http://cs.nyu.edu/~roweis/data.html.] |</ref>]]<br />
<pre style="align:left; width: 75%; padding: 2% 2%"><br />
>> % start with a 560 by 1965 matrix X that contains the data points<br />
>> load noisy.mat;<br />
>> <br />
>> % set the colors to grayscale <br />
>> colormap gray<br />
>> <br />
>> % show image in column 10 by reshaping column 10 into a 20 by 28 matrix<br />
>> imagesc(reshape(X(:,10),20,28)')<br />
>> <br />
>> % perform SVD; if the X matrix is full rank, we will obtain 560 PCs<br />
>> [U S V] = svd(X);<br />
>> <br />
>> % reconstruct X (project X back onto the original space) using only the first ten principal components<br />
>> X_hat = U(:,1:10)*U(:,1:10)'*X;<br />
>> <br />
>> % show image in column 10 of X_hat, which is still a 560 by 1965 matrix<br />
>> imagesc(reshape(X_hat(:,10),20,28)')<br />
</pre><br />
The noise is removed in the reconstructed image because the noise does not create a major variation in any single direction of the original data. Hence, the first ten PCs taken from the <math>\ U </math> matrix are not in the direction of the noise, and reconstructing the image using only these PCs removes the noise.<br />
<br />
==== Example 2 ====<br />
Consider a matrix of data points <math>\ X </math> with the dimensions 64 by 400. 64 is the number of elements in each column. Each column is a vector representation of an 8x8 grayscale pixel image of either a handwritten number ''2'' or a handwritten number ''3'' (see image below), and there are a total of 400 different images, where the first 200 images show a handwritten number ''2'' and the last 200 images show a handwritten number ''3''. <br />
[[File:Handwritten23.PNG|frame|center|An example of the handwritten number images used in [[#Example 2 | Example 2]]. Source: <ref>A. Ghodsi, "PCA" class notes for STAT841, Department of Statistics and Actuarial Science, University of Waterloo, 2011. </ref>]]<br />
<br />
The corresponding Matlab commands for performing PCA on the data points are shown below:<br />
<pre><br />
>> % start with a 64 by 400 matrix X that contains the data points<br />
>> load 2_3.mat;<br />
>> <br />
>> % set the colors to grayscale <br />
>> colormap gray<br />
>> <br />
>> % show image in column 2 by reshaping column 2 into an 8 by 8 matrix<br />
>> imagesc(reshape(X(:,2),8,8))<br />
>> <br />
>> % perform SVD; if the X matrix is full rank, we will obtain 64 PCs<br />
>> [U S V] = svd(X);<br />
>> <br />
>> % project data down onto the first two PCs<br />
>> Y = U(:,1:2)'*X;<br />
>> <br />
>> % show Y as an image (can see the change in the first PC at column 200,<br />
>> % when the handwritten number changes from 2 to 3)<br />
>> imagesc(Y)<br />
>> <br />
>> % perform PCA using the Matlab built-in function (do not use for assignment)<br />
>> % also note that due to the Matlab convention, the transpose of X is used<br />
>> [COEFF, Y] = princomp(X');<br />
>> <br />
>> % again, use the first two PCs<br />
>> Y = Y(:,1:2);<br />
>> <br />
>> % use plot digits to show the distribution of images on the first two PCs<br />
>> images = reshape(X, 8, 8, 400);<br />
>> plotdigits(images, Y, .1, 1);<br />
</pre><br />
Using the ''plotdigits'' function in Matlab clearly illustrates that the first PC captured the difference between the numbers ''2'' and ''3'', as they are projected onto different regions of the axis for the first PC. Also, the second PC captured the ''tilt'' of the handwritten numbers, as numbers tilted to the left or right were projected onto different regions of the axis for the second PC.<br />
<br />
==== Example 3 ====<br />
(Not discussed in class) In the news recently was a story that captures some of the ideas behind PCA. Over the past two years, Scott Golder and Michael Macy, researchers from Cornell University, collected 509 million Twitter messages from 2.4 million users in 84 different countries. The data they used were words collected at various times of day and they classified the data into two different categories: positive emotion words and negative emotion words. Then, they were able to study this new data to evaluate subjects' moods at different times of day, while the subjects were in different parts of the world. They found that the subjects generally exhibited positive emotions in the mornings and late evenings, and negative emotions mid-day. They were able to "project their data onto a smaller dimensional space" using PCA. Their paper, "Diurnal and Seasonal Mood Vary with Work, Sleep, and Daylength Across Diverse Cultures," is available in the journal Science.<ref>http://www.pcworld.com/article/240831/twitter_analysis_reveals_global_human_moodiness.html</ref>.<br />
<br />
Assumptions Underlying Principal Component Analysis can be found here<ref>http://support.sas.com/publishing/pubcat/chaps/55129.pdf</ref><br />
<br />
==== Example 4 ====<br />
(Not discussed in class) A somewhat well known learning rule in the field of neural networks called Oja's rule can be used to train networks of neurons to compute the principal component directions of data sets. <ref>A Simplified Neuron Model as a Principal Component Analyzer. Erkki Oja. 1982. Journal of Mathematical Biology. 15: 267-273</ref> This rule is formulated as follows<br />
<br />
<math>\,\Delta w = \eta yx -\eta y^2w </math><br />
<br />
where <math>\,\Delta w </math> is the neuron weight change, <math>\,\eta</math> is the learning rate, <math>\,y</math> is the neuron output given the current input, <math>\,x</math> is the current input and <math>\,w</math> is the current neuron weight. This learning rule shares some similarities with another method for calculating principal components: power iteration. The basic algorithm for power iteration (taken from wikipedia: <ref>Wikipedia. http://en.wikipedia.org/wiki/Principal_component_analysis#Computing_principal_components_iteratively</ref>) is shown below <br />
<br />
<br />
<math>\mathbf{p} =</math> a random vector<br />
do ''c'' times:<br />
<math>\mathbf{t} = 0</math> (a vector of length ''m'')<br />
for each row <math>\mathbf{x} \in \mathbf{X^T}</math><br />
<math>\mathbf{t} = \mathbf{t} + (\mathbf{x} \cdot \mathbf{p})\mathbf{x}</math><br />
<math>\mathbf{p} = \frac{\mathbf{t}}{|\mathbf{t}|}</math><br />
return <math>\mathbf{p}</math><br />
<br />
Comparing this with the neuron learning rule we can see that the term <math>\, \eta y x </math> is very similar to the <math>\,\mathbf{t}</math> update equation in the power iteration method, and identical if the neuron model is assumed to be linear (<math>\,y(x)=x\mathbf{p}</math>) and the learning rate is set to 1. Additionally, the <math>\, -\eta y^2w </math> term performs the normalization, the same function as the <math>\,\mathbf{p}</math> update equation in the power iteration method.<br />
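<br />
The pseudocode above can be traced in pure Python; this sketch (illustrative, with a fixed starting vector instead of a random one so the result is reproducible) recovers the dominant eigenvector of <math>\mathbf{X}\mathbf{X}^T</math> for a tiny centered data set:<br />
<br />
```python
import math

def power_iteration(rows, c=50):
    """Dominant eigenvector of X X^T, following the pseudocode above.
    `rows` are the rows x of X^T, i.e. the data points as row vectors."""
    m = len(rows[0])
    # fixed start instead of a random vector, so the sketch is reproducible
    p = [1.0] + [0.0] * (m - 1)
    for _ in range(c):
        t = [0.0] * m                       # t = 0 (a vector of length m)
        for x in rows:                      # t = t + (x . p) x
            xp = sum(xi * pi for xi, pi in zip(x, p))
            t = [ti + xp * xi for ti, xi in zip(t, x)]
        norm = math.sqrt(sum(ti * ti for ti in t))
        p = [ti / norm for ti in t]         # p = t / |t|
    return p

# Centered 2-D points lying exactly on the line x2 = 2*x1:
points = [[-1.0, -2.0], [0.0, 0.0], [1.0, 2.0]]
p = power_iteration(points)
print(p)  # converges to (1, 2)/sqrt(5), the first principal direction
```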
<br />
=== Observations ===<br />
Some observations about the PCA were brought up in class:<br />
<br />
* '''PCA''' assumes that data is on a ''linear subspace'' or close to a linear subspace. For non-linear dimensionality reduction, other techniques are used. Amongst the first proposed techniques for non-linear dimensionality reduction are '''Locally Linear Embedding (LLE)''' and '''Isomap'''. More recent techniques include '''Maximum Variance Unfolding (MVU)''' and '''t-Distributed Stochastic Neighbor Embedding (t-SNE)'''. '''Kernel PCAs''' may also be used, but they depend on the type of kernel used and generally do not work well in practice. (Kernels will be covered in more detail later in the course.)<br />
<br />
* Finding the number of PCs to use is not straightforward. It requires knowledge about the ''intrinsic dimensionality'' of the data. In practice, oftentimes a heuristic approach is adopted by looking at the eigenvalues ordered from largest to smallest. If there is a "dip" in the magnitude of the eigenvalues, the "dip" is used as a cut off point and only the large eigenvalues before the "dip" are used. Otherwise, it is possible to add up the eigenvalues from largest to smallest until a certain percentage value is reached. This percentage value represents the percentage of variance that is preserved when projecting onto the PCs corresponding to the eigenvalues that have been added together to achieve the percentage. <br />
<br />
* It is a good idea to normalize the variance of the data before applying PCA. This will avoid PCA finding PCs in certain directions due to the scaling of the data, rather than the real variance of the data.<br />
<br />
* PCA can be considered as an unsupervised approach, since the main direction of variation is not known beforehand, i.e. it is not completely certain which dimension the first PC will capture. The PCs found may not correspond to the desired labels for the data set. There are, however, alternate methods for performing supervised dimensionality reduction.<br />
<br />
* (Not in class) The traditional PCA method does not work well on data sets that lie on a non-linear manifold. A revised PCA method, called c-PCA, has been introduced to improve the stability and convergence of intrinsic dimension estimation. The approach first finds a minimal cover (a cover of a set X is a collection of sets whose union contains X as a subset<ref>http://en.wikipedia.org/wiki/Cover_(topology)</ref>) of the data set. Since set covering is an NP-hard problem, the approach only finds an approximation of minimal cover to reduce the complexity of the run time. In each subset of the minimal cover, it applies PCA and filters out the noise in the data. Finally the global intrinsic dimension can be determined from the variance results from all the subsets. The algorithm produces robust results.<ref>Mingyu Fan, Nannan Gu, Hong Qiao, Bo Zhang, Intrinsic dimension estimation of data by principal component analysis, 2010. Available: http://arxiv.org/abs/1002.2050</ref><br />
<br />
*(Not in class) While PCA finds the mathematically optimal projection (in the sense of minimizing the squared error), it is sensitive to outliers in the data, which produce exactly the kind of large errors PCA tries to avoid. It is therefore common practice to remove outliers before computing PCA. However, in some contexts, outliers can be difficult to identify. For example in data mining algorithms like correlation clustering, the assignment of points to clusters and outliers is not known beforehand. A recently proposed generalization of PCA based on a '''Weighted PCA''' increases robustness by assigning different weights to data objects based on their estimated relevancy.<ref>http://en.wikipedia.org/wiki/Principal_component_analysis</ref><br />
<br />
* (Not in class) Comparison between PCA and LDA: Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) are two commonly used techniques for data classification and dimensionality reduction. "Linear Discriminant Analysis easily handles the case where the within-class frequencies are unequal and their performance has been examined on randomly generated test data. This method maximizes the ratio of between-class variance to the within-class variance in any particular data set thereby guaranteeing maximal separability. ... The prime difference between LDA and PCA is that PCA does more of feature classification and LDA does data classification. In PCA, the shape and location of the original data sets changes when transformed to a different space whereas LDA doesn't change the location but only tries to provide more class separability and draw a decision region between the given classes. This method also helps to better understand the distribution of the feature data." <ref> Balakrishnama, S., Ganapathiraju, A. LINEAR DISCRIMINANT ANALYSIS - A BRIEF TUTORIAL. http://www.isip.piconepress.com/publications/reports/isip_internal/1998/linear_discrim_analysis/lda_theory.pdf </ref><br />
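<br />
The percentage-of-variance heuristic from the observations above can be sketched as a short pure-Python function (the function name, toy eigenvalues, and the 90% default are illustrative choices, not from the lecture):<br />
<br />
```python
def num_pcs_for_variance(eigenvalues, fraction=0.90):
    """Smallest number of leading eigenvalues whose sum reaches the
    desired fraction of the total variance. Assumes the eigenvalues
    are already sorted from largest to smallest."""
    total = sum(eigenvalues)
    running = 0.0
    for k, ev in enumerate(eigenvalues, start=1):
        running += ev
        if running / total >= fraction:
            return k
    return len(eigenvalues)

# Eigenvalues with a clear "dip" after the second one:
evs = [10.0, 5.0, 0.5, 0.3, 0.2]
print(num_pcs_for_variance(evs))  # 2, since 15/16 = 93.75% of the variance
```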
<br />
=== Summary ===<br />
The PCA algorithm can be summarized into the following steps:<br />
<br />
# '''Recover basis'''<br />
#: <math>\ \text{ Calculate } XX^T=\Sigma_{i=1}^{t}x_ix_{i}^{T} \text{ and let } U=\text{ eigenvectors of } XX^T \text{ corresponding to the largest } d \text{ eigenvalues.} </math><br />
# '''Encode training data'''<br />
#: <math>\ \text{Let } Y=U^TX \text{, where } Y \text{ is a } d \times t \text{ matrix of encodings of the original data.} </math><br />
# '''Reconstruct training data'''<br />
#: <math> \hat{X}=UY=UU^TX </math>.<br />
# '''Encode test example'''<br />
#: <math>\ y = U^Tx \text{ where } y \text{ is a } d\text{-dimensional encoding of } x </math>.<br />
# '''Reconstruct test example'''<br />
#: <math> \hat{x}=Uy=UU^Tx </math>.<br />
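<br />
The encode/reconstruct steps above can be traced with a toy example; here <math>\ U </math> is a known orthonormal direction chosen by hand (an illustrative assumption) rather than computed from eigenvectors, to keep the sketch self-contained:<br />
<br />
```python
import math

# Known orthonormal basis U (one principal direction; d = 2, reduced dim 1):
U = [1 / math.sqrt(5), 2 / math.sqrt(5)]

def encode(x):
    # y = U^T x : project the example onto the principal direction
    return sum(u * xi for u, xi in zip(U, x))

def reconstruct(y):
    # x_hat = U y = U U^T x : map the encoding back to the original space
    return [u * y for u in U]

x = [1.0, 2.0]            # a point lying exactly in the subspace spanned by U
y = encode(x)
x_hat = reconstruct(y)
print(y)      # sqrt(5), the 1-dimensional encoding
print(x_hat)  # [1.0, 2.0]: zero reconstruction error for in-subspace points
```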
<br />
=== Dual PCA ===<br />
<br />
Singular value decomposition allows us to formulate the principal components algorithm entirely in terms of dot products between data points and limit the direct dependence on the original dimensionality ''d''. Now assume that the dimensionality ''d'' of the ''d × n'' matrix of data X is large (i.e., ''d >> n''). In this case, the algorithm described in previous sections becomes impractical. We would prefer a run time that depends only on the number of training examples ''n'', or that at least has a reduced dependence on ''d''.<br />
Note that in the SVD factorization <math>\ X = U \Sigma V^T </math>, the eigenvectors in <math>\ U </math> corresponding to non-zero singular values in <math>\ \Sigma </math> (square roots of eigenvalues) are in a one-to-one correspondence with the eigenvectors in <math>\ V </math> .<br />
After performing dimensionality reduction on <math>\ U </math> and keeping only the first ''l'' eigenvectors, corresponding to the top ''l'' non-zero singular values in <math>\ \Sigma </math>, these eigenvectors will still be in a one-to-one correspondence with the first ''l'' eigenvectors in <math>\ V </math> : <br />
<br />
<math>\ X V = U \Sigma </math><br />
<br />
<math>\ \Sigma </math> is square and invertible, because its diagonal has non-zero entries. Thus, the following conversion between the top ''l'' eigenvectors can be derived:<br />
<br />
<math>\ U = X V \Sigma^{-1} </math><br />
<br />
Now replacing <math>\ U </math> with <math>\ X V \Sigma^{-1} </math> gives us the dual form of PCA.<br />
<br />
== Fisher Discriminant Analysis (FDA) (Lecture: Sep. 29, 2011 - Oct. 04, 2011) ==<br />
<br />
'''Fisher Discriminant Analysis (FDA)''' is sometimes called ''Fisher Linear Discriminant Analysis (FLDA)'' or just ''Linear Discriminant Analysis (LDA)''. This causes confusion with the [[#LDA | ''Linear Discriminant Analysis (LDA)'']] technique covered earlier in the course. The LDA technique covered earlier in the course has a normality assumption and is a boundary finding technique. The FDA technique outlined here is a supervised feature extraction technique. FDA differs from PCA as well because PCA does not use the class labels, <math>\ y_i</math>, of the data <math>\ (x_i,y_i)</math> while FDA organizes data into their ''classes'' by finding the direction of maximum separation between classes.<br />
<br />
<br />
=== PCA ===<br />
<br />
- Find a rank ''d'' subspace that minimizes the squared reconstruction error:<br />
<br />
<math> \min \sum_{i=1}^n \| x_i - \hat{x}_i \|^2</math><br />
<br />
where <math>\hat{x}_i </math> is the projection of the original data point <math>\ x_i </math> onto the subspace.<br />
<br />
<br />
One main drawback of the PCA technique is that the direction of greatest variation may not produce the classification we desire. For example, imagine if the [[#Example 2 | data set]] above had a lighting filter applied to a random subset of the images. Then the greatest variation would be the brightness and not the more important variations we wish to classify. As another example, if we imagine 2 cigar-like clusters in 2 dimensions, one cigar has <math>y = 1</math> and the other <math>y = -1</math>. The cigars are positioned in parallel and very closely together, such that the variance in the total data-set, ignoring the labels, is in the direction of the cigars. For classification, this would be a terrible projection, because all labels get evenly mixed and we destroy the useful information. A much more useful projection is orthogonal to the cigars, i.e. in the direction of least overall variance, which would perfectly separate the data-cases (obviously, we would still need to perform classification in this 1-D space). See figure below <ref>www.ics.uci.edu/~welling/classnotes/papers_class/Fisher-LDA.pdf</ref>. FDA circumvents this problem by using the labels, <math>\ y_i</math>, of the data <math>\ (x_i,y_i)</math> i.e. the FDA uses ''supervised learning''.<br />
The main difference between FDA and PCA is that, in PCA we are interested in transforming the data to a new coordinate system such that the greatest variance of data lies on the first coordinate, but in FDA, we project the data of each class onto a point in such a way that the resulting points are as far apart from each other as possible. The FDA goal is achieved by projecting data onto a suitably chosen line that minimizes the within class variance and maximizes the distance between the two classes, i.e. groups similar data together and spreads different data apart. This way, newly acquired data can be compared, after transformation, to these projections using some well-chosen metric.<br />
<br />
[[File:Classification.jpg | Two cigar distributions where the direction of greatest variance is not the most useful for classification]]<br />
<br />
We first consider the case of two classes. Denote the mean and covariance matrix of class <math>i=0,1</math> by <math>\mathbf{\mu}_i</math> and <math>\mathbf{\Sigma}_i</math> respectively. We transform the data so that it is projected into 1 dimension, i.e. a scalar value. To do this, we compute the inner product of our <math>d \times 1</math>-dimensional data, <math>\mathbf{x}</math>, with a to-be-determined <math>d \times 1</math>-dimensional vector <math>\mathbf{w}</math>. The new means and covariances of the transformed data are:<br />
<br />
::<math> \mu'_i:\rightarrow \mathbf{w}^{T}\mathbf{\mu}_i </math> <br/><br />
::<math> \Sigma'_i :\rightarrow \mathbf{w}^{T}\mathbf{\Sigma}_i \mathbf{w}</math><br />
<br />
The new means and variances are actually scalar values now, but we will use vector and matrix notation and arguments throughout the following derivation as the multi-class case is then just a simpler extension.<br />
<br />
===Goals of FDA===<br />
<br />
As will be shown in the objective function, the goal of FDA is to maximize the separation of the classes (between class variance) and minimize the scatter within each class (within class variance). That is, our ideal situation is that the individual classes are as far away from each other as possible and at the same time the data within each class are as close to each other as possible (collapsed to a single point in the most extreme case). An interesting note is that R. A. Fisher, after whom FDA is named, used the FDA technique for purposes of taxonomy, in particular for categorizing different species of iris flowers. <ref name="RAFisher">R. A. Fisher, "The Use of Multiple measurements in Taxonomic Problems," ''Annals of Eugenics'', 1936</ref>. It is very easy to visualize what is meant by within class variance (i.e. differences between the iris flowers of the same species) and between class variance (i.e. the differences between the iris flowers of different species) in that case.<br />
<br />
First, we need to reduce the dimensionality of the covariates to one dimension (in the two-class case) by projecting the data onto a line. That is, take the d-dimensional input value <math>\mathbf{x}</math> and project it to one dimension using <math>z=\mathbf{w}^T \mathbf{x}</math>, where <math>\mathbf{w}^T </math> is 1 by d and <math>\mathbf{x}</math> is d by 1.<br />
<br />
Goal: choose the vector <math>\mathbf{w}=[w_1,w_2,w_3,...,w_d]^T </math> that best separates the data; then we perform classification with the projected data <math>z</math> instead of the original data <math>\mathbf{x}</math>.<br />
<br />
<br />
<math>\hat{{\mu}_0}=\frac{1}{n_0}\sum_{i:y_i=0} x_i</math><br />
<br />
<math>\hat{{\mu}_1}=\frac{1}{n_1}\sum_{i:y_i=1} x_i</math><br />
<br />
<math>\mathbf{x}\rightarrow\mathbf{w}^{T}\mathbf{x}</math>. <br /><br />
<math>\mathbf{\mu}\rightarrow\mathbf{w}^{T}\mathbf{\mu}</math>.<br /><br />
<math>\mathbf{\Sigma}\rightarrow\mathbf{w}^{T}\mathbf{\Sigma}\mathbf{w}</math> <br /><br />
<br />
<br />
<br />
<br />
'''1)''' Our '''first''' goal is to minimize the individual classes' covariance. This will help to collapse the data together. <br />
We have two minimization problems<br />
<br />
::<math>\min_{\mathbf{w}} \mathbf{w}^{T} \mathbf{\Sigma}_0 \mathbf{w}</math> <br />
and <br />
::<math>\min_{\mathbf{w}} \mathbf{w}^{T} \mathbf{\Sigma}_1 \mathbf{w}</math>.<br />
<br />
But these can be combined:<br />
::<math> \min_{\mathbf{w}} \mathbf{w} ^{T}\mathbf{\Sigma}_0 \mathbf{w} + \mathbf{w}^{T} \mathbf{\Sigma}_1 \mathbf{w}</math> <br />
:: <math> = \min_{\mathbf{w}} \mathbf{w} ^{T}( \mathbf{\Sigma_0} + \mathbf{\Sigma_1} ) \mathbf{w}</math><br />
<br />
Define <math> \mathbf{S}_W =\mathbf{\Sigma_0} + \mathbf{\Sigma_1} </math>, called the ''within class variance matrix''. <br />
<br />
'''2)''' Our '''second''' goal is to move the minimized classes as far away from each other as possible. One way to accomplish this is to maximize the distances between the means of the transformed data i.e.<br />
<br />
<math> \max_{\mathbf{w}} |\mathbf{w}^{T}\mathbf{\mu}_0 - \mathbf{w}^{T}\mathbf{\mu}_1|^2 </math><br />
<br />
Simplifying:<br />
::<math> \max_{\mathbf{w}} \,(\mathbf{w}^{T}\mathbf{\mu}_0 - \mathbf{w}^{T}\mathbf{\mu}_1)^T (\mathbf{w}^{T}\mathbf{\mu}_0 - \mathbf{w}^{T}\mathbf{\mu}_1) </math> <br/><br />
::<math> = \max_{\mathbf{w}}\, (\mathbf{\mu}_0-\mathbf{\mu}_1)^{T}\mathbf{w} \mathbf{w}^{T} (\mathbf{\mu}_0-\mathbf{\mu}_1)</math> <br/><br />
::<math> = \max_{\mathbf{w}} \,\mathbf{w}^{T}(\mathbf{\mu}_0-\mathbf{\mu}_1)(\mathbf{\mu}_0-\mathbf{\mu}_1)^{T}\mathbf{w}</math><br />
<br />
Recall that <math> \mathbf{\mu}_i </math> are known. Denote<br />
<br />
::<math> \mathbf{S}_B = (\mathbf{\mu}_0-\mathbf{\mu}_1)(\mathbf{\mu}_0-\mathbf{\mu}_1)^{T}</math> <br />
<br />
This matrix, called the ''between class variance matrix'', is a rank 1 matrix, so an inverse does not exist. Altogether, we have two optimization problems we must solve simultaneously:<br />
<br />
::1) <math> \min_{\mathbf{w}} \mathbf{w}^{T} \mathbf{S_W} \mathbf{w} </math><br/><br />
::2) <math> \max_{\mathbf{w}} \mathbf{w}^{T} \mathbf{S_B} \mathbf{w} </math><br />
<br />
There are other metrics one can use to both minimize the data's variance and maximizes the distance between classes, and other goals we can try to accomplish (see metric learning, below...one day), but Fisher used this elegant method, hence his recognition in the name, and we will follow his method.<br />
<br />
We can combine the two optimization problems into one after noting that the negative of max is min:<br />
<br />
::<math> \max_{\mathbf{w}} \; \alpha \mathbf{w}^{T} \mathbf{S_B} \mathbf{w} - \mathbf{w}^{T} \mathbf{S_W} \mathbf{w} </math><br/><br />
<br />
The <math>\alpha</math> coefficient is a necessary scaling factor: if the scale of one of the terms is much larger than the other, the optimization problem will be dominated by the larger term. This means we have another unknown, <math>\alpha</math>, to solve for. Instead, we can circumvent the scaling problem by looking at the ratio of the quantities, the original solution Fisher proposed:<br />
<br />
::<math> \max_{\mathbf{w}} \frac{\mathbf{w}^{T} \mathbf{S_B} \mathbf{w}}{\mathbf{w}^{T} \mathbf{S_W} \mathbf{w}} </math><br />
<br />
This optimization problem can be shown<ref><br />
http://www.socher.org/uploads/Main/optimizationTutorial01.pdf<br />
</ref> to be equivalent to the following optimization problem:<br />
<br />
:: <math> \max_{\mathbf{w}} \mathbf{w}^{T} \mathbf{S_B} \mathbf{w}</math> <br /><br />
(objective function)<br />
<br />
subject to:<br />
<br />
:: <math> {\mathbf{w}^{T} \mathbf{S_W} \mathbf{w}} = 1 </math><br /><br />
(constraint)<br />
<br />
A heuristic understanding of this equivalence is that we have two degrees of freedom: direction and scalar. The scalar value is irrelevant to our discussion. Thus, we can set one of the values to be a constant. We can use Lagrange multipliers to solve this optimization problem:<br />
<br />
::<math>L( \mathbf{w}, \lambda) = \mathbf{w}^{T} \mathbf{S_B} \mathbf{w} - \lambda(\mathbf{w}^{T} \mathbf{S_W} \mathbf{w}-1)</math><br />
:: <math> \Rightarrow \frac{\partial L}{\partial \mathbf{w}} = 2 \mathbf{S}_B \mathbf{w} - 2\lambda \mathbf{S}_W\mathbf{w} </math><br />
<br />
Setting the partial derivative to 0 gives us a ''generalized eigenvalue problem'':<br />
<br />
::<math> \mathbf{S}_B \mathbf{w} = \lambda \mathbf{S}_W \mathbf{w} </math><br />
:: <math> \Rightarrow \mathbf{S}_W^{-1} \mathbf{S}_B \mathbf{w} = \lambda \mathbf{w} </math><br />
<br />
This is a generalized eigenvalue problem and <math>\ \mathbf{w} </math> can be computed as the eigenvector corresponds to the largest eigenvalue of <br />
:: <math> \mathbf{S}_W^{-1} \mathbf{S}_B </math><br />
<br />
It is very likely that <math> \mathbf{S}_W </math> has an inverse. If not, the pseudo-inverse<ref><br />
http://en.wikipedia.org/wiki/Generalized_inverse<br />
</ref><ref><br />
http://www.mathworks.com/help/techdoc/ref/pinv.html<br />
</ref> can be used. In Matlab the pseudo-inverse function is named ''pinv''. Thus, we should choose <math>\mathbf{w}</math> to equal the eigenvector of the largest eigenvalue as our projection vector. <br />
<br />
In fact we can simplify the above expression further in the case of two classes. Recall the definition of <math>\mathbf{S}_B = (\mathbf{\mu}_0-\mathbf{\mu}_1)(\mathbf{\mu}_0-\mathbf{\mu}_1)^{T}</math>. Substituting this into our expression:<br />
<br />
::<math> \mathbf{S}_W^{-1}(\mathbf{\mu}_0-\mathbf{\mu}_1)(\mathbf{\mu}_0-\mathbf{\mu}_1)^{T} \mathbf{w} = \lambda \mathbf{w} </math><br />
::<math> (\mathbf{S}_W^{-1}(\mathbf{\mu}_0-\mathbf{\mu}_1) ) ((\mathbf{\mu}_0-\mathbf{\mu}_1)^{T} \mathbf{w}) = \lambda \mathbf{w} </math><br />
<br />
This second term is a scalar value, let's denote it <math>\beta</math>. Then<br />
::<math> \mathbf{S}_W^{-1}(\mathbf{\mu}_0-\mathbf{\mu}_1) = \frac{\lambda}{\beta} \mathbf{w} </math><br />
::<math> \Rightarrow \, \mathbf{S}_W^{-1}(\mathbf{\mu}_0-\mathbf{\mu}_1) \propto \mathbf{w} </math><br />
<br /><br />
(this equation indicates the direction of the separation).<br />
All we are interested in is the direction of <math>\mathbf{w}</math>, so computing this expression suffices to find our projection vector. Note that this shortcut does not carry over to the multiclass case, where <math>\mathbf{w}</math> is a matrix rather than a vector.<br />
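<br />
The two-class direction <math> \mathbf{S}_W^{-1}(\mathbf{\mu}_0-\mathbf{\mu}_1) </math> can be computed directly; the pure-Python sketch below (with made-up class statistics and a hand-coded 2x2 inverse, for illustration only) shows this:<br />
<br />
```python
def fisher_direction(mu0, mu1, Sw):
    """w proportional to Sw^{-1} (mu0 - mu1), for 2-D data (2x2 Sw)."""
    (a, b), (c, d) = Sw
    det = a * d - b * c
    assert det != 0, "Sw must be invertible (otherwise use a pseudo-inverse)"
    inv = [[d / det, -b / det], [-c / det, a / det]]   # closed-form 2x2 inverse
    diff = [m0 - m1 for m0, m1 in zip(mu0, mu1)]       # mu0 - mu1
    return [inv[i][0] * diff[0] + inv[i][1] * diff[1] for i in range(2)]

# Illustrative class statistics (not from the lecture):
mu0, mu1 = [0.0, 0.0], [2.0, 2.0]
Sw = [[2.0, 0.0], [0.0, 1.0]]      # Sigma_0 + Sigma_1
w = fisher_direction(mu0, mu1, Sw)
print(w)  # [-1.0, -2.0]: larger within-class scatter shrinks that coordinate
```
<br />
New points can then be classified by thresholding the projection <math>\ z=\mathbf{w}^{T}\mathbf{x} </math>.<br />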
<br />
=== Extensions to Multiclass Case ===<br />
If we have <math>\ k</math> classes, we need <math>\ k-1</math> directions i.e. we need to project <math>\ k</math> 'points' onto a <math>\ k-1</math> dimensional hyperplane. What does this change in our above derivation? The most significant difference is that our projection vector, <math>\mathbf{w}</math>, is no longer a vector but instead is a matrix <math>\mathbf{W}</math>, where <math>\mathbf{W}</math> is a <math>d \times (k-1)</math> matrix if the data is d-dimensional. We transform the data as:<br />
<br />
::<math> \mathbf{x}' :\rightarrow \mathbf{W}^{T} \mathbf{x}</math><br />
so our new mean and covariances for class k are:<br />
::<math> \mathbf{\mu_k}' :\rightarrow \mathbf{W}^{T} \mathbf{\mu_k}</math><br />
::<math> \mathbf{\Sigma_k}' :\rightarrow \mathbf{W}^{T} \mathbf{\Sigma_k} \mathbf{W}</math><br />
<br />
What are our new optimization sub-problems? As before, we wish to minimize the within class variance. This can be formulated as:<br />
::<math>\min_{\mathbf{W}} \mathbf{W}^{T} \mathbf{\Sigma_1} \mathbf{W} + \dots + \mathbf{W}^{T} \mathbf{\Sigma_k} \mathbf{W} </math><br />
<br />
Again, denoting <math>\mathbf{S}_W = \mathbf{\Sigma_1} + \dots + \mathbf{\Sigma_k}</math>, we can simplify above expression:<br />
<br />
::<math>\min_{\mathbf{W}} \mathbf{W}^{T} \mathbf{S}_W \mathbf{W} </math><br />
<br />
Similarly, the second optimization problem is:<br />
<br />
::<math>\max_{\mathbf{W}} \mathbf{W}^{T} \mathbf{S}_B \mathbf{W} </math><br />
<br />
What is <math>\mathbf{S}_B</math> in this case? It can be shown that <math>\mathbf{S}_T = \mathbf{S}_B + \mathbf{S}_W </math> where <math> \mathbf{S}_T </math> is the covariance matrix of all the data. From this we can compute <math> \mathbf{S}_B </math>. <br />
<br />
Next, if we express <math> \mathbf{W} = ( \mathbf{w}_1 , \mathbf{w}_2 , \dots ,\mathbf{w}_{k-1} ) </math>, observe that, for <math> \mathbf{A} = \mathbf{S}_B , \mathbf{S}_W </math>: <br />
<br />
::<math> Tr(\mathbf{W}^{T} \mathbf{A} \mathbf{W}) = \mathbf{w}_1^{T} \mathbf{A} \mathbf{w}_1 + \dots + \mathbf{w}_{k-1}^{T} \mathbf{A} \mathbf{w}_{k-1} </math><br />
<br />
where <math>\ Tr()</math> is the trace of a matrix. Thus, following the same steps as in the two-class case, we have the new optimization problem:<br />
<br />
::<math> \max_{\mathbf{W}} \frac{ Tr(\mathbf{W}^{T} \mathbf{S}_B \mathbf{W}) }{Tr(\mathbf{W}^{T} \mathbf{S}_W \mathbf{W})} </math> <br />
<br />
As in the two-class case, this is equivalent to maximizing the numerator<br />
<br />
:: <math> \max_{\mathbf{W}} Tr( \mathbf{W}^{T} \mathbf{S_B} \mathbf{W}) </math><br />
<br />
subject to:<br />
<br />
:: <math> Tr( \mathbf{W}^{T} \mathbf{S_W} \mathbf{W}) = 1 </math><br />
<br />
The first <math>\ k-1</math> eigenvectors of <math> \mathbf{S}_W^{-1} \mathbf{S}_B </math> give the required <math>\ k-1</math> directions; this is why, for the k-class problem, we project the data onto <math>\ k-1</math> directions.<br />
<br />
Again, in order to solve the above optimization problem, we can use the Lagrange multiplier <ref><br />
http://en.wikipedia.org/wiki/Lagrange_multiplier </ref>:<br />
<br />
:: <math>\begin{align}L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - \Lambda\left\{ Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}] - 1 \right\}\end{align}</math>.<br />
<br />
where <math>\ \Lambda</math> is a d by d diagonal matrix.<br />
<br />
Then, differentiating with respect to <math>\mathbf{W}</math>:<br />
<br />
:: <math>\begin{align}\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - \Lambda (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}\end{align} = 0</math>.<br />
<br />
Thus:<br />
<br />
:: <math>\begin{align}\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{S}_{W}\mathbf{W}\end{align}</math><br />
<br />
:: <math>\begin{align}\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{W}\end{align}</math><br />
<br />
where, <math> \mathbf{\Lambda} =\begin{pmatrix}\lambda_{1} & & 0\\&\ddots&\\0 & &\lambda_{d}\end{pmatrix}</math><br />
<br />
The above equation is of the form of an eigenvalue problem. Thus, for the solution the k-1 eigenvectors corresponding to the k-1 largest eigenvalues should be chosen as the projection matrix, <math>\mathbf{W}</math>. In fact, there should only be k-1 eigenvectors corresponding to k-1 non-zero eigenvalues using the above equation.<br />
<br />
=== Summary ===<br />
FDA has two optimization problems:<br />
::1) <math> \min_{\mathbf{w}} \mathbf{w}^{T} \mathbf{S_W} \mathbf{w} </math><br/><br />
::2) <math> \max_{\mathbf{w}} \mathbf{w}^{T} \mathbf{S_B} \mathbf{w} </math> <br />
<br />
where <math>\mathbf{S}_W = \mathbf{\Sigma_1} + \dots + \mathbf{\Sigma_k}</math> is called the within class variance and <math>\ \mathbf{S}_B = \mathbf{S}_T - \mathbf{S}_W </math> is called the between class variance where <math>\mathbf{S}_T </math> is the variance of all the data together.<br />
<br />
Every column of <math> \mathbf{W} </math> is parallel to an eigenvector of <math> \mathbf{S}_W^{-1} \mathbf{S}_B </math>.<br />
<br />
The two optimization problems are combined as follows:<br />
::<math> \max_{\mathbf{w}} \frac{\mathbf{w}^{T} \mathbf{S_B} \mathbf{w}}{\mathbf{w}^{T} \mathbf{S_W} \mathbf{w}} </math><br />
<br />
Equivalently, we can impose a constraint as shown:<br />
::<math> \max_{\mathbf{w}} \mathbf{w}^{T} \mathbf{S_B} \mathbf{w}</math><br />
<br />
subject to:<br />
:: <math> \mathbf{w}^{T} \mathbf{S_W} \mathbf{w} = 1 </math><br />
<br />
Lagrange multipliers can be used and essentially the problem becomes an eigenvalue problem:<br />
<br />
::<math>\begin{align}\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w} = \lambda\mathbf{w}\end{align}</math><br />
<br />
And <math>\ \mathbf{w} </math> can be computed as the k-1 eigenvectors corresponding to the largest k-1 eigenvalues of <math> \mathbf{S}_W^{-1} \mathbf{S}_B </math>.<br />
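The eigenvalue computation above can be sketched numerically. The following is a minimal illustration in Python with NumPy (the course examples use Matlab; this translation and its toy Gaussian data are my own) that builds <math>\mathbf{S}_W</math> and <math>\mathbf{S}_B</math> for two classes and keeps the top eigenvector of <math> \mathbf{S}_W^{-1} \mathbf{S}_B </math>:<br />

```python
import numpy as np

# Two toy Gaussian classes in 2-D; with k = 2 classes, FDA keeps k-1 = 1 direction.
rng = np.random.default_rng(0)
X1 = rng.normal([0.0, 0.0], 0.5, size=(50, 2))   # class 1
X2 = rng.normal([3.0, 1.0], 0.5, size=(50, 2))   # class 2

# Within-class scatter S_W and between-class scatter S_B
m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
d = (m1 - m2).reshape(-1, 1)
S_B = d @ d.T

# Eigenvalue problem S_W^{-1} S_B w = lambda w; keep the k-1 largest eigenvectors
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
order = np.argsort(eigvals.real)[::-1]
W = eigvecs[:, order[:1]].real                   # projection matrix (2 x 1)

# The projected class means are well separated relative to the spread
z1, z2 = (X1 @ W).ravel(), (X2 @ W).ravel()
print(abs(z1.mean() - z2.mean()) / (z1.std() + z2.std()))
```

Only one eigenvalue is numerically non-zero here, consistent with <math>\mathbf{S}_{B}</math> being a rank-one matrix for two classes.<br />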
<br />
=== Variations ===<br />
<br />
Some adaptations and extensions exist for the FDA technique (Source: <ref>R. Gutierrez-Osuna, "Linear Discriminant Analysis" class notes for Intro to Pattern Analysis, Texas A&M University. Available: [http://research.cs.tamu.edu/prism/lectures/pr/pr_l10.pdf]</ref>):<br />
<br />
1) ''Non-Parametric LDA (NPLDA)'' by Fukunaga<br />
<br />
This method does not assume that the Gaussian distribution is unimodal and it is actually possible to extract more than k-1 features (where k is the number of classes).<br />
<br />
2) ''Orthonormal LDA (OLDA)'' by Okada and Tomita<br />
<br />
This method finds projections that are orthonormal in addition to maximizing the FDA objective function. This method can also extract more than k-1 features (where k is the number of classes).<br />
<br />
3) ''Generalized LDA (GLDA)'' by Lowe<br />
<br />
This method incorporates additional cost functions into the FDA objective function. This causes classes with a higher cost to be placed further apart in the lower dimensional representation.<br />
<br />
=== Optical Character Recognition (OCR) using FDA ===<br />
Optical Character Recognition (OCR) is a method to translate scanned, human-readable text into machine-encoded text. In class, we have employed FDA to recognize digits. A paper <ref>Manjunath Aradhya, V.N., Kumar, G.H., Noushath, S., Shivakumara, P., "Fisher Linear Discriminant Analysis based Technique Useful for Efficient Character Recognition", Intelligent Sensing and Information Processing, 2006.</ref> describes the use of FDA to recognize printed documents written in English and Kannada, the fifth most popular language in India. The researchers conducted two types of experiments: one on printed Kannada and English documents and another on handwritten English characters. The first type comprised four experiments: i) clear and degraded characters in specific fonts; ii) characters in various sizes; iii) characters in various fonts; iv) characters with noise. In experiment i, FDA achieved a 98.2% recognition rate with 12 projection vectors on 21,560 samples. In experiment ii, it achieved 96.9% with 10 projection vectors on 11,200 samples. In experiment iii, it achieved 93% with 17 projection vectors on 19,850 samples. In experiment iv, it achieved 96.3% with 14 projection vectors on 20,000 samples. Overall, the recognition performance of FDA was very satisfactory. In the second type of experiment, a total of 12,400 handwriting samples from 200 different writers were collected. With 175 samples used for training, FDA achieved a recognition rate of 92% with 35 projection vectors.<br />
<br />
=== Facial Recognition using FDA ===<br />
<br />
The Fisherfaces method of facial recognition uses PCA and FDA together in a way similar to using PCA alone. However, it is more advantageous than PCA alone because it minimizes variation within each class while maximizing class separation; the PCA-only method is therefore more sensitive to lighting and pose variations. In studies done by Belhumeur, Hespanha, and Kriegman (1997) and Turk and Pentland (1991), this method had a 96% recognition rate. <ref>Bagherian, Elham. Rahmat, Rahmita. Facial Feature Extraction for Face Recognition: a Review. International Symposium on Information Technology, 2008. ITSim2 article number 4631649.</ref><br />
<br />
== Linear and Logistic Regression (Lecture: Oct. 06, 2011) ==<br />
<br />
=== Linear Regression ===<br />
<br />
Both regression and classification aim to find a function h that maps data X to output Y. In regression, <math>\ y </math> is a continuous variable. In classification, <math>\ y </math> is a discrete variable. In linear regression, data is modeled using a linear function, and unknown parameters are estimated from the data. Regression problems are easier to formulate into functions (since <math>\ y </math> is continuous), and it is possible to solve classification problems by treating them like regression problems. In order to do so, the requirement in classification that <math>\ y </math> is discrete must first be relaxed. Once <math>\ y </math> has been found using regression techniques, it is possible to determine the discrete class corresponding to the <math>\ y </math> that has been found, solving the original classification problem. The discrete class is obtained by defining a threshold where <math>\ y </math> values below the threshold belong to one class and <math>\ y </math> values above the threshold belong to another class.<br />
<br />
When running a linear regression we make two assumptions: <br />
<br />
# A linear relationship exists between two variables (i.e. X and Y) <br />
# This relationship is additive (i.e. <math>Y= f_1(x_1) + f_2(x_2) + \dots + f_n(x_n)</math>). Technically, linear regression estimates how much Y changes when X changes by one unit. <br />
<br />
<br />
More formally: a more direct approach to classification is to estimate the regression function <math>\ r(\mathbf{x}) = E[Y | X]</math> without bothering to estimate <math>\ f_k(\mathbf{x}) </math>. For the linear model, we assume either that the regression function <math>r(\mathbf{x})</math> is linear, or that the linear model is a reasonable approximation.<br />
<br />
Here is a simple example. If <math>\ Y = \{0,1\}</math> (a two-class problem), then <math>\, h^*(\mathbf{x})= \left\{\begin{matrix}<br />
1 &\text{, if } \hat r(\mathbf{x})>\frac{1}{2} \\<br />
0 &\mathrm{, otherwise} \end{matrix}\right.</math><br />
<br />
Basically, we can use a linear function<br />
<math>\ f(x, \beta) = y_i = \mathbf{\beta\,}^T \mathbf{x_{i}} + \mathbf{\beta\,_0} </math> , <math>\mathbf{x_{i}} \in \mathbb{R}^{d}</math><br />
and use the least squares approach to fit the function to the given data. This is done by minimizing the following expression:<br />
<br />
<math>\min_{\mathbf{\beta}} \sum_{i=1}^n (y_i - \mathbf{\beta}^T<br />
\mathbf{x_{i}} - \mathbf{\beta_0})^2</math><br />
<br />
For convenience, <math>\mathbf{\beta}</math> and <math>\mathbf{\beta}_0</math> can be combined into a d+1 dimensional vector, <math>\tilde{\mathbf{\beta}}</math>. The term ''1'' is appended to <math>\ x </math>. Thus, the function to be minimized can now be re-expressed as:<br />
<br />
<math>\ LS = \min_{\tilde{\beta}} \sum_{i=1}^{n} (y_i - \tilde{\beta}^T \tilde{x_i} )^2 </math><br />
<br />
<math>\ LS = \min_{\tilde{\beta}} || y - X \tilde{\beta} ||^2 </math><br />
<br />
where<br />
<br />
<math>\tilde{\mathbf{\beta}} = \left( \begin{array}{c}\mathbf{\beta}_{1} \\ \vdots \\ \mathbf{\beta}_{d} \\ \mathbf{\beta}_{0} \end{array} \right) \in \mathbb{R}^{d+1}</math> and <math>\tilde{x} = \left( \begin{array}{c}{x_{1}} \\ \vdots \\ {x}_{d} \\ 1 \end{array} \right) \in \mathbb{R}^{d+1}</math>.<br />
<br />
where <math>\tilde{\mathbf{\beta}}</math> is a (d+1) by 1 matrix (i.e. a d+1 dimensional vector).<br />
<br />
Here <math>\ y </math> and <math>\tilde{\beta}</math> are vectors and <math>\ X </math> is an n by d+1 matrix in which each row represents a data point with a 1 as the last entry. X can also be seen as a matrix<br />
in which each column represents a feature and the <math>\ (d+1)^{th} </math> column is an all-one vector corresponding to <math>\ \beta_0 </math>.<br />
<br />
<math>\ {\tilde{\beta}}</math> that minimizes the error is:<br />
<br />
<math>\ \frac{\partial LS}{\partial \tilde{\beta}} = -2X^T(y-X\tilde{\beta})=0 </math>, which gives us <math>\ {\tilde{\beta}} = (X^TX)^{-1}X^Ty </math>. When <math>\ X^TX</math> is singular we have to use the pseudo-inverse to obtain the optimal <math>\ \tilde{\beta}</math>.<br />
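The singular case can be illustrated with a small numerical sketch in Python/NumPy (the design matrix and targets below are made-up toy data, not from the course): a duplicated feature column makes <math>\ X^TX</math> non-invertible, yet the pseudo-inverse still yields a least-squares solution.<br />

```python
import numpy as np

# Design matrix with a duplicated feature column, so X^T X is singular;
# the last column is the appended column of 1s.
X = np.array([[1., 2., 2., 1.],
              [2., 4., 4., 1.],
              [3., 6., 6., 1.]])
y = np.array([1., 2., 3.])

# The normal equations cannot be solved by inversion: rank(X^T X) = 2 < 4
print(np.linalg.matrix_rank(X.T @ X))        # 2

# The pseudo-inverse gives the minimum-norm least-squares solution
beta = np.linalg.pinv(X) @ y
print(np.allclose(X @ beta, y))              # True: y lies in the column space of X
```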
<br />
Using regression to solve classification problems is not strictly correct mathematically, if we want to be true to classification. However, this method works well in practice if the problem is not too complicated. When we have only two classes (for which the target values are encoded as <math>\ \frac{-n}{n_1} </math> and <math>\ \frac{n}{n_2} </math>, where <math>\ n_i</math> is the number of data points in class i and n is the total number of points in the data set), this method is identical to LDA.<br />
<br />
==== Matlab Example ====<br />
<br />
The following is the code and the explanation for each step.<br />
<br />
Again, we use the data in 2_3.m. <br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
We carry out Principal Component Analysis (PCA) to reduce the dimensionality from 64 to 2.<br />
<br />
>>y = zeros(400,1);<br />
>>y(201:400) = 1;<br />
We let y represent the set of labels coded as 0 and 1.<br />
<br />
>>x=[sample;ones(1,400)];<br />
Construct x by adding a row of ones to the data.<br />
<br />
>>b=inv(x*x')*x*y;<br />
Calculate b, which represents <math>\beta</math> in the linear regression model.<br />
<br />
>>x1=x';<br />
>>for i=1:400<br />
if x1(i,:)*b>0.5<br />
plot(x1(i,1),x1(i,2),'.')<br />
hold on<br />
elseif x1(i,:)*b < 0.5<br />
plot(x1(i,1),x1(i,2),'r.')<br />
end <br />
end<br />
Plot the fitted y values.<br />
<br />
[[File: linearregression.png|center|frame| the figure shows that the classification of the data points in 2_3.m by the linear regression model]]<br />
<br />
==== Practical Usefulness ====<br />
Linear regression in general is not very useful for classification purposes. One of the main problems is that new data may not always have a positive impact on the fitted linear regression, due to the non-linear "binary" form of the classes. Consider the following simple example:<br />
<br />
[[File: linreg1.jpg|center|frame]]<br />
<br />
The decision boundary at <math>r(x)=0.5</math> was added for visualization purposes. Clearly, linear regression classifies this data properly. However, consider adding one more datum:<br />
<br />
[[File: linreg2.jpg|center|frame]]<br />
<br />
This datum actually skews the linear regression fit to the point that it misclassifies some of the data points that should be labelled '1'. This shows how linear regression cannot adapt well to binary classification problems.<br />
<br />
==== General Guidelines for Building a Regression Model ====<br />
<br />
# Make sure all relevant predictors are included. These are based on your research question, theory and knowledge on the topic.<br />
# Combine those predictors that tend to measure the same thing (i.e. as an index).<br />
# Consider the possibility of adding interactions (mainly for those variables with large effects)<br />
# Strategy to keep or drop variables:<br />
## Predictor not significant and has the expected sign -> Keep it<br />
## Predictor not significant and does not have the expected sign -> Drop it<br />
## Predictor is significant and has the expected sign -> Keep it<br />
## Predictor is significant but does not have the expected sign -> Review, you may need more variables, it may be interacting with another variable in the model or there may be an error in the data.<ref>http://dss.princeton.edu/training/Regression101.pdf</ref><br />
<br />
===Logistic Regression===<br />
<br />
Logistic regression is a more advanced method for classification, and is<br />
more commonly used. <br />
In statistics, logistic regression (sometimes called the logistic model or logit model) is used to predict the probability of occurrence of an event by fitting data to a logit function (logistic curve). It is a generalized linear model used for binomial regression. Like many forms of regression analysis, it makes use of several predictor variables that may be either numerical or categorical. For example, the probability that a person has a heart attack within a specified time period might be predicted from knowledge of the person's age, sex and body mass index. Logistic regression is used extensively in the medical and social sciences, as well as in marketing applications such as predicting a customer's propensity to purchase a product or cancel a subscription.<ref>http://en.wikipedia.org/wiki/Logistic_regression</ref><br />
<br />
We can define a function <br /><br />
<math>f_1(x)= P(Y=1| X=x) = (\frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})</math><br />
[[File:Picture1.png|150px|thumb|right|<math>P(Y=1 | X=x)</math>]]<br />
<br />
<br />
This is a valid conditional density function since the two components (<math>f_1</math> and <math>f_2</math>, shown just below) sum to 1 and remain in [0, 1].<br />
<br />
It looks similar to a step function, but<br />
we have relaxed it so that we have a smooth curve, and can therefore take the<br />
derivative.<br />
<br />
The range of this function is (0,1) since<br /><br />
<math>\lim_{x \to -\infty}f_1(\mathbf{x}) = 0</math> and<br />
<math>\lim_{x \to \infty}f_1(\mathbf{x}) = 1</math>.<br />
<br />
As shown on [http://www.wolframalpha.com/input/?i=Plot%5BE^x/%281+%2B+E^x%29,+{x,+-10,+10}%5D%29 this graph] of <math>\ P(Y=1 | X=x) </math>.<br />
<br />
Then we compute the complement of <math>f_1(x)</math>, and get<br /><br />
<br />
<math>f_2(x)= P(Y=0| X=x) = 1-f_1(x) = (\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})</math>. <br />
[[File:Picture2.png |150px|thumb|right|<math>P(Y=0 | X=x)</math>]]<br />
<br />
<br />
Function <math>f_2</math> is commonly called the logistic function, and it behaves like <br /><br />
<math>\lim_{x \to -\infty}f_2(\mathbf{x}) = 1</math> and<br /><br />
<math>\lim_{x \to \infty}f_2(\mathbf{x}) = 0</math>.<br />
<br />
As shown on [http://www.wolframalpha.com/input/?i=Plot%5B1/%281+%2B+E^x%29,+{x,+-10,+10}%5D%29 this graph] of <math>\ P(Y=0 | X=x) </math>.<br />
<br />
Since <math>f_1</math> and <math>f_2</math> specify the conditional distribution, the Bernoulli distribution is appropriate for specifying the likelihood of the class. Conveniently code the two classes via 0 and 1 responses, then the likelihood of <math>y_i</math> for given input <math>x_i</math> is given by,<br />
<br />
<math>f(y_i|\mathbf{x_i}) = (f_1(\mathbf{x_i}))^{y_i} (1-f_1(\mathbf{x_i}))^{1-y_i} = (\frac{e^{\mathbf{\beta\,}^T \mathbf{x_i}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})^{y_i} (\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})^{1-y_i}</math><br />
<br />
Thus y takes value 1 with success probability <math>f_1</math> and value 0 with failure probability <math>1 - f_1</math>. We can use this to derive the likelihood for N training observations, and search for the maximizing parameter <math>\beta</math>. <br />
<br />
In general, we can think of the problem as having a box with some knobs. Inside the box is our objective function which gives the form to classify our input (<math>x_i</math>) to<br />
our output (<math>y_i</math>). The knobs in the box are functioning like the parameters of the objective function. Our job is to find the proper parameters that can minimize the error between our output and the true value. So we have turned our machine learning problem into an optimization problem. <br />
<br />
Since we need to find the parameters that maximize the chance of having our observed data coming from the distribution of <math>f (x|\theta)</math>, we need to introduce Maximum Likelihood Estimation.<br />
<br />
====Maximum Likelihood Estimation====<br />
<br />
Given iid data points <math>({\mathbf{x}_i})_{i=1}^n</math> and a density function <math>f(\mathbf{x}|\mathbf{\theta})</math>, where the form of f is known but the parameters <math>\theta</math> are unknown, the maximum likelihood estimate <math>\theta\,_{ML}</math> is the set of parameters that maximizes the probability of observing <math>({\mathbf{x}_i})_{i=1}^n</math> given <math>\theta\,_{ML}</math>. For example, we may know that the data come from a Gaussian distribution but not know the mean and variance of the distribution. <br />
<br />
<math>\theta_\mathrm{ML} = \underset{\theta}{\operatorname{arg\,max}}\ f(\mathbf{x}|\theta)</math>.<br />
<br />
There was some discussion in class regarding the notation. In the literature, Bayesians write <math>f(\mathbf{x}|\mu)</math>, the conditional probability of x given <math>\mu</math>, while frequentists write <math>f(\mathbf{x};\mu)</math>, the density of x parametrized by <math>\mu</math>. In practice, the two are computed in the same way.<br />
<br />
Our goal is to find theta to maximize <br />
<math>\mathcal{L}(\theta\,) = f(\underline{\mathbf{x}}|\;\theta) = \prod_{i=1}^n f(\mathbf{x_i}|\theta)</math>. where <math>\underline{\mathbf{x}}=\{x_i\}_{i=1}^{n}</math> (The second equality holds because data points are iid.)<br />
<br />
In many cases, it’s more convenient to work with the natural logarithm of the likelihood. (Recall that the logarithm is monotone, so it preserves minima and maxima.)<br />
<math>\ell(\theta)=\ln\mathcal{L}(\theta\,)</math> <br />
<br />
<math>\ell(\theta\,)=\sum_{i=1}^n \ln f(\mathbf{x_i}|\theta)</math><br />
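For a concrete example (a Python/NumPy sketch with simulated data, not from the course): for iid samples from <math>N(\theta, 1)</math>, maximizing the log-likelihood over a grid of candidate <math>\theta</math> recovers the sample mean, which is the well-known closed-form MLE for the Gaussian mean.<br />

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(3.0, 1.0, size=500)          # iid N(theta, 1) samples, theta unknown

def log_likelihood(theta, x):
    # l(theta) = sum_i ln f(x_i | theta) for the N(theta, 1) density
    return np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (x - theta) ** 2)

# Crude maximization over a grid; the maximizer matches the sample mean
grid = np.linspace(0.0, 6.0, 6001)
theta_ml = grid[np.argmax([log_likelihood(t, x) for t in grid])]
print(np.isclose(theta_ml, x.mean(), atol=1e-3))   # True
```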
<br />
Applying Maximum Likelihood Estimation to <math>f(y|\mathbf{x})= (\frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})^{y} (\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})^{1-y}</math>, gives<br />
<br />
<math>\mathcal{L}(\mathbf{\beta\,})=\prod_{i=1}^n (\frac{e^{\mathbf{\beta\,}^T \mathbf{x_i}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})^{y_i} (\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})^{1-y_i}</math><br />
<br />
<math>\ell(\mathbf{\beta\,}) = \sum_{i=1}^n \left[ y_i \ln(P(Y=y_i|X=x_i)) + (1-y_i) \ln(1-P(Y=y_i|X=x_i))\right]<br />
</math><br />
<br />
This is the likelihood function we want to maximize. Note that <math>-\ell(\mathbf{\beta\,})</math> can be interpreted as the cost function we want to minimize. Simplifying, we get:<br />
<br />
<math>\begin{align} {\ell(\mathbf{\beta\,})} & {} = \sum_{i=1}^n \left(y_i ({\mathbf{\beta\,}^T \mathbf{x_i}} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})) + (1-y_i) (\ln{1} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}}))\right) \\[10pt]&{} = \sum_{i=1}^n \left(y_i ({\mathbf{\beta\,}^T \mathbf{x_i}} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})) - (1-y_i) \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})\right) \\[10pt] &{} = \sum_{i=1}^n \left(y_i ({\mathbf{\beta\,}^T \mathbf{x_i}} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})) - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}}) + y_i \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})\right) \\[10pt] &{} = \sum_{i=1}^n \left(y_i {\mathbf{\beta\,}^T \mathbf{x_i}} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})\right) \end{align}</math><br />
<br />
<math>\begin{align} {\frac{\partial \ell}{\partial \mathbf{\beta\,}}}&{} = \sum_{i=1}^n \left(y_i \mathbf{x_i} - \frac{e^{\mathbf{\beta\,}^T \mathbf{x_i}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}} \mathbf{x_i} \right) \\[8pt] & {}= \sum_{i=1}^n \left(y_i \mathbf{x_i} - P(\mathbf{x_i} | \mathbf{\beta\,}) \mathbf{x_i}\right) \end{align}</math><br />
<br />
Now set <math>\frac{\partial \ell}{\partial \mathbf{\beta\,}}</math> equal to 0, and <math> \mathbf{\beta\,} </math> can be numerically solved by Newton's method.<br />
<br />
====Newton's Method====<br />
<br />
Newton's Method (or the Newton-Raphson method) is a numerical method for finding successively better approximations to the roots of a real-valued function; it is useful when the roots cannot be found analytically. <br />
<br />
The goal is to find <math>\mathbf{x}</math> such that <math> f(\mathbf{x}) = 0 </math>; such points are called the roots of the function f. Iteration can be used to solve for x using the following equation:<br />
<math>\mathbf{x_n} = \mathbf{x_{n-1}} - \frac{f(\mathbf{x_{n-1}})}{f'(\mathbf{x_{n-1}})}.\,\!</math><br />
<br />
It takes an initial guess <math>\mathbf{x_0}</math> and steps in the direction <math>\ \frac{f(x_{n-1})}{f'(x_{n-1})}</math> toward a better approximation, producing a new estimate <math>\mathbf{x_n}</math>. Iterating from the initial guess converges (quadratically, near a simple root) to a solution that is sufficiently close to the actual root. Note that this may find only a local solution, and a function may require multiple starting guesses to find all of its roots.<br />
<br />
=====Matlab Example=====<br />
<br />
Below is the Matlab code to find a root of the function <math>\,y=x^2-2500</math> from the initial guess of <math>\,x=90</math>. The roots of this equation are trivially solved analytically to be <math>\,x=\pm 50</math>. <br />
<br />
x=1:100;<br />
y=x.^2 - 2500; %function to find root of<br />
plot(x,y);<br />
<br />
x_opt=90; %starting guess<br />
x_traversed=[];<br />
y_traversed=[];<br />
error=[];<br />
<br />
for i=1:6,<br />
y_opt=x_opt^2-2500;<br />
y_prime_opt=2*x_opt;<br />
<br />
%save results of each iteration<br />
x_traversed=[x_traversed x_opt];<br />
y_traversed=[y_traversed y_opt];<br />
error=[error abs(y_opt)];<br />
<br />
%update minimum<br />
x_opt=x_opt-(y_opt/y_prime_opt);<br />
end<br />
<br />
hold on;<br />
plot(x_traversed,y_traversed,'r','LineWidth',2);<br />
title('Progressions Towards Root of y=x^2 - 2500');<br />
legend('y=x^2 - 2500','Progression');<br />
xlabel('x');<br />
ylabel('y');<br />
<br />
hold off;<br />
figure();<br />
semilogy(1:6,error);<br />
title('Error vs Iteration');<br />
xlabel('Iteration');<br />
ylabel('Absolute Y Error');<br />
<br />
In this example Newton's method converges to the root to within machine precision in only 6 iterations, as can be seen from the plot of the absolute error below.<br />
<br />
[[File:newton_error.png]]<br />
[[File:newton_progression.png]]<br />
<br />
===Advantages/Limitation of Linear Regression ===<br />
<br />
*Linear regression implements a statistical model that gives optimal results when the relationship between the independent variables and the dependent variable is almost linear.<br />
*Linear regression is often inappropriately used to model non-linear relationships.<br />
*Linear regression is limited to predicting numeric output.<br />
*A lack of explanation about what has been learned can be a problem.<br />
<br />
===Advantages of Logistic Regression===<br />
<br />
Logistic regression has several advantages over discriminant analysis: <br />
<br />
* It is more robust: the independent variables don't have to be normally distributed, or have equal variance in each group.<br />
* It does not assume a linear relationship between the IV and DV.<br />
* It may handle nonlinear effects.<br />
* You can add explicit interaction and power terms.<br />
* The DV need not be normally distributed. <br />
* There is no homogeneity of variance assumption. <br />
* Normally distributed error terms are not assumed. <br />
* It does not require that the independent variables be interval. <br />
* It does not require that the independent variables be unbounded.<br />
<br />
===Comparison Between Logistic Regression And Linear Regression===<br />
<br />
Linear regression is a regression where the explanatory variable X and response variable Y are linearly related. Both X and Y can be continuous variables, and for every one unit increase in the explanatory variable, there is a set increase or decrease in the response variable Y. A closed form solution exists for the least squares estimate of <math>\beta</math>.<br />
<br />
Logistic regression is a regression where the explanatory variable X and response variable Y are not linearly related. The response variable provides the probability of occurrence of an event. X can be continuous but Y must be a categorical variable (e.g., can only assume two values, i.e. 0 or 1). For every one unit increase in the explanatory variable, there is a set increase or decrease in the probability of occurrence of the event. No closed form solution exists for the least squares estimate of <math>\beta</math>.<br />
<br />
<br />
In terms of assumptions on the data set: in LDA, we assumed that the probability density function (PDF) of each class was Gaussian and the prior was Bernoulli. However, in logistic regression, we only assume a parametric form for the posterior <math>\ P(Y|X=x) </math> and we ignore the priors. Therefore, we may conclude that logistic regression makes fewer assumptions than LDA.<br />
<br />
==Newton-Raphson Method (Lecture: Oct 11, 2011)==<br />
Previously we derived the log-likelihood function for logistic regression. <br />
<br />
<math>\begin{align} L(\beta\,) = \prod_{i=1}^n \left( (\frac{e^{\mathbf{\beta\,}^T \mathbf{x_i}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})^{y_i}(\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})^{1-y_i} \right) \end{align}</math><br />
<br />
After taking log, we can have:<br />
<br />
<math>\begin{align} \ell(\beta\,) = \sum_{i=1}^n \left( y_i \ln{\frac{e^{\mathbf{\beta\,}^T \mathbf{x_i}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}}} + (1 - y_i) \ln{\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}}} \right) \end{align}</math><br />
<br />
This implies that:<br />
<br />
<math>\begin{align} {\ell(\mathbf{\beta\,})} & {} = \sum_{i=1}^n \left(y_i \left( {\mathbf{\beta\,}^T \mathbf{x_i}} - \ln(1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}) \right) - (1 - y_i)\ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})\right) \end{align}</math><br />
<br />
<math>\begin{align} {\ell(\mathbf{\beta\,})} & {} = \sum_{i=1}^n \left(y_i {\mathbf{\beta\,}^T \mathbf{x_i}} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})\right) \end{align}</math><br />
<br />
Our goal is to find the <math>\beta\,</math> that maximizes <math>{\ell(\mathbf{\beta\,})}</math>. We use calculus to do this, i.e. solve <math>{\frac{\partial \ell}{\partial \mathbf{\beta\,}}}=0</math>. To do this we use the well-known numerical Newton-Raphson method. This is an iterative method in which we calculate the first and second derivative at each iteration.<br /><br />
<br /><br />
<br />
====Newton's Method====<br />
Here is how we usually implement Newton's Method: <math>\mathbf{x_{n+1}} = \mathbf{x_n} - \frac{f(\mathbf{x_n})}{f'(\mathbf{x_n})}.\,\!<br />
</math>. In our particular case, we look for x such that <math>g'(x) = 0</math>, and implement it by <math>\mathbf{x_{n+1}} = \mathbf{x_n} - \frac{f'(\mathbf{x_n})}{f''(\mathbf{x_n})}.\,\!<br />
</math>.<br /><br />
In practice, the convergence speed depends on |F'(x*)|, where F(x) = <math>\mathbf{x} - \frac{f(\mathbf{x})}{f'(\mathbf{x})}.\,\!</math>. The smaller the |F'(x*)| is, the faster the convergence is.<br /><br />
<br /><br />
<br /><br />
The first derivative is typically called the score vector.<br />
<br />
<math>\begin{align} S(\beta\,) {}= {\frac{\partial \ell}{ \partial \mathbf{\beta\,}}}&{} = \sum_{i=1}^n \left(y_i \mathbf{x_i} - \frac{e^{\mathbf{\beta\,}^T \mathbf{x_i}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}} \mathbf{x_i} \right) \\[8pt] \end{align}</math><br />
<br />
<math>\begin{align} S(\beta\,) {}= {\frac{\partial \ell}{ \partial \mathbf{\beta\,}}}&{} = \sum_{i=1}^n \left(y_i \mathbf{x_i} - P(x_i|\beta) \mathbf{x_i} \right) \\[8pt] \end{align}</math><br />
<br />
where <math>\ P(x_i|\beta) = \frac{e^{\beta^T x_i}}{1+e^{\beta^T x_i}} </math><br />
<br />
The negative of the second derivative is typically called the information matrix.<br />
<br />
<math>\begin{align} I(\beta\,) {}= -{\frac{\partial^2 \ell}{\partial \mathbf {\beta\,} \partial \mathbf{\beta\,}^T}}&{} = \sum_{i=1}^n \left(\mathbf{x_i}\mathbf{x_i}^T (\frac{e^{\mathbf{\beta\,}^T \mathbf{x_i}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})(1 - \frac{e^{\mathbf{\beta\,}^T \mathbf{x_i}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}}) \right) \\[8pt] \end{align}</math><br />
<br />
<math>\begin{align} I(\beta\,) {}= -{\frac{\partial^2 \ell}{\partial \mathbf {\beta\,} \partial \mathbf{\beta\,}^T}}&{} = \sum_{i=1}^n \left(\mathbf{x_i}\mathbf{x_i}^T (\frac{e^{\mathbf{\beta\,}^T \mathbf{x_i}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})(\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}}) \right) \\[8pt] \end{align}</math><br />
<br />
<math>\begin{align} I(\beta\,) {}= -{\frac{\partial^2 \ell}{\partial \mathbf {\beta\,} \partial \mathbf{\beta\,}^T}}&{} = \sum_{i=1}^n \left(\mathbf{x_i}\mathbf{x_i}^T (P(x_i|\beta))(1 - P(x_i|\beta)) \right) \\[8pt] \end{align}</math><br />
<br />
again where <math>\ P(x_i|\beta) = \frac{e^{\beta^T x_i}}{1+e^{\beta^T x_i}} </math><br />
<br />
<math>\, \beta\,^{new} \leftarrow \beta\,^{old}-\frac {f(\beta\,^{old})}{f'(\beta\,^{old})} </math><br />
<br />
<br />
We then use the following update formula to calculate successively better estimates of the optimal <math>\beta\,</math>. It typically does not matter much what you use as your initial estimate <math>\beta\,^{(1)}</math> (however, some improper choices of beta will cause I to be a singular matrix).<br />
<br />
<math> \beta\,^{(r+1)} {}= \beta\,^{(r)} + (I(\beta\,^{(r)}))^{-1} S(\beta\,^{(r)} )</math><br />
<br />
====Matrix Notation====<br />
<br />
Let <math>\mathbf{y}</math> be a (n x 1) vector of all class labels. This is called the response in other contexts.<br />
<br />
Let <math>\mathbb{X}</math> be a (n x (d+1)) matrix of all your features. Each row represents a data point. Each column represents a feature/covariate.<br />
<br />
Let <math>\mathbf{p}^{(r)}</math> be a (n x 1) vector with values <math> P(\mathbf{x_i} |\beta\,^{(r)} ) </math><br />
<br />
Let <math>\mathbb{W}^{(r)}</math> be a (n x n) diagonal matrix with <math>\mathbb{W}_{ii}^{(r)} {}= P(\mathbf{x_i} |\beta\,^{(r)} )(1 - P(\mathbf{x_i} |\beta\,^{(r)} ))</math><br />
<br />
The score vector, information matrix and update equation can be rewritten in terms of this new matrix notation, so the first derivative is<br />
<br />
<math>\begin{align} S(\beta\,^{(r)}) {}= {\frac{\partial \ell}{ \partial \mathbf{\beta\,}}}&{} = \mathbb{X}^T(\mathbf{y} - \mathbf{p}^{(r)})\end{align}</math><br />
<br />
And the second derivative is<br />
<br />
<math>\begin{align} I(\beta\,^{(r)}) {}= -{\frac{\partial^{2} \ell}{\partial \mathbf {\beta\,} \partial \mathbf{\beta\,}^T}}&{} = \mathbb{X}^T\mathbb{W}^{(r)}\mathbb{X} \end{align}</math><br />
<br />
Therefore, we can fit the regression problem as follows:<br />
<br />
<math> \beta\,^{(r+1)} {}= \beta\,^{(r)} + (I(\beta\,^{(r)}))^{-1}S(\beta\,^{(r)} ) {}</math><br />
<br />
<math> \beta\,^{(r+1)} {}= \beta\,^{(r)} + (\mathbb{X}^T\mathbb{W}^{(r)}\mathbb{X})^{-1}\mathbb{X}^T(\mathbf{y} - \mathbf{p}^{(r)})</math><br />
<br />
====Iteratively Re-weighted Least Squares====<br />
If we reorganize this updating formula we can see it is really iteratively solving a least squares problem each time with a new weighting.<br />
<br />
<math>\beta\,^{(r+1)} {}= (\mathbb{X}^T\mathbb{W}^{(r)}\mathbb{X})^{-1}(\mathbb{X}^T\mathbb{W}^{(r)}\mathbb{X}\beta\,^{(r)} + \mathbb{X}^T(\mathbf{y} - \mathbf{p}^{(r)}))</math><br />
<br />
<math>\beta\,^{(r+1)} {}= (\mathbb{X}^T\mathbb{W}^{(r)}\mathbb{X})^{-1}\mathbb{X}^T\mathbb{W}^{(r)}\mathbf{z}^{(r)}</math><br />
<br />
where <math> \mathbf{z}^{(r)} = \mathbb{X}\beta\,^{(r)} + (\mathbb{W}^{(r)})^{-1}(\mathbf{y}-\mathbf{p}^{(r)}) </math><br />
<br />
<br />
Recall that linear regression by least squares finds the following minimum: <math>\ \min_{\beta}(y-X \beta)^T(y-X \beta)</math><br />
<br />
Similarly, we can say that <math>\ \beta^{(r+1)}</math> is the solution of a weighted least squares problem in the new space of <math>\ \mathbf{z} </math> (compare the equation for <math>\ \beta^{(r+1)}</math> with the ordinary least squares solution<br />
<math>\ {\tilde{\beta}} = (X^TX)^{-1}X^Ty </math>):<br />
<br />
<math>\beta^{(r+1)} \leftarrow arg \min_{\beta}(\mathbf{z}-X \beta)^T W (\mathbf{z}-X \beta)</math><br />
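The update above can be sketched in a few lines of code. This is a minimal illustration, not from the lecture; the function name <code>irls_logistic</code> and the fixed iteration count are choices made here for the example.<br />

```python
import numpy as np

def irls_logistic(X, y, n_iter=25):
    """Fit binary logistic regression by Newton-Raphson / IRLS.

    X is the n-by-d design matrix (include a column of ones if an
    intercept is wanted) and y holds 0/1 labels.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))    # p^{(r)}
        W = np.diag(p * (1.0 - p))             # W^{(r)}
        # beta^{(r+1)} = beta^{(r)} + (X^T W X)^{-1} X^T (y - p)
        beta = beta + np.linalg.solve(X.T @ W @ X, X.T @ (y - p))
    return beta
```

At convergence the score <math>\mathbb{X}^T(\mathbf{y} - \mathbf{p})</math> is zero, which is one easy way to check the fit. Note that on perfectly separable data the maximum likelihood estimate does not exist and the iterates diverge.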
<br />
====Fisher Scoring Method==== <br />
<br />
Fisher Scoring is a method very similar to Newton-Raphson. It uses the expected information matrix instead of the observed information matrix. This distinction simplifies the problem and, in particular, the computational complexity. To learn more about this method and logistic regression in general, you can take Stat431/831 at the University of Waterloo.<br />
<br />
===Multi-class Logistic Regression===<br />
<br />
In multi-class logistic regression we have ''K'' classes. For two classes ''l'' and ''K'',<br />
<br />
<math>\frac{P(Y=l|X=x)}{P(Y=K|X=x)} = e^{\beta_l^T x}</math><br /><br />
(this is resulting from <br />
<math>f_1(x)= (\frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})</math> and <math>f_2(x)= (\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})</math> )<br />
<br />
We call <math>\log(\frac{P(Y=l|X=x)}{P(Y=K|X=x)}) = (\beta_l-\beta_K)^T x</math>, the log ratio of the posterior probabilities, the logit transformation. The decision boundary between the two classes is the set of points where the logit transformation is 0.<br />
<br />
For each class from 1 to K-1 we then have:<br />
<br />
<math>log(\frac{P(Y=1|X=x)}{P(Y=K|X=x)}) = \beta_1^T x</math><br />
<br />
<math>log(\frac{P(Y=2|X=x)}{P(Y=K|X=x)}) = \beta_2^T x</math><br />
<br />
<math>log(\frac{P(Y=K-1|X=x)}{P(Y=K|X=x)}) = \beta_{K-1}^T x</math><br />
<br />
Note that choosing ''Y=K'' is arbitrary and any other choice is equally valid.<br />
<br />
Based on the above the posterior probabilities are given by: <math>P(Y=k|X=x) = \frac{e^{\beta_k^T x}}{1 + \sum_{i=1}^{K-1}{e^{\beta_i^T x}}}\;\;for \; k=1,\ldots, K-1</math> <br />
<br />
<math> P(Y=K|X=x)=\frac{1}{1+\sum_{i=1}^{K-1}{e^{\beta_i^T x}}} </math><br />
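The posterior formulas above can be checked numerically. Below is a small sketch (the function name and inputs are our own choices): given the K-1 coefficient vectors, it returns the full vector of class posteriors, which sum to one by construction.<br />

```python
import numpy as np

def multiclass_posteriors(betas, x):
    """P(Y=k|X=x) for k = 1..K, taking class K as the reference class.

    betas: list of the K-1 coefficient vectors beta_1, ..., beta_{K-1}.
    """
    scores = np.exp([b @ x for b in betas])    # e^{beta_k^T x} for k = 1..K-1
    denom = 1.0 + scores.sum()                 # shared normalizing constant
    return np.append(scores, 1.0) / denom      # classes 1..K-1, then class K
```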
<br />
===Logistic Regression Vs. Linear Discriminant Analysis (LDA)===<br />
<br />
Logistic Regression Model and Linear Discriminant Analysis (LDA) are widely used for classification. Both models build linear boundaries to classify different groups. Also, the categorical outcome variables (i.e. the dependent variables) must be mutually exclusive. <br />
<br />
LDA generally requires more parameters than Logistic Regression, as quantified below.<br />
<br />
However, these two models differ in their basic approach. While Logistic Regression is more relaxed and flexible in its assumptions, LDA assumes that its explanatory variables are normally distributed, linearly related and have equal covariance matrices for each class. Therefore, it can be expected that LDA is more appropriate if the normality assumptions and equal covariance assumption are fulfilled in its explanatory variables. But in all other situations Logistic Regression should be appropriate. <br />
<br />
<br />
Also, the total number of parameters to compute is different for Logistic Regression and LDA. If the explanatory variables have d dimensions and there are two classes to categorize, we need to estimate <math>\ d+1</math> parameters in Logistic Regression (all elements of the d by 1 <math>\ \beta </math> vector plus the scalar <math>\ \beta_0 </math>) and the number of parameters grows linearly w.r.t. dimension, while we need to estimate <math>2d+\frac{d*(d+1)}{2}+2</math> parameters in LDA (two mean values for the Gaussians, the d by d symmetric covariance matrices, and two priors for the two classes) and the number of parameters grows quadratically w.r.t. dimension. <br />
<br />
<br />
Note that the number of parameters also corresponds to the minimum number of observations needed to compute the coefficients of each function. Techniques do exist though for handling high dimensional problems where the number of parameters exceeds the number of observations. Logistic Regression can be modified using shrinkage methods to deal with the problem of having fewer observations than parameters. When maximizing the log likelihood, we can add a <math>-\frac{\lambda}{2}\sum^{K}_{k=1}\|\beta_k\|_{2}^{2}</math> penalization term where K is the number of classes. The resulting optimization problem is convex and can be solved using the Newton-Raphson method as given in Zhu and Hastie (2004). LDA involves the inversion of a d x d covariance matrix. When d is bigger than n (where n is the number of observations) this matrix has rank n < d and thus is singular. When this is the case, we can either use the pseudo-inverse or perform regularized discriminant analysis, which solves this problem. In RDA, we define a new covariance matrix <math>\, \Sigma(\gamma) = \gamma\Sigma + (1 - \gamma)diag(\Sigma)</math> with <math>\gamma \in [0,1]</math>. Cross validation can be used to calculate the best <math>\, \gamma</math>. More details on RDA can be found in Guo et al. (2006).<br />
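The RDA covariance shrinkage mentioned above is a one-liner; a minimal sketch (the function name is ours):<br />

```python
import numpy as np

def rda_covariance(Sigma, gamma):
    """Regularized covariance Sigma(gamma) = gamma*Sigma + (1-gamma)*diag(Sigma).

    For gamma < 1 this shrinks the off-diagonal entries toward zero, which
    keeps the matrix invertible even when the sample covariance is singular.
    """
    return gamma * Sigma + (1.0 - gamma) * np.diag(np.diag(Sigma))
```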
<br />
<br />
Because the Logistic Regression model has the form <math>\log\frac{f_1(x)}{f_0(x)} = \beta^T{x}</math>, we can clearly see the role of each input variable in explaining the outcome. This is one advantage that Logistic Regression has over other classification methods and is why it is so popular in data analysis. <br />
<br />
<br />
In terms of the performance speed, since LDA is non-iterative, unlike Logistic Regression which uses the iterative Newton-Raphson method, LDA can be expected to be faster than Logistic Regression.<br />
<br />
===Example===<br />
<br />
(Not discussed in class.) One application of logistic regression that has recently been used is predicting the winner of NFL games. Previous predictors, like Yards Per Carry (YPC), were used to build probability models for games. Now, the Success Rate (SR), defined as the percentage of runs in which a team’s point expectancy has improved, is shown to be a better predictor of a team's performance. SR is based on down, distance and yard line and is less susceptible to rare breakaway plays that can be considered outliers. More information can be found at [http://fifthdown.blogs.nytimes.com/2011/09/29/n-f-l-game-probabilities-are-back-with-one-adjustment/].<br />
<br />
== Perceptron ==<br />
<br />
[[Image:Perceptron1.png|right|thumb|300px|Simple perceptron]]<br />
[[Image:Perceptron2.png|right|thumb|300px|Simple perceptron where <math>\beta_0</math> is defined as 1]]<br />
<br />
Perceptron is a simple, yet effective, linear separator classifier. The perceptron is the building block for neural networks. It was invented by Rosenblatt in 1957 at Cornell Labs, and first mentioned in the paper "The Perceptron - a perceiving and recognizing automaton". The perceptron is used on linearly separable data sets. It computes a linear combination of the input features and returns the sign of the result. <br />
<br />
For a 2 class problem, and a set of inputs with ''d'' features, a perceptron will use a weighted sum and it will classify the information using the sign of the result (i.e. it uses a step function as its [http://en.wikipedia.org/wiki/Activation_function activation function] ). The figures on the right give an example of a perceptron. In these examples, <math>\ x^i</math> is the ''i''-th feature of a sample and <math>\ \beta_i</math> is the ''i''-th weight. <math>\beta_0</math> is defined as the bias. The bias alters the position of the decision boundary between the 2 classes. From a geometrical point of view, Perceptron assigns label "1" to elements on one side of vector <math>\ \beta</math> and label "-1" to elements on the other side of <math>\ \beta</math>, where <math>\ \beta</math> is a vector of <math>\ \beta_i</math>s.<br />
<br />
Perceptrons are generally trained using [http://en.wikipedia.org/wiki/Gradient_descent gradient descent]. This type of learning can have 2 side effects:<br />
* If the data sets are well separated, the training of the perceptron can lead to multiple valid solutions.<br />
* If the data sets are not linearly separable, the learning algorithm will never finish.<br />
<br />
Perceptrons are the simplest kind of a feedforward neural network. A perceptron is the building block for other neural networks such as '''Multi-Layer Perceptron (MLP)''' which uses multiple layers of perceptrons with nonlinear activation functions so that it can classify data that is not linearly separable.<br />
<br />
=== History of Perceptrons and Other Neural Models ===<br />
One of the first perceptron-like models is the '''"McCulloch-Pitts Neuron"''' model developed by McCulloch and Pitts in the 1940's <ref> W. Pitts and W. S. McCulloch, "How we know universals: the perception of auditory and visual forms," ''Bulletin of Mathematical Biophysics'', 1947.</ref>. It uses a weighted sum of the inputs that is fed through an activation function, much like the perceptron. However, unlike the perceptron, the weights in the "McCulloch-Pitts Neuron" model are not adjustable, so the "McCulloch-Pitts Neuron" is unable to perform any learning based on the input data.<br />
<br />
As stated in the introduction of the [[#Perceptron | perceptron]] section, the '''Perceptron''' was developed by Rosenblatt around 1960. Around the same time as the perceptron was introduced, the '''Adaptive Linear Neuron (ADALINE)''' was developed by Widrow <ref name="Widrow"> B. Widrow, "Generalization and information storage in networks of adaline 'neurons'," ''Self Organizing Systems'', 1959.</ref>. The ADALINE differs from the standard perceptron by using the weighted sum (the net) to adjust the weights in the learning phase. The standard perceptron uses the output to adjust its weights (i.e. the net after it passed through the activation function). <br />
<br />
Since both the perceptron and ADALINE are only able to handle data that is linearly separable, '''Multiple ADALINE (MADALINE)''' was introduced <ref name="Widrow"/>. MADALINE is a two layer network to process multiple inputs. Each layer contains a number of ADALINE units. The lack of an appropriate learning algorithm prevented more layers of units from being cascaded at the time, and interest in "neural networks" receded until the 1980's, when the backpropagation algorithm was applied to neural networks and it became possible to implement the '''Multi-Layer Perceptron (MLP)'''.<br />
<br />
Many important advances have been boosted by the use of inexpensive computer emulations. Following an initial period of enthusiasm, the field survived a period of frustration and disrepute. During this period when funding and professional support was minimal, important advances were made by relatively few researchers. These pioneers were able to develop convincing technology which surpassed the limitations identified by Minsky and Papert. Minsky and Papert published a book in 1969 in which they summed up a general feeling of frustration (against neural networks) among researchers, which was thus accepted by most without further analysis. Currently, the neural network field enjoys a resurgence of interest and a corresponding increase in funding.<ref><br />
http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html#Historical background<br />
</ref><br />
<br />
== Perceptron Learning Algorithm (Lecture: Oct. 13, 2011) ==<br />
Like all of the learning methods we have seen, learning in a perceptron model is accomplished by minimizing a cost (or error) function, <math>\phi(\boldsymbol{\beta}, \beta_0)</math>. In the perceptron case, the cost function is based on the difference between the output (<math>\mathrm{sign}(\sum_{i=0}^d \beta_i x^{(i)})</math>) and the target. Specifically, we define the cost function <math>\phi(\boldsymbol{\beta}, \beta_0)</math> as the sum of the distances between all misclassified points and the hyper-plane, or the decision boundary. To minimize this cost function, we need to estimate <math>\boldsymbol{\beta}, \beta_0</math>. <br />
<br />
<math>\min_{\beta,\beta_0} \phi(\boldsymbol{\beta}, \beta_0)</math> = {sum of the distances of all misclassified points to the decision boundary}<br />
<br />
The logic is as follows: <br />
<br />
[[File:hyperplane.png|thumb|250px|right| Distance between the point <math>\ x </math> and the decision boundary hyperplane <math>\ L </math> (black line). Note that the vector <math>\ \beta </math> is orthogonal to the decision boundary hyperplane and that points <math>\ x_0, x_1, x_2 </math> are arbitrary points on the decision boundary hyperplane. ]]<br />
<br />
'''1)''' A hyper-plane <math>\,L</math> can be defined as <br />
<br />
<math>\, L=\{x: f(x)=\beta^Tx+\beta_0=0\},</math><br />
<br />
<br />
For any two arbitrary points <math>\,x_1 </math> and <math>\,x_2 </math> on <math>\, L</math>, we have<br />
<br />
<math>\,\beta^Tx_1+\beta_0=0</math>,<br />
<br />
<math>\,\beta^Tx_2+\beta_0=0</math>,<br />
<br />
such that <br />
<br />
<math>\,\beta^T(x_1-x_2)=0</math>.<br />
<br />
Therefore, <math>\,\beta</math> is orthogonal to the hyper-plane and it is the normal vector.<br />
<br />
<br />
'''2)''' For any point <math>\,x_0</math> in <math>\ L,</math> <math>\,\;\;\beta^Tx_0+\beta_0=0</math>, which means <math>\, \beta^Tx_0=-\beta_0</math>.<br />
<br />
<br />
'''3)''' We set <math>\,\beta^*=\frac{\beta}{||\beta||}</math> as the unit normal vector of the hyper-plane <math>\, L</math>. For simplicity we call <math>\,\beta^*</math> the norm vector. The distance of point <math>\,x</math> to <math>\ L</math> is given by<br />
<br />
<math>\,\beta^{*T}(x-x_0)=\beta^{*T}x-\beta^{*T}x_0<br />
=\frac{\beta^Tx}{||\beta||}+\frac{\beta_0}{||\beta||} <br />
=\frac{(\beta^Tx+\beta_0)}{||\beta||}</math><br />
<br />
Where <math>\,x_0</math> is any point on <math>\ L</math>. Hence, <math>\,\beta^Tx+\beta_0</math> is proportional to the distance of the point <math>\,x</math> to the hyper-plane<math>\, L</math>.<br />
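The distance formula in step '''3)''' is easy to verify numerically; here is a minimal sketch (the helper name is ours):<br />

```python
import numpy as np

def signed_distance(beta, beta0, x):
    """Signed distance from x to the hyperplane {x : beta^T x + beta0 = 0}."""
    return (beta @ x + beta0) / np.linalg.norm(beta)
```

For example, with <math>\,\beta = (0, 1)^T</math> and <math>\,\beta_0 = -1</math> (the line <math>\,x_2 = 1</math>), a point two units above the line has signed distance 2, and a point on the line has distance 0.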
<br />
<br />
'''4)''' The distance from a misclassified data point <math>\,x_i</math> to the hyper-plane <math>\, L </math> is<br />
<br />
<math>\,d_i = -y_i(\boldsymbol{\beta}^Tx_i+\beta_0)</math> <br />
<br />
where <math>\,y_i</math> is a target value, such that <math>\,y_i=1</math> if <math>\boldsymbol{\beta}^Tx_i+\beta_0<0</math>, <math>\,y_i=-1</math> if <math>\boldsymbol{\beta}^Tx_i+\beta_0>0</math><br />
<br />
Since we need to find the distance from the hyperplane to the ''misclassified'' data points, we need to add a negative sign in front. When the data point is misclassified, <math>\boldsymbol{\beta}^Tx_i+\beta_0</math> will produce an opposite sign of <math>\,y_i</math>. Since we need a positive sign for distance, we add a negative sign.<br />
<br />
=== Perceptron Learning using Gradient Descent ===<br />
<br />
The gradient descent is an optimization method that finds the minimum of an objective function by incrementally updating its parameters in the negative direction of the derivative of this function. That is, it finds the steepest slope in the D-dimensional space at a given point, and descends down in the direction of the negative slope. Note that unless the error function is convex, it is possible to get stuck in a local minima.<br />
In our case, the objective function to be minimized is classification error and the parameters of this function are the weights associated with the inputs, <math>\beta</math> . The gradient descent algorithm updates the weights as follows:<br />
<br />
<math>\beta^{\mathrm{new}} \leftarrow \beta^{\mathrm{old}} - \rho \frac{\partial Err}{\partial \beta}</math><br />
<br />
<math>\rho </math> is called the ''learning rate''.<br /><br />
The Learning Rate <math> \rho </math> is positively related to the step size of convergence of <math>\min \phi(\boldsymbol{\beta}, \beta_0) </math>. i.e. the larger <math> \rho </math> is, the larger the step size is. Typically, <math>\rho \in [0.1, 0.3]</math>.<br />
<br />
The classification error is defined as the distance of misclassified observations to the decision boundary:<br />
<br />
<br />
To minimize the cost function <math>\phi(\boldsymbol{\beta}, \beta_0) = -\sum\limits_{i\in M} y_i(\boldsymbol{\beta}^Tx_i+\beta_0)</math> where <math>\ M=\{\text {all points that are misclassified}\}</math> <br><br />
<math>\cfrac{\partial \phi}{\partial \boldsymbol{\beta}} = - \sum\limits_{i\in M} y_i x_i </math> and <math> \cfrac{\partial \phi}{\partial \beta_0} = -\sum\limits_{i \in M} y_i</math><br />
<br />
Therefore, the gradient is<br />
<math>\nabla D(\beta,\beta_0)<br />
= \left( \begin{array}{c} -\displaystyle\sum_{i \in M}y_{i}x_i \\ <br />
-\displaystyle\sum_{i \in M}y_{i} \end{array} \right)</math><br />
<br />
<br />
<br />
Using the gradient descent algorithm, updating with one misclassified point <math>\,(x_i, y_i)</math> at a time, we have<br />
<math>\begin{pmatrix}<br />
\boldsymbol{\beta}^{\mathrm{new}}\\<br />
\beta_0^{\mathrm{new}}<br />
\end{pmatrix}<br />
=<br />
\begin{pmatrix}<br />
\boldsymbol{\beta}^{\mathrm{old}}\\<br />
\beta_0^{\mathrm{old}}<br />
\end{pmatrix}<br />
+ \rho<br />
\begin{pmatrix}<br />
y_i x_i\\<br />
y_i<br />
\end{pmatrix}</math><br />
<br />
<br />
If the data is linearly-separable, the solution is theoretically guaranteed to converge to a separating hyperplane in a finite number of iterations. In this situation the number of iterations depends on the learning rate and the margin. However, if the data is not linearly separable there is no guarantee that the algorithm converges. <br />
<br />
The algorithm is initialized with an arbitrary starting value<br />
<br />
<math>\begin{pmatrix}<br />
\beta^0\\<br />
\beta_0^0<br />
\end{pmatrix}</math><br />
<br />
Note that we consider the offset term <math>\,\beta_0</math> separately from <math>\ \beta</math> to distinguish this formulation from those in which the direction of the hyperplane (<math>\ \beta</math>) has been considered.<br />
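Putting the update rule together gives the full perceptron learning algorithm. The sketch below cycles through the data, applies the update <math>\rho\, (y_i x_i,\; y_i)</math> to every misclassified point, and stops once a full pass makes no updates; the function name and the iteration cap are our own choices.<br />

```python
import numpy as np

def train_perceptron(X, y, rho=0.2, max_iter=1000):
    """Perceptron learning on labels y in {-1, +1}.

    A point is misclassified when y_i * (beta^T x_i + beta_0) <= 0;
    each such point moves (beta, beta_0) by rho * (y_i x_i, y_i).
    """
    beta, beta0 = np.zeros(X.shape[1]), 0.0
    for _ in range(max_iter):
        updated = False
        for xi, yi in zip(X, y):
            if yi * (beta @ xi + beta0) <= 0:   # misclassified (or on boundary)
                beta = beta + rho * yi * xi
                beta0 = beta0 + rho * yi
                updated = True
        if not updated:                          # a full clean pass: converged
            return beta, beta0
    return beta, beta0                           # not separable: stop after max_iter
```

On linearly separable data the loop terminates with a separating hyperplane; on non-separable data it simply stops after <code>max_iter</code> passes, matching the caveat noted above.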
<br />
A major concern about gradient descent is that it may get trapped in local optimal solutions. Many works such as [http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=00298667 this paper] by ''Cetin et al.'' and [http://indian.cp.eng.chula.ac.th/cpdb/pdf/research/fullpaper/847.pdf this paper] by ''Atakulreka et al.'' have been done to tackle this issue.<br />
<br />
<br />
'''Features'''<br />
* A Perceptron can only discriminate between two classes at a time.<br />
* When data is (linearly) separable, there are an infinite number of solutions depending on the starting point.<br />
* Even though convergence to a solution is guaranteed if the solution exists, the finite number of steps until convergence can be very large.<br />
* The smaller the gap between the two classes, the longer the time of convergence.<br />
* When the data is not separable, the algorithm will not converge (it should be stopped after N steps).<br />
* A learning rate that is too high will make the perceptron periodically oscillate around the solution unless additional steps are taken.<br />
* The perceptron computes a linear combination of the input features and returns the sign of the result.<br />
* Such classifiers were called perceptrons in the engineering literature in the late 1950s.<br />
* Learning rate affects the accuracy of the solution and the number of iterations directly.<br />
<br />
<br />
'''Separability and convergence'''<br />
<br />
The training set D is said to be linearly separable if there exists a positive constant <math>\,\gamma</math> and a weight vector <math>\,\beta</math> such that <math>\,(\beta^Tx_i+\beta_0)y_i>\gamma </math> for all <math>\,1 \le i \le n</math>. That is, if we say that <math>\,\beta</math> is the weight vector of the Perceptron and <math>\,y_i</math> is the true label of <math>\,x_i</math>, then the signed distance of <math>\,x_i</math> from the decision boundary is greater than a positive constant <math>\,\gamma</math> for any <math>\,(x_i, y_i)\in D</math>.<br />
<br />
<br />
Novikoff (1962) proved that the perceptron algorithm converges after a finite number of iterations if the data set is linearly separable. The idea of the proof is that the weight vector is always adjusted by a bounded amount in a direction that it has a negative dot product with, and thus can be bounded above by <math>O(\sqrt{t})</math>, where t is the number of changes to the weight vector. But it can also be bounded below by <math>\, O(t)</math>, because if there exists an (unknown) satisfactory weight vector, then every change makes progress in this (unknown) direction by a positive amount that depends only on the input vector. This can be used to show that the number t of updates to the weight vector is bounded by <math> (\frac{2R}{\gamma} )^2</math>, where R is the maximum norm of an input vector.<ref>http://en.wikipedia.org/wiki/Perceptron</ref><br />
<br />
=== Choosing a Proper Learning Rate ===<br />
[[File:Learning_rate.jpg|500px|thumb|centre|choosing different learning rates affect the performance of gradient descent optimization algorithm.]]<br />
<br />
Choice of learning rate affects the final result of the gradient descent algorithm. If the learning rate is too small, the algorithm takes too long to converge, which is a problem in situations where time is an important factor. If the learning rate is chosen to be too large, the algorithm can overshoot the optimal point and never converge. In fact, if the step size is larger than twice the reciprocal of the largest eigenvalue of the second derivative matrix (Hessian) of the cost function, then gradient steps will go upward instead of downward. <br />
However, the step size is not the only factor that can cause these kinds of situations: even with the same learning rate, different initial values can lead the algorithm to different outcomes. In general, some prior knowledge can help in the choice of initial values and learning rate.<br />
<br />
There are different methods of choosing the step size in a gradient descent optimization problem. The most common method is choosing a fixed learning rate and finding a proper value for it by trial and error. This is not the most sophisticated method, but it is the easiest.<br />
The learning rate can also be adaptive: its value can differ at each step of the algorithm. This can be an especially helpful approach when dealing with on-line training and non-stationary environments (i.e. when data characteristics vary over time). In such cases the learning rate has to be adapted at each step of the learning algorithm. Different approaches and algorithms for learning rate adaptation can be found in <ref><br />
V P Plagianakos, G D Magoulas, and M N Vrahatis, Advances in convex analysis and global optimization Pythagorion 2000 (2001), Volume: 54, Publisher: Kluwer Acad. Publ., Pages: 433-444.<br />
</ref>.<br />
<br />
The learning rate leading to a local error minimum in the error function in one learning step is optimal. <ref>[Duda, Richard O., Hart, Peter E., Stork, David G. "Pattern Classification". Second Edition. John Wiley & Sons, 2001.]</ref><br />
<br />
=== Application of Perceptron: Branch Predictor ===<br />
<br />
The perceptron can be used for both online and batch learning. Online learning tasks take place in a sequence of trials. In each trial, the learner is given an instance and asked to use its current knowledge to predict a label for the point. In online learning, the true label of the point is revealed to the learner after each prediction. At the end of each trial, the learner can use this feedback on the true label of the instance to improve its hypothesis for future trials.<br />
<br />
Instruction pipelining is a technique to increase throughput in modern microprocessor architectures. A microprocessor instruction can be broken into several independent steps, so in a single CPU clock cycle, several instructions at different stages can be executed at the same time. However, a problem arises with a branch, e.g. an if-else statement: it is not known whether the instructions inside the if- or else-branch will be executed until the condition is evaluated, which stalls the pipeline.<br />
<br />
A branch predictor is used to address this problem. Using a predictor, the pipelined processor predicts the execution path and speculatively executes instructions in the branch. Neural networks are a good technique for prediction; however, they are expensive for microprocessor architecture. One study examined the use of the perceptron, which is less expensive and simpler to implement, as the branch predictor. The inputs are the history of binary outcomes of the executed branches, and the output of the predictor is whether a particular branch will be taken. Every time a branch is executed and its true outcome is known, that outcome can be used to train the predictor. The experiments showed that with 4 Kb of hardware, a global perceptron predictor achieved a misprediction rate of 1.94%, a superior accuracy. <ref>Daniel A. Jimenez , Calvin Lin, "Neural Methods for Dynamic Branch Prediction", ACM Transactions on Computer Systems, 2002</ref><br />
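A toy version of such a perceptron branch predictor can be sketched as follows. The encoding (history bits as ±1, plus a bias input of 1) follows the perceptron setup above, but the history length and training threshold here are illustrative choices, not values taken from the cited paper.<br />

```python
import numpy as np

def predict_branch(w, hist):
    """Predict taken (+1) or not taken (-1) from the history bits (each +/-1)."""
    x = np.append(1.0, hist)        # bias input plus branch-history bits
    return 1 if w @ x >= 0 else -1

def train_branch(w, hist, outcome, theta=20.0):
    """Once the true outcome is known, update the weights perceptron-style."""
    x = np.append(1.0, hist)
    # train on a misprediction, or while the output magnitude is still small
    if predict_branch(w, hist) != outcome or abs(w @ x) <= theta:
        w = w + outcome * x
    return w
```

Repeatedly training on a branch whose outcome is consistent given its history quickly drives the weights to predict that outcome.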
<br />
== Feed-Forward Neural Networks ==<br />
<br />
* The term 'neural networks' is used because historically, it was used to describe the processes of the brain (e.g. synapses).<br />
<br />
* A neural network is a multi-stage regression model which is typically represented by a network diagram (see right).<br />
[[Image:Feed-Forward_neural_network.png|right|thumb|300px|Feed Forward Neural Network]]<br />
<br />
* The feedforward neural network was the first and arguably simplest type of artificial neural network devised. In this network, the information moves in only one direction, forward, from the input nodes, through the hidden nodes (if any) and to the output nodes. There are no cycles or loops in the network.<ref>http://en.wikipedia.org/wiki/Feedforward_neural_network</ref><br />
<br />
* For regression, typically k = 1 (the number of nodes in the last layer), there is only one output unit <math>y_1</math> at the end.<br />
<br />
* For c-class classification, there are typically c units at the end, with the cth unit modelling the probability of class c; each <math>y_c</math> is coded as a 0-1 variable for the cth class.<br />
<br />
* Neural networks are known as ''universal approximators'', where a two-layer feed-forward neural network can approximate any continuous function to an arbitrary accuracy (assuming sufficient hidden nodes exist and that the necessary parameters for the neural network can be found) <ref name="CMBishop">C. M. Bishop, ''Pattern Recognition and Machine Learning''. Springer, 2006</ref>. It should be noted that fitting training data to a very high accuracy may lead to ''overfitting'', which is discussed later in this course.<br />
<br />
* Perceptrons are often used as building blocks in feed-forward neural networks: a feed-forward neural network can be viewed as a layered system of interconnected perceptrons, including one or more hidden layers of units. Stacking perceptrons with nonlinear activation functions in this way lets the network handle problems that a single perceptron cannot.<br />
<br />
=== Backpropagation (Finding Optimal Weights) === <br />
There are many algorithms for calculating the weights in a feed-forward neural network. One of the most used approaches is the backpropagation algorithm. The application of the backpropagation algorithm for neural networks was popularized in the 1980's by researchers like Rumelhart, Hinton and McClelland (even though the backpropagation algorithm had existed before then). <ref>S. Seung, "Multilayer perceptrons and backpropagation learning" class notes for 9.641J, Department of Brain & Cognitive Sciences, MIT, 2002. Available: [http://hebb.mit.edu/courses/9.641/2002/lectures/lecture04.pdf] </ref><br />
<br />
As the learning part of the network (the first part being feed-forward), backpropagation consists of "presenting an input pattern and changing the network parameters to bring the actual outputs closer to the desired teaching or target values." It is one of the "simplest, most general methods for the supervised training of multilayer neural networks." (pp. 288-289) <ref>[Duda, Richard O., Hart, Peter E., Stork, David G. "Pattern Classification". Second Edition. John Wiley & Sons, 2001.]</ref><br />
<br />
For the backpropagation algorithm, we consider three hidden layers of nodes.<br />
<br />
Refer to figure from October 18th lecture where <math>\ l</math> represents the column of nodes in the first column, <br><br />
<math>\ i</math> represents the column of nodes in the second column, and <br><br />
<math>\ k</math> represents the column of nodes in the third column. <br><br />
<br />
We want the output of the feed forward neural network <math>\hat{y}</math> to be as close to the known target value <math>\ y </math> as possible (i.e. we want to minimize the distance between <math>\ y </math> and <math>\hat{y}</math>). Mathematically, we would write it as: <br />
Minimize <math>(\left| y- \hat{y}\right|)^2</math><br />
<br />
Instead of the sign function, which has no derivative, we use the so-called logistic function (a smoothed form of the sign function):<br />
<br />
<math> \sigma(a)=\frac{1}{1+e^{-a}} </math><br />
<br />
<br />
<blockquote> "Notice that if σ is the identity function, then the entire model collapses to a linear model in the inputs. Hence a neural network can be thought of as a nonlinear generalization of the linear model, both for regression and classification." <ref>Friedman, J., Hastie, T. and Tibshirani, R. (2008) “The Elements of Statistical Learning”, 2nd ed, Springer.</ref> </blockquote> <br />
<br />
<br />
The ''logistic function'' is a common [http://en.wikipedia.org/wiki/Logistic_function sigmoid curve]. It can model the S-curve of growth of a population: the initial stage of growth is approximately exponential; then, as saturation begins, the growth slows; and at maturity, growth stops. <br />
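The logistic function and its derivative <math>\sigma'(a) = \sigma(a)(1-\sigma(a))</math> can be written down directly; a minimal sketch:<br />

```python
import numpy as np

def sigmoid(a):
    """Logistic function: a smooth, differentiable surrogate for the sign function."""
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_prime(a):
    """Derivative of the logistic function: sigma'(a) = sigma(a) * (1 - sigma(a))."""
    s = sigmoid(a)
    return s * (1.0 - s)
```

The closed-form derivative is what makes the logistic function convenient for the gradient computations in backpropagation; it can be verified against a finite-difference approximation.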
<br />
<br />
To solve the optimization problem, we take the derivative with respect to weight <math>u_{il}</math>: <br><br />
<math>\cfrac{\partial \left|y- \hat{y}\right|^2}{\partial u_{il}} = \cfrac{\partial \left|y- \hat{y}\right|^2}{\partial a_i} \cdot \cfrac{\partial a_i}{\partial u_{il}}</math> by the chain rule, so <br><br />
<math>\cfrac{\partial \left|y- \hat{y}\right|^2}{\partial u_{il}} = \delta_i \cdot z_l </math> <br />
<br />
where <math> \delta_i = \cfrac{\partial \left|y- \hat{y}\right|^2}{\partial a_i} </math>, which will be computed recursively.<br />
<br />
<math>\ a_i=\sum_{l}z_lu_{il}</math> <br />
<br />
<math>\ z_i=\sigma(a_i)</math><br />
<br />
<math>\ a_j=\sum_{i}z_iu_{ji}</math><br><br />
<br />
== Backpropagation Continued (Lecture: Oct. 18, 2011) ==<br />
[[File:Backprop.png|300px|thumb|right|Nodes from three hidden layers within the neural network are considered for the backpropagation algorithm. Each node has been divided into the weighted sum of the inputs <math>\ a </math> and the output of the activation function <math>\ z </math>. The weights between the nodes are denoted by <math>\ u </math>.]]<br />
<br />
From the figure to the right it can be seen that the input (<math>\ a </math>'s) can be expressed in terms of the weighted sum of the outputs of the previous nodes and output (<math>\ z </math>'s) can be expressed as the input as follows:<br />
<br />
<math>\ a_i = \sum_l z_l u_{il} </math><br />
<br />
<math>\ z_i = \sigma(a_i) </math><br />
<br />
<br />
The goal is to optimize the weights to reduce the L2-norm between the target output values <math>\ y </math> (i.e. the correct labels) and the actual output of the neural network <math>\ \hat{y} </math>:<br />
<br />
<math>\left(y - \hat{y}\right)^2</math><br />
<br />
Since the L2-norm is differentiable, the optimization problem can be tackled by differentiating <math>\left(y - \hat{y}\right)^2</math> with respect to each weight in the hidden layers. By using the chain rule we get:<br />
<br />
<math><br />
\cfrac{\partial \left(y - \hat{y}\right)^2}{\partial u_{il}}<br />
= \cfrac{\partial \left(y - \hat{y}\right)^2}{\partial a_i}\cdot<br />
\cfrac{\partial a_i}{\partial u_{il}} = \delta_{i}z_l<br />
</math><br />
<br />
where <math>\ \delta_i = \cfrac{\partial \left(y - \hat{y}\right)^2}{\partial a_i} </math><br />
<br />
The above equation essentially shows the effect of changes in the input <math>\ a_i </math> on the overall output <math>\ \hat{y} </math> as well as the effect of changes in the weights <math>\ u_{il} </math> on the input <math>\ a_i </math>. In the above equation, <math>\ z_l </math> is a known value (i.e. it can be calculated directly), whereas <math>\ \delta_i </math> is unknown but can be expressed as a recursive definition in terms of <math>\ \delta_j</math>:<br />
<br />
<math>\delta_i = \cfrac{\partial (y - \hat{y})^2}{\partial a_i} = \sum_{j} \cfrac{\partial \left(y - \hat{y}\right)^2}{\partial a_j}\cdot \cfrac{\partial a_j}{\partial a_i} </math><br />
<br />
<math>\delta_i = \sum_{j}\delta_j\cdot\cfrac{\partial a_j}{\partial z_i}\cdot\cfrac{\partial z_i}{\partial a_i}</math><br />
<br />
<math>\delta_i = \sum_{j} \delta_j\cdot u_{ji} \cdot \sigma'(a_i)</math><br />
<br />
where <math> \delta_j = \cfrac{\partial \left(y - \hat{y}\right)^2}{\partial a_j}</math><br />
<br />
The above equation essentially shows the effect of changes in the input <math>\ a_j </math> on the overall output <math>\ \hat{y} </math> as well as the effect of changes in input <math>\ a_i </math> on the input <math>\ a_j </math>. Note that if <math>\sigma(x)</math> is the sigmoid function, then <math>\sigma'(x) = \sigma(x)(1-\sigma(x))</math><br />
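The sigmoid-derivative identity above can be checked numerically. The following is a small NumPy sketch (not part of the original notes) comparing the analytic derivative <math>\sigma(x)(1-\sigma(x))</math> against a central-difference approximation:<br />

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 101)

# Analytic derivative via the identity sigma'(x) = sigma(x)(1 - sigma(x))
analytic = sigmoid(x) * (1 - sigmoid(x))

# Central-difference numerical approximation of the derivative
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)

max_gap = np.max(np.abs(analytic - numeric))
```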
<br />
The recursive definition of <math>\ \delta_i </math> can be considered as a cost function at layer <math>i</math> for achieving the original goal of optimizing the weights to minimize <math>\left(y - \hat{y}\right)^2</math>:<br />
<br />
<math>\delta_i= \sigma'(a_i)\sum_{j}\delta_j \cdot u_{ji}</math>.<br />
<br />
Now considering <math>\ \delta_k</math> for the output layer:<br />
<br />
<math>\delta_k= \cfrac{\partial \left(y - \hat{y}\right)^2}{\partial a_k}</math>.<br />
<br />
where <math>\,a_k = \hat{y}</math> because an activation function is not applied in the output layer. So, our calculation becomes:<br />
<br />
<math>\delta_k = \cfrac{\partial \left(y - \hat{y}\right)^2}{\partial \hat{y}} </math><br />
<br />
<math>\delta_k = -2(y - \hat{y})</math><br /><br />
Each weight can then be updated by gradient descent with learning rate <math>\,\rho</math>:<br />
<math>u_{il} \leftarrow u_{il} - \rho \cfrac{\partial (y - \hat{y})^2}{\partial u_{il}}</math><br />
<br />
Since <math>\ y </math> is known and <math>\ \hat{y} </math> can be computed for each data point (assuming small, random, initial values for the weights of the neural network), <math>\ \delta_k </math> can be calculated and "backpropagated": the <math>\ \delta </math> values for the layer before the output layer are computed from <math>\ \delta_k </math>, then the <math>\ \delta </math> values for the layer before that, and so on. Once all <math>\ \delta </math> values are known, the gradient of the error with respect to each weight <math>\ u </math> is known, and techniques like gradient descent can be used to optimize the weights. However, as the cost function for <math>\ \delta_i </math> shown above is not guaranteed to be convex, convergence to a global minimum is not guaranteed. This also means that changing the order in which the training points are fed into the network, or changing the initial random values for the weights, may lead to different results for the optimized weights (i.e. different local minima may be reached). <br />
<br />
===Overview of Full Backpropagation Algorithm ===<br />
The network weights are updated using the backpropagation algorithm as each training data point <math>\ x</math> is fed into the feedforward neural network (FFNN). The update procedure consists of the following steps: <br />
<br />
*First arbitrarily choose some random weights (preferably close to zero) for your network.<br />
<br />
*Apply <math>\ x </math> to the FFNN's input layer, and calculate the outputs of all input neurons.<br />
<br />
*Propagate the outputs of each hidden layer forward, one hidden layer at a time, and calculate the outputs of all hidden neurons.<br />
<br />
*Once <math>\ x </math> reaches the output layer, calculate the output(s) of all output neuron(s) given the outputs of the previous hidden layer.<br />
<br />
*At the output layer, compute <math>\,\delta_k = -2(y_k - \hat{y}_k)</math> for each output neuron(s).<br />
<br />
*Compute each <math> \delta_i </math> by working backwards from the last hidden layer to the first hidden layer, where <math>\delta_i= \sigma'(a_i)\sum_{j}\delta_j \cdot u_{ji}</math>.<br />
<br />
*Compute <math>\cfrac{\partial \left(y - \hat{y}\right)^2}{\partial u_{il}} = \delta_{i}z_l</math> for all weights <math>\,u_{il}</math>.<br />
<br />
*Then update <math>u_{il}^{\mathrm{new}} \leftarrow u_{il}^{\mathrm{old}} - \rho \cdot \cfrac{\partial \left(y - \hat{y}\right)^2}{\partial u_{il}} </math> for all weights <math>\,u_{il}</math>.<br />
<br />
*Continue for next data points and iterate on the training set until weights converge.<br />
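The steps above can be sketched in NumPy for a network with one hidden layer (sigmoid activations, linear output). This is an illustrative batch-mode sketch on assumed toy data, not code from the lecture:<br />

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)

# Toy regression data with a nonlinear target (assumed for illustration)
X = rng.uniform(-1, 1, size=(100, 2))
y = (X[:, 0] * X[:, 1]).reshape(-1, 1)

# Small random initial weights, as the algorithm prescribes
U1 = rng.normal(0, 0.1, size=(2, 8))   # input -> hidden
U2 = rng.normal(0, 0.1, size=(8, 1))   # hidden -> output (no activation)
rho = 0.1                              # learning rate

losses = []
for epoch in range(200):
    # Forward pass: a = weighted sums, z = activations
    a1 = X @ U1
    z1 = sigmoid(a1)
    yhat = z1 @ U2
    losses.append(np.mean((y - yhat) ** 2))

    # Output-layer delta: d(y - yhat)^2 / d a_k = -2 (y - yhat)
    delta_out = -2.0 * (y - yhat)
    # Backpropagate: delta_i = sigma'(a_i) * sum_j delta_j u_{ji}
    delta_hid = (delta_out @ U2.T) * z1 * (1 - z1)

    # Gradient w.r.t. each weight is delta * z; gradient-descent update
    U2 -= rho * z1.T @ delta_out / len(X)
    U1 -= rho * X.T @ delta_hid / len(X)
```

With these updates the training loss should decrease over the epochs, illustrating the weight-optimization loop described above.<br />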
<br />
====Epochs====<br />
It is common to cycle through all of the data points multiple times in order to reach convergence. An epoch represents one cycle in which all of the data points are fed through the neural network. It is good practice to randomize the order in which the points are fed to the network within each epoch; this prevents the weights from changing in cycles. The number of epochs required for convergence depends greatly on the learning rate and the convergence criteria used.<br />
<br />
===Limitations===<br />
*The convergence obtained from backpropagation learning is very slow.<br />
<br />
*The convergence in backpropagation learning is not guaranteed.<br />
<br />
*The result may converge to any local minimum on the error surface, since stochastic gradient descent operates on an error surface that is not convex.<br />
<br />
*Backpropagation learning requires input scaling or normalization. Inputs are usually scaled into the range of 0.1 to 0.9 for best performance.<ref>http://en.wikipedia.org/wiki/Backpropagation</ref><br />
<br />
*Numerical problems may be encountered when there are a large number of hidden layers, as the errors at each layer may become very small and vanish. <br />
<br />
===Deep Neural Network===<br />
<br />
Increasing the number of units within a hidden layer can increase the "flexibility" of the neural network, i.e. the network is able to fit to more complex functions. Increasing the number of hidden layers on the other hand can increase the "generalizability" of the neural network, i.e. the network is able to generalize well to new data points that it was not trained on. A deep neural network is a neural network with many hidden layers. Deep neural networks were introduced in recent years by the same researchers (Hinton et al. <ref name="HintonDeepNN"> G. E. Hinton, S. Osindero and Y. W. Teh, "A Fast Learning Algorithm for Deep Belief Nets", ''Neural Computation'', 2006. </ref>) that introduced the backpropagation algorithm to neural networks. The increased number of hidden layers in deep neural networks cannot be directly trained using backpropagation, because the errors at each layer will become very small and vanish as stated in the [[#Limitations | limitations]] section. To get around this problem, deep neural networks are trained a few layers at a time (i.e. two layers at a time). This process is still not straightforward as the target values for the hidden layers are not well defined (i.e. it is unknown what the correct target values are for the hidden layers given a data point and a label). ''Restricted Boltzmann Machines (RBM)'' and ''Greedy Learning Algorithms'' have been used to address this issue. For more information about how deep neural networks are trained, please refer to <ref name="HintonDeepNN"/>. A comparison of various neural network layouts including deep neural networks on a database of handwritten digits can be found at [http://yann.lecun.com/exdb/mnist/ THE MNIST DATABASE].<br />
<br />
One of the advantages of deep nets is that the network can be pre-trained using unlabeled data (unsupervised learning) to obtain initial weights for the final training step using labeled data (fine-tuning). Since most of the available data are usually unlabeled, this method gives us a better chance of finding good local optima than using only labeled data to train the parameters of the network (the weights). For more details on unsupervised pre-training and learning in deep nets see <ref>http://jmlr.csail.mit.edu/proceedings/papers/v9/erhan10a/erhan10a.pdf</ref>, <ref>http://www.cs.toronto.edu/~hinton/absps/tics.pdf</ref><br />
<br />
An interesting structure of the deep neural network is where the number of nodes in each hidden layer decreases towards the "center" of the network and then increases again. See figure below for an illustration.<br />
<br />
[[File:DeepNNarchitecture.png|500px|thumb|center|A specific architecture for deep neural networks with a "bottleneck".]]<br />
<br />
The central part, with the smallest number of nodes in the hidden layer, can be seen as a reduced-dimensional representation of the input data features. It would be interesting to compare the dimensionality-reduction effect of this kind of deep neural network to a cascade of PCA.<br />
<br />
It is known that training DNNs is hard <ref>http://ecs.victoria.ac.nz/twiki/pub/Courses/COMP421_2010T1/Readings/TrainingDeepNNs.pdf</ref> since randomly initializing weights for the network and applying gradient descent can lead to poor local minima. In order to better train DNNs, [http://ecs.victoria.ac.nz/twiki/pub/Courses/COMP421_2010T1/Readings/TrainingDeepNNs.pdf Exploring Strategies for Training Deep Neural Networks] looks at 3 principles to better train DNNs:<br />
# Pre-training one layer at a time in a greedy way,<br />
# Using unsupervised learning at each layer,<br />
# Fine-tuning the whole network with respect to the ultimate criterion.<br />
Their experiments show that by providing hints at each layer for the representation, the weights can be initialized such that a more optimal minimum can be reached.<br />
<br />
===Applications of Neural Networks===<br />
* Sales forecasting<br />
* Industrial process control<br />
* Customer research<br />
* Data validation<br />
* Risk management<br />
* Target marketing<br />
<ref><br />
Reference:http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html#Applications of neural networks<br />
</ref><br />
<br />
==Model Selection (Complexity Control)==<br />
<br />
<br />
<br />
Selecting a proper statistical model for a given data set is a well-known problem in pattern recognition and machine learning. Systems with the optimal complexity have a good [http://www.csc.kth.se/~orre/snns-manual/UserManual/node16.html generalization] to yet unobserved data. In the complexity control problem, we are looking for an appropriate model order which gives us the best generalization capability for the unseen data points, while fitting the seen data well. Model complexity here can be defined in terms of over-fitting and under-fitting situations defined in the following section.<br />
<br />
== Over-fitting and Under-fitting ==<br />
[[File:overfitting-model.png|500px|thumb|right| Example of overfitting and underfitting situations. The blue line is a high-degree polynomial which goes through most of the training data points and gives a very low training error, however has a very poor generalization for the unseen data points. The red line, on the other hand, is underfitted to the training data samples.]]<br />
There are two situations which should be avoided in classification and pattern recognition systems:<br />
#[http://en.wikipedia.org/wiki/Overfitting Overfitting]<br />
#Underfitting<br />
<br />
In short, overfitting occurs when the model tries to capture every detail of the training data. This can happen if the model has too many parameters compared to the number of observations. Overfitted models have large testing errors but small training errors. On the other hand, underfitting occurs when the model does not capture the complexity of the data, which results in a large training error.<br />
<br />
Suppose there is no noise in the training data, then we would face no problem with over-fitting, because in this case every training data point lies on the underlying function, and the only goal is to build a model that is as complex as needed to pass through every training data point. <br />
<br />
However, in the real-world, the training data are [http://en.wikipedia.org/wiki/Statistical_noise noisy], i.e. they tend to not lie exactly on the underlying function, instead they may be shifted to unpredictable locations by random noise. If the model is more complex than what it needs to be in order to accurately fit the underlying function, then it would end up fitting most or all of the training data. Consequently, it would be a poor approximation of the underlying function and have poor prediction ability on new, unseen data. <br />
<br />
The danger of overfitting is that the model becomes susceptible to predicting values outside of the range of training data. It can cause wild predictions in multilayer perceptrons, even with noise-free data. To avoid Overfitting, techniques such as Cross Validation and Model Comparison might be necessary. The size of the training set is also important. The training set should have a sufficient number of data points which are sampled appropriately, so that it is representative of the whole data space.<br />
<br />
In a Neural Network, if the number of hidden layers or nodes is too high, the network will have many degrees of freedom and will learn every characteristic of the training data set. That means it will fit the training set very precisely, but will not be able to generalize the commonality of the training set to predict the outcome of new cases.<br />
<br />
Underfitting occurs when the model we picked to describe the data is not complex enough, and has a high error rate on the training set.<br />
There is always a trade-off. If our model is too simple, underfitting could occur and if it is too complex, overfitting can occur.<br />
<br />
=== Different Approaches for Complexity Control ===<br />
<br />
We would like to have a classifier that minimizes the true error rate <math>\ L(h)</math>:<br />
<br />
<math>\ L(h)=Pr\{h(x)\neq y\}</math><br />
<br />
<span id="prediction-error">[[File:Prediction_Error.jpg|240px|thumb|right| Model complexity]]</span><br />
<br />
Because the true error rate cannot be determined directly in practice, we can try using the empirical true error rate (i.e. training error rate): <br />
<br />
<math>\ \hat L(h)= \frac{1}{n} \sum_{i=1}^{n} I(h(x_{i}) \neq y_{i})</math><br />
<br />
However, the empirical true error rate (i.e. training error rate) is biased downward. Minimizing this error rate does not find the best classifier model, but rather ends up overfitting to the training data. Thus, this error rate cannot be used.<br /><br />
<br />
The complexity of a fitted model depends on the degree of the fitting function. In the graph, the region to the left of the critical point corresponds to under-fitting; the inaccuracy there results from the low complexity of the fit. The region to the right of the critical point corresponds to over-fitting, where the model fails to generalize.<br /><br />
<br />
As illustrated in the figure to the right, the training error rate is always less than the true error rate, i.e. "biased downward". Also, the training error will always decrease with an increase in the complexity of the model used to fit the data. This does not reflect the behavior of the true error rate. The true error rate will have a unique minimum as the model complexity changes. <br />
<br />
So, if the training error rate is the only criteria used for picking a model, overfitting can occur. An overfitted model has low training error rate, but is not able to generalize well to new test data points. On the other hand, underfitting can occur when a model that is not complex enough is picked (e.g. using a first order model for data that follows a second order trend). Both training and test error rates will be high in that case. The best choice for the model complexity is where the true error rate reaches its minimum point. Thus, model selection involves ''controlling the complexity'' of the model. The true error rate can be approximated using the test error rate, i.e. the test error follows the same trend that the true error rate does when the model complexity is changed. <br />
In this case, we assume there is a data set <math>\,x_1, . . . ,x_n</math> whose points follow some unknown distribution. To characterize this distribution, we can estimate unknown quantities such as <math>\,f</math>, the mean <math>\,E(x_i)</math>, the variance <math>\,var(x_i)</math>, and so on.<br />
<br />
To estimate <math>\,f</math>, we use an observation function as our estimator. <br />
<br />
<math>\hat{f}(x_1,...,x_n)</math>. <br />
<br />
<math>Bias (\hat{f}) = E(\hat{f}) - f</math><br />
<br />
<math>MSE (\hat{f}) = E[(\hat{f} - f)^2]=Variance (\hat f)+Bias^2(\hat f )</math><br />
<br />
<math>Variance (\hat{f}) = E[(\hat{f} - E(\hat{f}))^2]</math><br />
<br />
If the estimator is unbiased, i.e.<br />
<br />
<math>Bias (\hat{f}) = E(\hat{f}) - f=0,</math><br />
<br />
then <math>MSE (\hat{f})=Variance (\hat{f})</math>, so minimizing <math>MSE (\hat{f})</math> amounts to minimizing the variance.<br />
<br />
More generally, for a fixed <math>MSE (\hat{f})</math>, a low bias must be paid for with a high variance and vice versa; this is the bias-variance trade-off.<br />
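The decomposition <math>MSE (\hat{f})=Variance (\hat{f})+Bias ^2(\hat{f})</math> can be verified numerically for the sample mean, an unbiased estimator of the population mean. This is an illustrative simulation (not part of the original notes):<br />

```python
import numpy as np

rng = np.random.default_rng(42)
true_mean = 3.0
n, trials = 10, 20000

# Each row is one sample of size n; the estimator is the sample average
estimates = rng.normal(true_mean, 1.0, size=(trials, n)).mean(axis=1)

bias = estimates.mean() - true_mean            # near 0: the estimator is unbiased
variance = estimates.var()                     # near 1/n for unit-variance data
mse = np.mean((estimates - true_mean) ** 2)    # mean squared error about the truth
```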
<br />
<br />
<br />
<br />
In order to avoid overfitting, there are two main strategies:<br />
<br />
# ''Estimate the error rate''<br />
## Cross-validation<br />
## Computing error bound ( probability in-equality )<br />
# ''Regularization''<br />
## We basically make the function (model) smooth by limiting the complexity or by limiting the size of the weights.<br />
<br />
===Cross Validation===<br />
<br />
[[File:k-fold.png|350px|thumb|right|Graphical illustration of 4-fold cross-validation. V is the part used for validation and T is used for training.]]<br />
<br />
Cross-validation is an approach for avoiding overfitting while modelling data that bases the choice of model parameters on a portion of the training set, while using the rest of the set for validation, i.e., some of the data is left out when fitting the model. One round of the process involves partitioning the data set into two complementary subsets, fitting the model to one subset (called the training set), and testing the model against the other subset (called the validation or testing subset). This is usually repeated several times using different partitions in order to reduce variability, and the validation results are then averaged over the rounds.<br />
<br />
====LOO: Leave-one-out cross-validation ====<br />
When the dataset is very small, leaving one tenth out depletes our data too much, but making the validation set too small makes the estimate of the true error unstable (noisy). One solution is a kind of round-robin validation: for each complexity setting, learn a classifier on all the training data minus one example and evaluate its error on the remaining example. The leave-one-out error is defined as:<br />
<br />
'''LOO error''': <math>\frac {1}{n} \sum_{i=1}^{n} \mathbf{1} \left(h(x_i; D_{-i})\neq y_i\right)</math><br />
where <math>D_{-i}</math> is the dataset minus the <math>i</math>th example and <math>h(x_i; D_{-i})</math> is the classifier learned on <math>D_{-i}</math>. The LOO error is an unbiased estimate of the error of our learning algorithm (for a given complexity setting) when given <math>n-1</math> examples.<br />
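As an illustrative sketch (assuming a simple 1-nearest-neighbour classifier and toy data, neither of which is specified in the notes), the LOO error can be computed directly:<br />

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 2-class data: the label is the sign of the first coordinate
X = rng.normal(size=(40, 2))
y = (X[:, 0] > 0).astype(int)

def nn_classify(x, X_train, y_train):
    # 1-nearest-neighbour classifier: label of the closest training point
    d = np.sum((X_train - x) ** 2, axis=1)
    return y_train[np.argmin(d)]

# LOO error: learn on D_{-i}, evaluate on the held-out example x_i
n = len(X)
errors = sum(
    nn_classify(X[i], X[np.arange(n) != i], y[np.arange(n) != i]) != y[i]
    for i in range(n)
)
loo_error = errors / n
```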
<br />
====K-Fold Cross Validation====<br />
<br />
Instead of minimizing the training error, here we minimize the validation error.<br /><br />
<br />
A common type of cross-validation that is used for relatively small data sets is K-fold cross-validation, the algorithm for which can be stated as follows:<br />
<br />
Let h denote a classification model to be fitted to a given data set.<br />
<br />
# Randomly partition the original data set into K subsets of approximately the same size. A common choice for K is K = 10.<br />
# For k = 1 to K do the following<br />
## Remove subset k from the data set<br />
## Estimate the parameters of each different classification model based only on the remaining data points. Denote the resulting function by h(k)<br />
## Use h(k) to predict the data points in subset k. Denote by <math>\begin{align}\hat L_k(h)\end{align}</math> the observed error rate.<br />
# Compute the average error <math>\hat L(h) = \frac{1}{K} \sum_{k=1}^{K} \hat L_k(h)</math><br />
<br />
The best classifier is the model that results in the lowest average error rate.<br />
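The procedure above can be sketched for polynomial regression models of different degrees. The data here are simulated from a quadratic underlying function, so this is an illustrative sketch rather than code from the notes:<br />

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples from a quadratic underlying function (assumed toy data)
X = np.sort(rng.uniform(-1, 1, 60))
y = 1.0 + 2.0 * X - 3.0 * X**2 + rng.normal(0, 0.1, size=60)

def kfold_error(degree, K=10):
    # Average held-out squared error of a degree-`degree` polynomial fit
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, K)
    errs = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        coeffs = np.polyfit(X[train], y[train], degree)  # fit h(k) without fold k
        pred = np.polyval(coeffs, X[test])               # predict fold k
        errs.append(np.mean((y[test] - pred) ** 2))
    return np.mean(errs)

# Compare an underfitting (1), well-matched (2), and overfitting (8) model
errors = {d: kfold_error(d) for d in (1, 2, 8)}
```

The well-matched quadratic model achieves a lower average validation error than the underfitting linear model, matching the model-selection criterion above.<br />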
<br />
A common variation of k-fold cross-validation uses a single observation from the original sample as the validation data, and the remaining observations as the training data. This is then repeated such that each sample is used once for validation. It is the same as a K-fold cross-validation with K being equal to the number of points in the data set, and is referred to as leave-one-out cross-validation. <ref> stat.psu.edu/~jiali/course/stat597e/notes2/percept.pdf</ref><br />
<br />
====Alternatives to Cross Validation for model selection:====<br />
# Akaike Information Criterion (AIC): This approach ranks models by their AIC values. The model with the minimum AIC is chosen. The AIC value is: <math>AIC = 2k - 2\log(L_{max})</math>, where <math>k</math> is the number of parameters and <math>L_{max}</math> is the maximum value of the likelihood function of the model. This selection method penalizes the number of parameters.<ref>http://en.wikipedia.org/wiki/Akaike_information_criterion</ref><br />
# Bayesian Information Criterion (BIC): It is similar to AIC but penalizes the number of parameters even more. The BIC value is: <math>BIC = k\log(n) - 2\log(L_{max})</math>, where <math>n</math> is the sample size.<ref>http://en.wikipedia.org/wiki/Bayesian_information_criterion</ref><br />
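A minimal sketch of the two criteria, assuming the maximized log-likelihood has already been computed elsewhere:<br />

```python
import numpy as np

def aic(k, log_lmax):
    # AIC = 2k - 2 log(L_max): lower is better
    return 2 * k - 2 * log_lmax

def bic(k, n, log_lmax):
    # BIC = k log(n) - 2 log(L_max): lower is better
    return k * np.log(n) - 2 * log_lmax

# With the same maximized likelihood, an extra parameter costs 2 under AIC
# but log(n) under BIC, so BIC penalizes complexity more once n > e^2 (~7.4).
n, log_lmax = 100, -50.0
aic_gap = aic(3, log_lmax) - aic(2, log_lmax)        # cost of one extra parameter
bic_gap = bic(3, n, log_lmax) - bic(2, n, log_lmax)
```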
<br />
== Model Selection Continued (Lecture: Oct. 20, 2011) ==<br />
<br />
=== Error Bound Computation ===<br />
Apart from cross validation, another approach for estimating the error rates of different models is to find a bound to the error. This works well theoretically to compare different models, however, in practice the error bounds are not a good indication of which model to pick because the error bounds are not ''tight''. This means that the actual error observed in practice may be a lot better than what was indicated by the error bounds. This is because the error bounds indicate the worst case errors and by only comparing the error bounds of different models, the worst case performance of each model is compared, but not the overall performance under normal conditions. <br />
<br />
=== Penalty Function ===<br />
Another approach for model selection to avoid overfitting is to use ''regularization''. Regularization involves adding extra information or restrictions to the problem in order to prevent overfitting. This additional information can be in the form of a function penalizing high complexity (penalty function). So in regularization, instead of minimizing the squared error alone we attempt to minimize the squared error plus a penalty function. A common penalty function is the euclidean norm of the parameter vector multiplied by some scaling parameter. The scaling parameter allows for balancing the relative importance of the two terms. <br /> This means minimizing the following new objective function:<br /><br />
<math> \left|y-\hat{y}\right|^2+f(\theta)</math><br /><br />
where <math>\ \theta</math> denotes the model parameters and <math>\ f(\theta)</math> is the penalty function. The penalty function should increase as the model increases in complexity; in this way it counteracts the downward bias of the training error rate. There is no single optimal choice of penalty function, but all reasonable choices increase as the complexity and the size of the estimates increase. <br />
<br />
There is no optimal choice for the penalty function but they all seek to solve the same problem. Suppose you have models of order 1,2,...,K such that the models of class k-1 are a subset of the models in class k. An example of this is linear regression where a model of order k is the model with the first k explanatory covariates. If you do not include a penalty term and minimize the squared error alone you will always choose the largest most complex model (K). But the problem with this is the gain from including more complexity might be incredibly small. The gain in accuracy may in fact be no better than you would expect from including a covariate drawn from a N(0,1) distribution. If this is the case then clearly we don't want to include such a covariate. And in general if the increase in accuracy is below a certain level then it is preferable to stay with the simpler model. By adding a penalty term, no matter how small it is, you know at least at some point these insignificant gains in accuracy will be outweighed by increase in penalty. By effectively choosing and scaling your penalty function you can have your objective function approximate the true error as opposed to the training error.<br />
<br /><br />
<br />
==== Example: Penalty Function in Neural Network Model Selection ====<br />
<br />
In MLP neural networks, the activation function is of the form of a logistic function, where the function behaves almost linearly when the input is close to zero (i.e., the weights of the neural network are close to zero), while the function behaves non-linearly as the magnitude of the input increases (i.e., the weights of the neural network become larger). In order to penalize additional model complexity (i.e., unnecessary non-linearities in the model), large weights will be penalized by the penalty function.<br />
<br />
The objective function to minimize with respect to the weights <math>\ u_{ji}</math> is:<br /><br />
<br />
<math>\ Reg=\left|y-\hat{y}\right|^2 + \lambda\sum_{j,i}u_{ji}^2</math> <br />
If the weights start to grow, the penalty term <math>\lambda\sum_{j,i}u_{ji}^2</math> becomes larger, so any reduction in <math>\left|y-\hat{y}\right|^2</math> gained by enlarging the weights is traded off against the penalty.<br />
<br />
The derivative of the objective function with respect to the weights <math>\ u_{ji}</math> is:<br /><br />
<math>\cfrac{\partial Reg}{\partial u_{ji}} = \cfrac{\partial \left|y-\hat{y}\right|^2}{\partial u_{ji}}+2\lambda u_{ji}</math> <br />
<br />
This objective function is used during [http://en.wikipedia.org/wiki/Gradient_descent gradient descent]. In practice, cross validation is used to determine the value of <math>\ \lambda</math> in the objective function.<br /><br />
<br />
Cross-validation can be used to choose <math>\lambda</math>. Starting from the least "complex" model, which is the linear model, the complexity is allowed to grow gradually until the validation error begins to rise.<br />
<br />
We want a non-linear model, but not one that is too curvy.<br />
<br />
==== Penalty Functions in Practice ====<br />
In practice, we only apply the penalty function to the parametrized terms. That is, the bias term is not regularized, since it is simply the DC component and is not associated with a feature. Although this makes little difference, the concept is clear that the bias term should not be considered when determining the relative weights of the features.<br />
<br />
In particular, we update the weights as follows:<br />
<br />
<math><br />
u_{ji} := <br />
\begin{cases} <br />
u_{ji} - \alpha \cfrac{\partial \left|y-\hat{y}\right|^2}{\partial u_{ji}} & \text{bias term}\\<br />
u_{ji} - \alpha \left( \cfrac{\partial \left|y-\hat{y}\right|^2}{\partial u_{ji}}+2\lambda u_{ji} \right) & \text{otherwise}<br />
\end{cases}<br />
</math><br />
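The effect of the penalty can be illustrated on a linear model, the simplest possible "network"; the same shrinkage behaviour carries over to MLP weights. This is a sketch on assumed toy data, not code from the lecture:<br />

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression problem: y = X w + b + noise (assumed for illustration)
X = rng.normal(size=(200, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + 4.0 + rng.normal(0, 0.1, size=200)

# Minimize |y - Xw - b|^2 / n + lambda |w|^2 by gradient descent;
# the bias b is deliberately NOT regularized.
lam, alpha = 0.1, 0.05
w = np.zeros(5)
b = 0.0
for _ in range(500):
    resid = X @ w + b - y
    grad_w = 2 * X.T @ resid / len(X) + 2 * lam * w   # penalty term applies to w
    grad_b = 2 * resid.mean()                          # no penalty on the bias
    w -= alpha * grad_w
    b -= alpha * grad_b
```

The regularized weights come out shrunk toward zero relative to the true coefficients, while the unregularized bias is free to match the data's offset.<br />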
<br />
== Radial Basis Function Neural Network (RBF NN) ==<br />
[http://en.wikipedia.org/wiki/Radial_basis_function_network Radial Basis Function Network] (RBF) NN is a type of neural network with only one hidden layer in addition to an input and output layer. Each node within the hidden layer uses a radial basis activation function, hence the name of the RBF NN. A radial basis function is a real-valued function whose value depends only on the distance from a center. One of the most commonly used radial basis functions is the Gaussian. The weights from the input layer to the hidden layer are always "1" in an RBF NN, while the weights from the hidden layer to the output layer are adjusted during training. The output unit implements a weighted sum of the hidden unit outputs. The hidden layer of an RBF NN is nonlinear while the output layer is linear. Due to their nonlinear approximation properties, RBF NNs are able to model complex mappings, which perceptron-based neural networks can only model by means of multiple hidden layers. An RBF NN can be trained without backpropagation since its output weights have a closed-form solution. RBF NNs have been successfully applied to a large diversity of applications including interpolation, chaotic time-series modeling, system identification, control engineering, electronic device parameter modeling, channel equalization, speech recognition, image restoration, shape-from-shading, 3-D object modeling, motion estimation and moving object segmentation, data fusion, etc. <ref>www-users.cs.york.ac.uk/adrian/Papers/Others/OSEE01.pdf</ref><br />
<br />
====The Network System====<br />
<br />
1. Input: <br />n data points <math>\mathbf{x}_i\in \mathbb{R}^d, \quad i=1,...,n</math><br />
2. Basis functions ('''the single hidden layer'''): <br />
<math>\mathbf{\Phi}_{n\times m}</math>, where <math>m</math> is the number of neurons/basis functions that project the original data points into a new space. <br />
There are many choices for the basis function; a commonly used one is the Gaussian radial basis:<br />
<math>\phi_j(\mathbf{x}_i)=e^{-|\mathbf{x}_i-\mathbf{\mu}_j|^2}</math><br />
3. Weights associated with the last layer: <math>\mathbf{W}_{m\times k}</math>, where <math>k</math> is the number of classes in the output <math>\mathbf{Y}</math>.<br />
4. Output: <math>\mathbf{Y}</math>, where<br />
<math>y_k(x)=\sum_{j=1}^{m}W_{jk}\,\phi_j(x)</math><br />
Alternatively, the output <math>\mathbf{Y}</math> can be written as<br />
<math><br />
Y=\Phi W<br />
</math><br />
<br />
where<br />
<br />
:<math>\hat{Y}_{n,k} = \left[ \begin{matrix}<br />
\hat{y}_{1,1} & \hat{y}_{1,2} & \cdots & \hat{y}_{1,k} \\<br />
\hat{y}_{2,1} & \hat{y}_{2,2} & \cdots & \hat{y}_{2,k} \\<br />
\vdots &\vdots & \ddots & \vdots \\<br />
\hat{y}_{n,1} & \hat{y}_{n,2} & \cdots & \hat{y}_{n,k}<br />
\end{matrix}\right] </math> is the matrix of output variables. <br />
<br />
:<math>\Phi_{n,m} = \left[ \begin{matrix}<br />
\phi_{1}(\mathbf{x}_1) & \phi_{2}(\mathbf{x}_1) & \cdots & \phi_{m}(\mathbf{x}_1) \\<br />
\phi_{1}(\mathbf{x}_2) & \phi_{2}(\mathbf{x}_2) & \cdots & \phi_{m}(\mathbf{x}_2) \\<br />
\vdots & \vdots & \ddots & \vdots \\<br />
\phi_{1}(\mathbf{x}_n) & \phi_{2}(\mathbf{x}_n) & \cdots & \phi_{m}(\mathbf{x}_n)<br />
\end{matrix}\right] </math> is the matrix of Radial Basis Functions.<br />
<br />
:<math>W_{m,k} = \left[ \begin{matrix}<br />
w_{1,1} & w_{1,2} & \cdots & w_{1,k} \\<br />
w_{2,1} & w_{2,2} & \cdots & w_{2,k} \\<br />
\vdots & \vdots & \ddots & \vdots \\<br />
w_{m,1} & w_{m,2} & \cdots & w_{m,k}<br />
\end{matrix}\right] </math> is the matrix of weights.<br />
<br />
Here, <math>k</math> is the number of outputs, <math>n</math> is the number of data points, and <math>m</math> is the number of hidden units. If <math>k = 1</math>, <math>\hat Y</math> and <math>W</math> are column vectors. If m = n, then <math>\mathbf{\mu}_i = \mathbf{x}_i</math>, so <math>\phi_{i}</math> checks to see how similar the two data points are.<br />
<br />
<math>Y=\Phi W</math>, where <math>Y</math> and <math>\Phi</math> are known while <math>W</math> is unknown.<br />
The objective function is <math>\psi=\left\| Y-\Phi W\right\|^2 </math> and we want <math> \underset{W}{\mbox{min}} \left\| Y-\Phi W\right\|^2 </math>. The optimal weights are therefore given in closed form by <math>W=(\Phi^T \Phi)^{-1}\Phi^TY</math><br />
<br />
==== Network Training====<br />
To construct m basis functions, first cluster data points into m groups. Then find the centre of each cluster <math>\mu_1</math> to <math>\mu_m</math>.<br /><br />
<br />
'''Clustering: the K-means algorithm''' <ref>This section is taken from Wikicourse notes stat441/841 fall 2010.</ref><br /><br />
K-means is a commonly used technique for clustering observations into groups by minimizing the distance between each observation and the centre of its assigned cluster. The most common K-means algorithm is referred to as [http://en.wikipedia.org/wiki/Lloyd%27s_algorithm Lloyd's algorithm]: <br /><br />
<br />
# Select the number of clusters, m.<br />
# Randomly select m observations from the n observations to serve as the m initial centres. (Alternative: randomly assign all data points to clusters and use the means of those clusters as the initial centres.)<br />
# For each of the remaining observations, compute the distance to each centre and assign the observation to the cluster with the minimum distance.<br />
# Update each cluster centre by computing the mean of all observations in that cluster.<br />
# Repeat Steps 3 and 4 until the differences between the old and new cluster centres are acceptably small.<br />
<br />
Note: K-means can be sensitive to the initially selected centres, so it may be useful to run K-means several times and use prior knowledge to select the best clustering.<br />
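The clustering steps above can be sketched in NumPy (an illustrative implementation of Lloyd's algorithm, not the code used in class; the two-blob data and the choice of m are made up):<br />

```python
import numpy as np

def kmeans(X, m, n_iter=100, seed=0):
    """Lloyd's algorithm: cluster the rows of X into m groups."""
    rng = np.random.default_rng(seed)
    # Step 2: randomly select m observations as the initial centres
    centres = X[rng.choice(len(X), size=m, replace=False)]
    for _ in range(n_iter):
        # Step 3: assign each point to the nearest centre
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 4: recompute each centre as the mean of its cluster
        new = np.array([X[labels == j].mean(axis=0) for j in range(m)])
        # Step 5: stop once the centres no longer move appreciably
        if np.allclose(new, centres):
            break
        centres = new
    return centres, labels

# Two well-separated blobs; K-means should recover their means.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (50, 2)),
               rng.normal(5.0, 0.1, (50, 2))])
centres, labels = kmeans(X, m=2)
```

Running K-means with several seeds and keeping the best clustering (as the note suggests) amounts to calling `kmeans` with different `seed` values and comparing within-cluster distances.<br />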
<br />
Having constructed the basis functions, next minimize the objective function with respect to <math>\mathbf{W}</math>:<br /><br />
<math> \underset{W}{\mbox{min}}\; \left\| Y-\Phi W\right\|_2^{2}</math><br />
<br />
The solution to the problem is<br />
<math>\ W=(\Phi^T \Phi)^{-1}\Phi^T Y </math><br />
<br />
Matlab example (training a feedforward network on the ionosphere data with the Neural Network Toolbox):<br />
<br />
clear all;<br />
clc;<br />
load ionosphere.mat;<br />
P=ionosphere(:,1:(end-1));<br />
P=P';<br />
T=ionosphere(:,end);<br />
T=T';<br />
net=newff(minmax(P),[4,1],{'logsig','purelin'},'trainlm'); <br />
net.trainParam.show=100;<br />
net.trainParam.mc=0.9; <br />
net.trainParam.mu=0.05; <br />
net.trainParam.mu_dec=0.1;<br />
net.trainParam.mu_inc=5;<br />
net.trainParam.lr=0.5;<br />
net.trainParam.goal=0.01; <br />
net.trainParam.epochs=5000; <br />
net.trainParam.max_fail=10;<br />
net.trainParam.min_grad=1e-20; <br />
net.trainParam.mem_reduc=2;<br />
net.trainParam.alpha=0.1;<br />
net.trainParam.delt_inc=1;<br />
net.trainParam.delt_dec=0.1;<br />
net=init(net); <br />
[net,tr]=train(net,P,T); <br />
A = sim(net,P); <br />
E = T - A; <br />
disp('the training error:')<br />
MSE=mse(E)<br />
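For comparison, the closed-form least-squares RBF fit described above can be sketched in a few lines of NumPy (an illustrative 1-D toy example; the data, the number of centres, and their even placement are all made up — K-means centres could be used instead):<br />

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression data: y = sin(x) + noise
n = 80
x = np.linspace(0, 2 * np.pi, n)
y = np.sin(x) + rng.normal(0, 0.1, n)

# Choose m centres (evenly spaced here, purely for illustration)
m = 10
mu = np.linspace(0, 2 * np.pi, m)

# Phi_{n,m}: phi_j(x_i) = exp(-|x_i - mu_j|^2)
Phi = np.exp(-(x[:, None] - mu[None, :]) ** 2)

# Optimal weights W = (Phi^T Phi)^{-1} Phi^T Y, solved via least squares
W, *_ = np.linalg.lstsq(Phi, y, rcond=None)

y_hat = Phi @ W
train_mse = np.mean((y - y_hat) ** 2)
```

Using `np.linalg.lstsq` rather than forming the inverse explicitly is numerically safer but computes the same least-squares solution as the formula above.<br />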
<br />
===Single Basis Function vs. Multiple Basis Functions===<br />
Suppose the data points belong to a mixture of Gaussian distributions.<br /><br />
<br />
Under the '''single basis''' function approach, each class in <math>\mathbf{Y}</math> is represented by a single basis function. This approach is similar to linear discriminant analysis. <br />
<br />
Compare <math>y_k(x)=\sum_{j=1}^{m}(W_{jk}*\phi_j(x))</math><br /><br />
with <math>P(Y|X)=\frac{P(X|Y)*P(Y)}{P(X)}</math>. <br /> Here, the basis function <math>\mathbf{\phi}_{j}</math> can be thought of as equivalent to <math>\frac{P(X|Y)}{P(X)}</math>.<br /><br />
<br />
Under the '''multiple basis''' function approach, a layer of basis functions is placed between <math>\mathbf{Y}</math> and <math>\mathbf{X}</math>. The probability function of the joint distribution of <math>\mathbf{X}</math>, <math>\mathbf{J}</math> and <math>\mathbf{Y}</math> is<br />
<br />
<math>\,P(X,J,Y)=P(Y)*P(J|Y)*P(X|J)</math><br />
<br />
Here, instead of using a single Gaussian to represent each class, we use a mixture of Gaussians.<br />
The probability function of <math>\mathbf{Y}</math> conditional on <math>\mathbf{X}</math> is<br />
<br />
<math>P(Y|X)=\frac{P(X,Y)}{P(X)}=\frac{\sum_{j}{P(X,J,Y)}}{P(X)}</math><br />
<br />
Multiplying both the numerator and the denominator by <math>\ P(J) </math> yields <br />
<br />
<math>\ P(Y|X)=\sum_{j}{P(J|X)*P(Y|J)}</math><br /><br />
where <math>\ P(J|X)</math> gives the probability that a given data point X lies in Gaussian J, and <math>\ P(Y|J)</math> gives the probability that Gaussian J belongs to the given class.<br />
<br />
<br />
since<br />
<math>\ P(J|X)=\frac{P(X|J)*P(J)}{P(X)}</math> <br />
and <math>\ P(Y|J)=\frac{P(J|Y)*P(Y)}{P(J)}</math><br />
<br />
If the weights in the radial basis neural network satisfy the properties of a probability function, then the basis function <math>\mathbf{\phi}_j</math> can be thought of as <math>\ P(J|X)</math>, representing the probability that <math>\mathbf{x}</math> belongs to Gaussian j; and the weight matrix W can be thought of as <math>\ P(Y|J)</math>, representing the probability that a data point belongs to class k given that the point comes from Gaussian j.<br />
<br />
In conclusion, given a mixture of Gaussian distributions, the multiple basis function approach is better than the single basis function approach, since the former produces a non-linear boundary.<br />
<br />
== RBF Network Complexity Control (Lecture: Oct. 25, 2011) ==<br />
<br />
When performing model selection, overfitting is a common issue. As model complexity increases, there comes a point where the model becomes worse and worse at fitting real data even though it fits the training data better: it becomes too sensitive to small perturbations in the training data that should be treated as noise to allow flexibility in the general case. In this section we will show that training error (empirical error from the training data) is a poor estimator for true error and that minimizing training error will increase complexity and result in overfitting. We will show that test error (empirical error from the test data) is a better estimator of true error. This will be done by estimating a model <math> \hat f </math> given training data <math> T=\{(x_i,y_i)\}^n_{i=1}</math>.<br />
<br />
<br />
First, some notation is defined. <br />
<br />
The assumption for the training data set is that it consists of the true model values <math>\ f(x_i) </math> plus some additive Gaussian noise <math>\ \epsilon_i </math>:<br />
<br />
<math>\ y_i = f(x_i)+\epsilon_i</math> where <math>\ \epsilon \sim N(0,\sigma^2)</math><br />
<br />
<math>\ y_i = true\,model + noise</math><br />
<br />
===Important Notation===<br />
<br />
Let:<br />
*<math>\displaystyle f(x)</math> denote the ''true model''.<br />
*<math>\hat f(x)</math> denote the ''prediction/estimated model'', which is generated from a training data set <math>\displaystyle T = \{(x_i, y_i)\}^n_{i=1}</math>. The observations are noisy.<br /><br />
Remark: <math>\hat f(x_i) = \hat y_i</math>.<br /><br />
*<math>\displaystyle err</math> denote the ''empirical error'' based on actual data points. This can be either test error or training error depending on the data points used. It is based on the squared differences <math>(y-\hat{y})^2 </math>.<br />
*<math>\displaystyle Err </math> denote the ''true error'' or ''generalization error'', and is what we are trying to minimize. It is based on the squared differences <math>(f-\hat{f})^2 </math>.<br />
*<math>\displaystyle MSE=E[(\hat f(x)-f(x))^2]</math> denote the ''mean squared error''.<br />
<br />
We use the training data to estimate our model parameters.<br />
<br />
<math>T=\{(x_i,y_i)\}_{i=1}^n</math><br />
<br />
<br />
For a given point <math>y_0</math>, the expectation of the empirical error is:<br />
<br />
<math> \begin{align}<br />
<br />
E[(\hat{y_0}- y_0)^2] &= E[(\hat{f_0}- f_0 -\epsilon_0)^2] \\<br />
&=E[(\hat{f_0}-f_0)^2 + \epsilon_0^2 - 2 \epsilon_0 (\hat{f_0}-f_0)] \\<br />
&=E[(\hat{f_0}-f_0)^2] + E[\epsilon_0^2] - 2 E [ \epsilon_0 (\hat{f_0}-f_0)] \\<br />
&=E[(\hat{f_0}-f_0)^2] + \sigma^2 - 2 E [ \epsilon_0 (\hat{f_0}-f_0)] <br />
\end{align}<br />
</math><br />
<br />
This formula partitions the empirical error into the true error and other error terms. Our goal is to select the model that minimizes the true error, so we must understand the effects of these other error terms if we are to use the empirical error as an estimate for the true error. <br />
<br />
The first term is essentially true error. The second term is a constant. The third term is problematic, since in general this expectation is not 0. We will break this into 2 cases to simplify the third term.<br />
<br />
=====Case 1: Estimating Error using Data Points from Test Set=====<br />
In Case 1, the empirical error is test error and the data points used to calculate test error are from the test set, not the training set. That is, <math>y_0 \notin T </math>.<br />
<br />
We can rewrite the third term in the following way, using the fact that <math>y_0</math> has expectation <math>f_0</math>, the true value, which is a constant and not random.<br />
<br />
<math> \begin{align} <br />
E [ \epsilon_0 (\hat{f_0}-f_0)] &= E [ (y_0-f_0) (\hat{f_0}-f_0)] \\<br />
& = cov{(y_0,\hat{f_0})}<br />
\end{align}<br />
</math><br />
<br />
(The covariance appears because <math>\displaystyle y_0</math> is a new point, so <math>\hat f</math> and <math>\displaystyle y_0</math> are independent.)<br />
<br />
Note that <math>\ f_0 </math> is the mean of <math>\ y_0 </math>.<br />
<br />
Since <math>y_0</math> is not part of the training set, it is independent of the model <math>\hat{f_0}</math> generated by the training set. Therefore,<br />
<br />
<math>y_0 \notin T \to y_0 \perp \hat{f} </math><br />
<br />
<math>\ cov{(y_0,\hat{f}_0)}=0</math><br />
<br />
<br />
The equation for the expectation of empirical error simplifies to the following:<br />
<br />
<math>E[(y_0-\hat{y_0})^2] = E[(f_0-\hat{f_0})^2] + \sigma^2 </math><br />
<br />
<br />
This result applies to every output value in the test data set, so we can generalize this equation by summing over all m data points that have NOT been seen by the model:<br />
<br />
<math>\begin{align}<br />
\sum_{i=1}^m{(y_i-\hat{y_i})^2} &= \sum_{i=1}^m{(f_i-\hat{f_i})^2} + m \sigma^2 \\<br />
err &= Err + m \sigma^2 \\<br />
& = Err + constant\\<br />
\end{align}<br />
</math><br />
<br />
Rearranging to solve for true error, we get<br />
<br />
<math>\ Err = err - m \sigma^2</math><br />
<br />
We see that test error is a good estimator for true error, since the two differ only by a constant additive value. Minimizing test error is therefore equivalent to minimizing true error. Moreover, the true error is less than the empirical error, and there is no term that rewards unnecessary complexity. This is the justification for cross-validation.<br />
<br />
To avoid over-fitting or under-fitting using cross-validation, a validation data set is selected so that it is independent of the estimated model.<br />
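A quick Monte Carlo check illustrates the relationship: on fresh test points, the average empirical error sits above the average true error by almost exactly <math>\sigma^2</math>. (The true model, the polynomial fit, and all numbers below are made up purely for illustration; the errors are per-point averages, so the relation reads <math>err \approx Err + \sigma^2</math>.)<br />

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5

def f(x):
    return np.sin(2 * x)          # a made-up "true model"

# Training set: y_i = f(x_i) + eps_i
x_tr = np.linspace(0, 3, 30)
y_tr = f(x_tr) + rng.normal(0, sigma, x_tr.size)

# Fit a cubic polynomial as a stand-in for \hat f
coef = np.polyfit(x_tr, y_tr, 3)

# A large independent test set (points NOT seen in training)
m = 200_000
x_te = rng.uniform(0, 3, m)
y_te = f(x_te) + rng.normal(0, sigma, m)

err = np.mean((y_te - np.polyval(coef, x_te)) ** 2)   # empirical test error
Err = np.mean((f(x_te) - np.polyval(coef, x_te)) ** 2)  # true error
```

Since the test noise is independent of the fitted model, the cross term vanishes and `err` differs from `Err` by (approximately) the constant `sigma**2`, exactly as derived above.<br />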
<br />
===Case 2: Estimating Error using Data Points from Training Set===<br />
<br />
In Case 2, the data points used to calculate error are from the training set, so <math>\ y_0 \in T </math>, i.e. <math>\ (x_i, y_i)</math> is in the training set. We will show that this results in a worse estimator for true error.<br />
<br />
Now <math>\ y_0</math> has been used to estimate <math>\ \hat{f}</math> so they are not independent. We use [http://en.wikipedia.org/wiki/Stein's_lemma Stein's lemma] to simplify the term <math>\ E[\epsilon_0 (\hat{f_0} - f_0)]</math>.<br />
<br />
Stein's Lemma states that if <math>\ x \sim N(\theta,\sigma^2)</math> and <math>\ g(x)</math> is differentiable, then <br />
<br />
<math>E\left[g(x) (x - \theta)\right] = \sigma^2 E \left[ \frac{\partial g(x)}{\partial x} \right] </math><br />
<br />
Substitute <math>\ \epsilon_0</math> for <math>\ x</math> and <math>\ (\hat{f_0}-f_0)</math> for <math>\ g(x)</math>. Note that <math>\ \hat{f_0}</math> is a function of the noise, since as noise changes, <math>\hat{f_0}</math> will change. Using Stein's Lemma, we get:<br />
<br />
<math><br />
\begin{align}<br />
E[\epsilon_0 (\hat{f_0}-f_0)] &= \sigma^2 E \left[ \frac{\partial (\hat{f_0}-f_0)}{\partial \epsilon_0} \right]\\<br />
&=\sigma^2 E\left[\frac{\partial \hat{f_0}}{\partial \epsilon_0}\right]\\<br />
&=\sigma^2 E\left[\frac{\partial \hat{f_0}}{\partial y_0}\right]\\<br />
&=\sigma^2 E\left[D_0\right]<br />
\end{align}<br />
</math><br />
<br />
<br />
Remark: since <math>\ \epsilon_0 = y_0 - f_0</math> and <math>\ f_0</math> is fixed, <math> \frac{\partial \epsilon_0}{\partial y_0} = 1 </math>, so by the chain rule <math> \frac{\partial \hat{f_0}}{\partial \epsilon_0} = \frac{\partial \hat{f_0}}{\partial y_0} </math>. <br /><br />
<br />
Also, <math> \frac{\partial f_0}{\partial \epsilon_0} = 0 </math> because <math>\ f_0</math> is a constant, not a function of the noise.<br /> <br />
<br />
<br />
We take <math>\ D_0 = \frac{\partial \hat{f_0}}{\partial y_0}</math>, where <math>\ D_0</math> represents the derivative of the fitted model with respect to the observations. The equation for the expectation of empirical error becomes:<br />
<br />
<math>E[(y_0-\hat{y_0})^2] = E[(f_0-\hat{f_0})^2] + \sigma^2 - 2 \sigma^2 E[D_0] </math><br />
<br />
Generalizing the equation for all n data points in the training set:<br />
<br />
<math><br />
\sum_{i=1}^n{(y_i-\hat{y_i})^2} = \sum_{i=1}^n{(f_i-\hat{f_i})^2} + n \sigma^2 - 2 \sigma^2 \sum_{i=1}^n{D_i}<br />
</math><br />
<br />
Based on the notation defined above, we then have:<br />
<br />
<math><br />
err = Err + n \sigma^2 - 2 \sigma^2 \sum_{i=1}^n{D_i}<br />
</math><br />
<br />
<math>Err = err - n \sigma^2 + 2 \sigma^2 \sum_{i=1}^n{D_i}</math><br />
<br />
This equation for the true error is called [http://www.reference.com/browse/Stein%27s+unbiased+risk+estimate Stein's unbiased risk estimator (SURE)]. It is an unbiased estimator of the mean-squared error of a given estimator, in a deterministic estimation scenario. In other words, it provides an indication of the accuracy of a given estimator. This is important since, in deterministic estimation, the true mean-squared error of an estimator generally depends on the value of the unknown parameter and thus cannot be determined completely. <br />
<br />
Note that <math>\ D_i</math> depends on complexity of the model. It measures how sensitive the model is to small perturbations in a single <math>\ y_i</math> in the training set. As complexity increases, the model will try to chase every little change and will be more sensitive to such perturbations. Minimizing training error without accounting for the impact of this term will result in overfitting. Thus, we need to know how to find <math>\ D_i</math>. Below we show an example, applying SURE to RBFs, where computing <math>\ D_i</math> is straightforward.<br />
<br />
=== SURE for RBF Network Complexity Control===<br />
Problem: Assuming we want to fit our data using a radial basis function network, how many radial basis functions should be used? The network size involves a trade-off between the approximation quality, which usually improves as the network grows, and the training effort, which increases with the network size. Moreover, overly complex models can generalize poorly (overfitting), which calls for small networks. Furthermore, in terms of hardware or software realization, smaller networks occupy less area due to reduced memory needs. Hence, controlling the network size is one major task during training. For further information about RBF network complexity control check [http://www.dice.ucl.ac.be/Proceedings/esann/esannpdf/es2007-13.pdf]<br />
<br />
We can use Stein's unbiased risk estimator (SURE) to give us an approximation for how many RBFs to use.<br />
<br />
The SURE equation is<br />
<br />
<math>\mbox{Err}=\mbox{err} - n\sigma^2 + 2\sigma^2\sum_{i=1}^n D_i</math><br />
<br />
where <math>\ Err </math> is the true error, <math>\ err </math> is the empirical error, <math>\ n</math> is the number of training samples, <math>\ \sigma^2</math> is the variance of the noise of the training samples and <math>\ D_i</math> is the derivative of the fitted model output with respect to the observation, as shown below<br />
<br />
<math>D_i=\frac{\partial \hat{f_i}}{\partial y_i}</math><br />
<br />
Optimal Number of Basis Functions in an RBF Network<br />
<br />
The optimal number of basis functions should be chosen in order to minimize the estimated true error <math>\ Err </math>.<br />
<br />
The formula for an RBF network is:<br />
<br />
<math>\hat{f}=\Phi W</math><br />
<br />
where <math>\ \hat{f}</math> is a matrix of RBFN outputs for each training sample, <math>\ \Phi</math> is the matrix of neuron outputs for each training sample, and <math>\ W</math> is the weight vector between each neuron and the output. Suppose we have m + 1 neurons in the network, where one has a constant function.<br />
<br />
Given the training labels <math>\ Y</math> we define the empirical error and minimize it<br />
<br />
<math>\underset{W}{\mbox{min}} |Y-\Phi W|^2</math><br />
<br />
<math>\, W=(\Phi^T \Phi)^{-1} \Phi^T Y</math><br />
<br />
<math>\hat{f}=\Phi(\Phi^T \Phi)^{-1} \Phi^T Y</math><br />
<br />
<br />
For simplification let <math>\ H</math> be the ''hat matrix'' defined as<br />
<br />
<math>\, H=\Phi(\Phi^T \Phi)^{-1} \Phi^T</math><br />
<br />
Our optimal output then becomes<br />
<br />
<math>\hat{f}=H Y</math><br />
<br />
We now calculate <math>D</math> for RBF networks specifically. Setting <math>\frac{\partial err}{\partial W}</math> equal to zero yields the least squares solution <math>\ W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math>, so the fitted values are <math>\hat{Y} = \hat{f} = \Phi W = HY</math>, with <math>\ H = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}</math> the hat matrix as defined above.<br />
<br />
<br />
Consider a single fitted value of the network. In this case we can write:<br />
<math>\hat f_i=\,H_{i1}y_1+\,H_{i2}y_2+\cdots+\,H_{ii}y_i+\cdots+\,H_{in}y_n</math>.<br />
<br />
Note here that <math>\,H</math> depends on the input vector <math>\displaystyle x_i</math> but not on the observation <math>\displaystyle y_i</math>. <br />
<br />
Since <math>\hat f_i=\sum_{j}\,H_{ij}Y_j</math>, taking the derivative of <math>\ \hat f_i</math> with respect to <math>\displaystyle y_i</math> readily gives:<br />
<br />
<math>D_i= \frac{\partial \hat f_i}{\partial y_i}=\frac{\partial [HY]_i}{\partial y_i}=\,H_{ii}</math><br />
<br />
so that <math>\sum_{i=1}^n \frac {\partial \hat f_i}{\partial y_i}=\sum_{i=1}^n \,H_{ii}</math><br />
<br />
<br />
Here we recall that <math>\sum_{i=1}^n\,D_{i}= \sum_{i=1}^n \,H_{ii}= \,Trace(H)</math>, the sum of the diagonal elements of <math>\,H</math>. Using the permutation property of the trace function we can further simplify the expression as follows:<br />
<math>\,Trace(H)= Trace(\Phi(\Phi^{T}\Phi)^{-1}\Phi^{T})= Trace(\Phi^{T}\Phi(\Phi^{T}\Phi)^{-1})=m</math>, by the trace cyclical permutation property, where <math>\displaystyle m</math> is the number of basis functions in the RBF network (and hence <math>\displaystyle \Phi</math> has dimension <math>\displaystyle n \times m</math>).<br><br />
<br />
====Sketch of Trace Cyclical Property Proof:====<br />
For <math>\, A_{mn}, B_{nm}, Tr(AB) = \sum_{i=1}^{n}\sum_{j=1}^{m}A_{ij}B_{ji} = \sum_{j=1}^{m}\sum_{i=1}^{n}B_{ji}A_{ij} = Tr(BA)</math>.<br><br />
With that in mind, for <math>\, A_{nn}, B_{nn} = CD, Tr(AB) = Tr(ACD) = Tr(BA)</math> (from above) <math>\, = Tr(CDA)</math>.<br><br><br />
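The identity <math>\,Trace(H)=m</math> is easy to verify numerically (here a random full-rank matrix stands in for an actual RBF design matrix <math>\Phi</math>; the dimensions are arbitrary):<br />

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 7
Phi = rng.normal(size=(n, m))                 # any full-rank n x m design matrix

# Hat matrix H = Phi (Phi^T Phi)^{-1} Phi^T
H = Phi @ np.linalg.inv(Phi.T @ Phi) @ Phi.T

# Its trace equals the number of basis functions m, not n
trace_H = np.trace(H)
```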
<br />
Note that <math>\displaystyle \Phi</math> is a projection of the input matrix <math>\,X</math> onto the set spanned by the <math>\,m</math> basis functions. Sometimes an extra term <math>\displaystyle \Phi_0</math> with no input is included to represent the intercept of the fitted model; if an intercept is included, then <math>\,Trace(H)= m+1</math>.<br />
<br />
<br />
The SURE equation then becomes<br />
<br />
<math>\, \mbox{Err}=\mbox{err} - n\sigma^2 + 2\sigma^2(m+1)</math><br />
<br />
As the number of RBFs <math>\ m</math> increases, the empirical error <math>\ err</math> decreases, but the penalty term <math>2\sigma^2(m+1)</math> increases. An optimal true error <math>\ Err </math> can be found by increasing <math>\ m</math> until <math>\ Err </math> begins to grow. At that point the estimate of the minimum true error has been reached.<br />
<br />
The value of m that gives the minimum true error estimate is the optimal number of basis functions to be implemented in the RBF network, and hence is also the optimal degree of complexity of the model. <br />
<br />
One way to estimate the noise variance is<br />
<br />
<math>\hat{\sigma}^2=\frac{\sum (y-\hat{y})^2}{n-1}</math><br />
<br />
This application of SURE is straightforward because minimizing Radial Basis Function error reduces to a simple least squares estimator problem with a linear solution. This makes computing <math>\ D_i</math> quite simple. In general, <math>\ D_i</math> can be much more difficult to solve for.<br />
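Putting the pieces together, SURE-based selection of <math>m</math> can be sketched as follows (an illustrative toy problem: the data, the Gaussian bandwidth, and the range of <math>m</math> are made up, and <math>\sigma^2</math> is assumed known here rather than estimated):<br />

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = sin(x) + noise, with known noise variance sigma^2
n, sigma = 100, 0.2
x = np.linspace(0, 2 * np.pi, n)
y = np.sin(x) + rng.normal(0, sigma, n)

def design(x, m, h=0.6):
    """Gaussian RBF design matrix with m centres plus an intercept column."""
    mu = np.linspace(x.min(), x.max(), m)
    Phi = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * h ** 2))
    return np.column_stack([np.ones_like(x), Phi])

sure = []
for m in range(1, 21):
    Phi = design(x, m)
    W, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    err = np.sum((y - Phi @ W) ** 2)          # empirical (training) error
    # SURE with intercept: Err = err - n*sigma^2 + 2*sigma^2*(m+1)
    sure.append(err - n * sigma ** 2 + 2 * sigma ** 2 * (m + 1))

best_m = 1 + int(np.argmin(sure))             # m minimizing the SURE estimate
```

As the text describes, `err` keeps falling as `m` grows, but the penalty term eventually dominates, so the SURE curve has an interior minimum.<br />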
<br />
=== RBF Network Complexity Control (Alternate Approach) ===<br />
<br />
An alternate approach (not covered in class) to tackling RBF Network complexity control is controlling the complexity by similarity <ref name="Eickhoff">R. Eickhoff and U. Rueckert, "Controlling complexity of RBF networks by similarity," ''Proceedings of European Symposium on Artificial Neural Networks'', 2007</ref>. In <ref name="Eickhoff" />, the authors suggest looking at the similarity between the basis functions multiplied by their weight by determining the cross-correlations between the functions. The cross-correlation is calculated as follows:<br />
<br />
<math>\ \rho_{ij} = \frac{E[g_i(x)g_j(x)]}{\sqrt{E[g^2_i(x)]E[g^2_j(x)]}} </math><br />
<br />
where <math>\ E[] </math> denotes the expectation and <math>\ g_i(x) </math> and <math>\ g_j(x) </math> would denote two of the basis functions multiplied by their respective weights.<br />
<br />
If the cross-correlation between two functions is high, <ref name="Eickhoff" /> suggests that the two basis functions be replaced with one basis function that covers the same region of both basis functions and that the corresponding weight of this new basis function be the average of the weights of the two basis functions. For the case of Gaussian radial basis functions, the equations for finding the new weight (<math>\ w_{new} </math>), mean (<math>\ c_{new} </math>) and variance (<math>\ \sigma_{new} </math>) are as follows:<br />
<br />
<math>\ w_{new} = \frac{w_i + w_j}{2} </math><br />
<br />
<math>\ c_{new} = \frac{1}{w_i \sigma^n_i + w_j \sigma^n_j}(w_i \sigma^n_i c_i + w_j \sigma^n_j c_j)</math><br />
<br />
<math>\ \sigma^2_{new} = \left(\frac{\sigma_i + \sigma_j}{2}+ \frac{min(||m-c_i||,||m-c_j||)}{2}\right)^2</math><br />
<br />
where <math>\ n </math> denotes the input dimension and <math>\ m </math> denotes the total number of radial basis functions.<br />
<br />
This process is repeated until the cross-correlation between the basis functions falls below a certain threshold, which is a tunable parameter. <br />
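One merge step for 1-D Gaussian basis functions can be sketched as below (the sample distribution, parameter values, and threshold are invented; the expectations in <math>\rho_{ij}</math> are approximated by sample averages, and the variance update <math>\sigma_{new}</math> would follow the formula in the text):<br />

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 5000)    # sample points standing in for the data

def g(x, w, c, s):
    """Weighted Gaussian basis function w * exp(-(x - c)^2 / (2 s^2))."""
    return w * np.exp(-(x - c) ** 2 / (2 * s ** 2))

# Two similar (strongly overlapping) weighted basis functions
w_i, c_i, s_i = 1.0, 4.0, 1.0
w_j, c_j, s_j = 0.8, 4.3, 1.1

gi, gj = g(x, w_i, c_i, s_i), g(x, w_j, c_j, s_j)

# Cross-correlation rho_ij, with E[.] estimated by sample means
rho = np.mean(gi * gj) / np.sqrt(np.mean(gi ** 2) * np.mean(gj ** 2))

# Merge step (performed when rho exceeds a chosen threshold);
# n_dim is the input dimension n from the formulas above
n_dim = 1
w_new = (w_i + w_j) / 2
c_new = (w_i * s_i ** n_dim * c_i + w_j * s_j ** n_dim * c_j) / (
    w_i * s_i ** n_dim + w_j * s_j ** n_dim)
```

The merged centre lands between the two original centres, weighted toward the basis function with the larger weight and spread.<br />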
<br />
Note 1) Though not extensively discussed in <ref name="Eickhoff" />, this approach to RBF Network complexity control presumably requires a starting RBF Network with a large number of basis functions.<br />
<br />
Note 2) This approach does not require the repeated implementation of differently sized RBF Networks to determine the empirical error, unlike the approach using SURE. However, the SURE approach is backed up by theory to find the number of radial basis functions that optimizes the true error and does not rely on some tunable threshold. It would be interesting to compare the results of both approaches (in terms of the resulting RBF Network obtained and the test error).<br />
<br />
<br />
===Generalized SURE for Exponential Families===<br />
As we know, Stein’s unbiased risk estimate (SURE) is limited to the independent, identically distributed (i.i.d.) Gaussian model. In some recent work, however, researchers have obtained a SURE counterpart for general exponential families, which extends the technique to a much wider range of applications. <br />
<br />
You may look at Yonina C. Eldar, Generalized SURE for Exponential Families: Applications to Regularization, IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 57, NO. 2, FEBRUARY 2009 for more information.<br />
<br />
===Further Reading===<br />
Fully Tuned Radial Basis Function Neural Networks for Flight Control<br />
<ref><br />
http://www.springer.com/physics/complexity/book/978-0-7923-7518-0;jsessionid=985F21372AC7AE1B654F1EADD11B296F.node3<br />
</ref><br />
<br />
Paper about RBFN for multi-task learning <ref>http://books.nips.cc/papers/files/nips18/NIPS2005_0628.pdf</ref><br />
<br />
Radial Basis Function (RBF) Networks <ref>http://documents.wolfram.com/applications/neuralnetworks/index6.html</ref> <br />
<br />
An Example of RBF Networks <ref>http://reference.wolfram.com/applications/neuralnetworks/ApplicationExamples/12.1.2.html</ref><br />
<br />
This paper suggests an objective approach in determining proper samples to find good RBF networks with respect to accuracy <ref>http://www.wseas.us/e-library/conferences/2009/hangzhou/MUSP/MUSP41.pdf</ref>.<br />
<br />
== Support Vector Machines (Lecture: Oct. 27, 2011) ==<br />
<br />
[[Image:SVM.png|right|thumb|A series of linear classifiers, H2 represents a SVM, where the SVM attempts to maximize the margin, the distance between the closest point in each data set and the linear classifier.]]<br />
<br />
[http://en.wikipedia.org/wiki/Support_vector_machine Support vector machines] (SVMs), also referred to as max-margin classifiers, are learning systems that use a hypothesis space of linear functions in a high dimensional feature space, trained with a learning algorithm from optimization theory that implements a learning bias derived from statistical learning theory. SVMs are kernel machines based on the principle of structural risk minimization, which are used in applications of regression and classification; however, they are mostly used as binary classifiers. Although the subject can be said to have started in the late seventies (Vapnik, 1979), it has been receiving increasing attention from researchers. It is such a powerful method that in the few years since its introduction it has outperformed most other systems in a wide variety of applications, especially in pattern recognition.<br />
<br />
The current standard incarnation of SVM is known as "soft margin" and was proposed by Corinna Cortes and Vladimir Vapnik [http://en.wikipedia.org/wiki/Vladimir_Vapnik]. In practice the data is not usually linearly separable. Although theoretically we can make the data linearly separable by mapping it into higher dimensions, the issues of how to obtain the mapping and how to avoid overfitting are still of concern. A more practical approach to classifying non-linearly separable data is to add some error tolerance to the separating hyperplane between the two classes, meaning that a data point in class A can cross the separating hyperplane into class B by a certain specified distance. This more generalized version of SVM is the so-called "soft margin" support vector machine and is generally accepted as the standard form of SVM over the hard margin case in practice today. [http://en.wikipedia.org/wiki/Support_vector_machine#Soft_margin]<br />
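The soft-margin idea can be sketched by minimizing the regularized hinge loss <math>\frac{\lambda}{2}\|\boldsymbol{\beta}\|^2 + \frac{1}{n}\sum_i \max(0,\, 1-y_i(\boldsymbol{\beta}^T\mathbf{x}_i+\beta_0))</math> with subgradient descent — an illustrative stand-in for the usual quadratic-programming formulation, on made-up 2-D data with overlapping classes:<br />

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-D data: two overlapping Gaussian classes, labels in {-1, +1}
n = 200
X = np.vstack([rng.normal(-1.0, 1.0, (n // 2, 2)),
               rng.normal(+1.0, 1.0, (n // 2, 2))])
y = np.hstack([-np.ones(n // 2), np.ones(n // 2)])

lam, lr = 0.01, 0.1               # regularization strength, step size
beta, b = np.zeros(2), 0.0
for _ in range(2000):
    margins = y * (X @ beta + b)
    active = margins < 1          # points inside the margin or misclassified
    # Subgradient of the regularized mean hinge loss
    grad_beta = lam * beta - (y[active][:, None] * X[active]).sum(axis=0) / n
    grad_b = -y[active].sum() / n
    beta -= lr * grad_beta
    b -= lr * grad_b

accuracy = np.mean(np.sign(X @ beta + b) == y)
```

Because the classes overlap, some training points necessarily violate the margin; the `active` set plays the role of the slack variables in the soft-margin formulation.<br />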
<br />
Support Vector Machines are motivated by the idea of training linear machines with margins. It involves preprocessing the data to represent patterns in a high dimension (generally much higher than the original feature space). Note that using a suitable non-linear mapping to a sufficiently high dimensional space, the data will always be separable. (p. 263) <ref>[Duda, Richard O., Hart, Peter E., Stork, David G. "Pattern Classification". Second Edition. John Wiley & Sons, 2001.]</ref><br />
<br />
A suitable way to describe the interest in SVM can be seen in the following quote. "The problem which drove the initial development of SVMs occurs in several guises - the bias variance tradeoff (Geman, Bienenstock and Doursat, 1992), capacity control (Guyon et al., 1992), overfitting (Montgomery and Peck, 1992) - but the basic idea is the same. Roughly speaking, for a given learning task, with a given finite amount of training data, the best generalization performance will be achieved if the right balance is struck between the accuracy attained on that particular training set, and the “capacity” of the machine, that is, the ability of the machine to learn any training set without error. A machine with too much capacity is like a botanist with a photographic memory who, when presented with a new tree, concludes that it is not a tree because it has a different number of leaves from anything she has seen before; a machine with too little capacity is like the botanist’s lazy brother, who declares that if it’s green, it’s a tree. Neither can generalize well. The exploration and formalization of these concepts has resulted in one of the shining peaks of the theory of statistical learning (Vapnik, 1979)." [http://research.microsoft.com/pubs/67119/svmtutorial.pdf A Tutorial on Support Vector Machines for Pattern Recognition]<br />
<br />
===== Support Vector Method Solving Real-world Problems=====<br />
<br />
No matter whether the training data are linearly-separable or not, the linear boundary produced by any of the versions of SVM is calculated using only a small fraction of the training data rather than using all of the training data points. This is much like the difference between the median and the mean. <br />
<br />
SVM can also be considered a special case of [http://en.wikipedia.org/wiki/Tikhonov_regularization Tikhonov regularization]. A special property is that they simultaneously minimize the empirical classification error and maximize the geometric margin; hence they are also known as maximum margin classifiers. The key features of SVM are the use of kernels, the absence of local minima, the sparseness of the solution (i.e. few training data points are needed to construct the linear decision boundary) and the capacity control obtained by optimizing the margin.(Shawe-Taylor and Cristianini (2004)). <br />
<br />
Another key feature of SVM, as discussed below, is the use of [http://en.wikipedia.org/wiki/Slack_variable slack variables] to control the amount of tolerable misclassification on the training data, which form the soft margin SVM. This key feature can serve to improve the generalization of SVM to new data. SVM has been used successfully in many real-world problems:<br />
<br />
- Pattern Recognition, such as Face Detection , Face Verification, Object Recognition, Handwritten Character/Digit Recognition, Speaker/Speech Recognition, Image Retrieval , Prediction;<br />
<br />
- Text and Hypertext categorization;<br />
<br />
- Image classification;<br />
<br />
- Bioinformatics, such as Protein classification, Cancer classification;<br />
<br />
Please refer to [http://www.clopinet.com/isabelle/Projects/SVM/applist.html here] for more applications.<br />
<br />
===== Structural Risk Minimization and VC Dimension =====<br />
<br />
Linear learning machines are the fundamental formulations of SVMs. The objective of the linear learning machine is to find the linear function that minimizes the generalization error from a set of functions which can approximate the underlying mapping between the input and output data. Consider a learning machine that implements linear functions in the plane as decision rules<br />
<br />
<math>f(\mathbf{x},\boldsymbol{\beta}, \beta_0)=sign (\boldsymbol{\beta}^T\mathbf{x}+\beta_0)</math><br />
<br />
<br />
Suppose we are given ''n'' training data points with input values <math>\mathbf{x}_i \in \mathbb{R}^d</math> and output values <math>y_i\in\{-1,+1\}</math>. The empirical error is defined as<br />
<br />
<math>\Re_{emp} (\boldsymbol{\theta}) = \frac{1}{n}\sum_{i=1}^n |y_i-f(\mathbf{x}_i,\boldsymbol{\beta}, \beta_0)|= \frac{1}{n}\sum_{i=1}^n |y_i-sign (\boldsymbol{\beta}^T\mathbf{x}_i+\beta_0)|</math><br />
<br />
<br />
where <math>\boldsymbol{\theta}=(\boldsymbol{\beta}, \beta_0)</math> denotes the parameters of the decision rule.<br />
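As a quick numerical illustration of this formula (the data, <math>\boldsymbol\beta</math> and <math>\beta_0</math> below are toy values chosen for the example, not from the lecture), the empirical error simply counts misclassifications on the training set; note that each misclassified point contributes <math>|y_i - sign(\cdot)| = 2</math> to the sum:

```python
# Empirical error of a linear decision rule f(x) = sign(beta^T x + beta0)
# on toy 2-D data; beta and beta0 are arbitrary illustrative values.

def sign(z):
    return 1 if z >= 0 else -1

def empirical_error(X, y, beta, beta0):
    # (1/n) * sum of |y_i - sign(beta^T x_i + beta0)|;
    # each term is 0 if the point is classified correctly, 2 if not
    total = 0.0
    for x_i, y_i in zip(X, y):
        pred = sign(sum(b * v for b, v in zip(beta, x_i)) + beta0)
        total += abs(y_i - pred)
    return total / len(y)

X = [(2.0, 1.0), (1.5, 2.0), (-1.0, -1.5), (-2.0, -0.5)]
y = [1, 1, -1, -1]

print(empirical_error(X, y, (1.0, 1.0), 0.0))   # 0.0 -- separates all 4 points
print(empirical_error(X, y, (-1.0, 0.0), 0.0))  # 2.0 -- misclassifies all 4
```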
<br />
The generalization error can be expressed as<br />
<br />
<math> \Re (\boldsymbol{\theta}) = \int|y-f(\mathbf{x},\boldsymbol{\theta})|p(\mathbf{x},y)dxdy</math><br />
<br />
which measures the error for all input/output patterns that are generated from the underlying generator of the data characterized by the probability distribution <math>p(\mathbf{x},y)</math> which is considered to be unknown.<br />
According to statistical learning theory, the generalization (test) error can be upper bounded, with probability <math>1-\eta</math>, in terms of the training error and a confidence term as shown below:<br />
<br />
<math>\Re (\boldsymbol{\theta})\leq \Re_{emp} (\boldsymbol{\theta}) +\sqrt{\frac{h(ln(2n/h)+1)-ln(\eta/4)}{n}}</math><br />
<br />
<br />
The term on the left-hand side represents the generalization error. The first term on the right-hand side is the empirical error calculated from the training data, and the second term is called the ''VC confidence'', which is associated with the ''VC dimension'' h of the learning machine. The [http://en.wikipedia.org/wiki/Vc_dimension VC dimension] describes the complexity of the learning system. The relationship between these three terms is illustrated in the figure below:<br />
<br />
<br />
[[File:risk.png|400px|thumb|centre| The relation between expected risk, empirical risk and VC confidence in SVMs.]]<br />
<br />
<br />
Thus, even though we don’t know the underlying distribution from which the data points are generated, it is possible to minimize the upper bound on the generalization error in place of the generalization error itself; that is, one can minimize the right-hand side of the inequality above.<br />
<br />
Unlike the principle of Empirical Risk Minimization (ERM) applied in neural networks, which aims only to minimize the training error, SVMs implement Structural Risk Minimization (SRM). The SRM principle takes both the training error and the complexity of the model into account, and finds the minimum of the sum of these two terms as a trade-off solution (as shown in the figure above) by searching a nested set of functions of increasing complexity.<br />
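For intuition, the VC confidence term above can be evaluated numerically; the values of h, n and <math>\eta</math> below are purely illustrative:

```python
# VC confidence term sqrt( (h*(ln(2n/h) + 1) - ln(eta/4)) / n ) for
# illustrative values of VC dimension h, sample size n and eta
# (the bound holds with probability 1 - eta).
import math

def vc_confidence(h, n, eta):
    return math.sqrt((h * (math.log(2 * n / h) + 1) - math.log(eta / 4)) / n)

# With the sample size fixed, the confidence term grows with the VC
# dimension h, matching the trade-off shown in the figure above.
for h in (5, 50, 500):
    print(h, vc_confidence(h, n=1000, eta=0.05))
```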
<br />
=====Introduction=====<br />
<br />
[http://en.wikipedia.org/wiki/Support_vector_machine Support Vector Machine] is a popular linear classifier. Suppose that we have a data set with two classes which can be separated using a hyper-plane. Support Vector Machine (SVM) is a method which gives us the "best" such hyper-plane. There are other classifiers that find a separating hyper-plane, notably Perceptron. However, the output of Perceptron (and many other algorithms) depends on its initialization, so every run of Perceptron can give a different output. SVM, on the other hand, finds the hyper-plane that separates the data and has the farthest distance from the points. This is also known as the Max-Margin hyper-plane.<br />
<br />
<br />
<gallery><br />
Image:KwebsterIntroDiagram.png|Infinitely many Perceptron solutions<br />
Image:CorrectChoice.png|Out of many how do we choose?<br />
</gallery><br />
<br />
<br />
With Perceptron, there can be infinitely many separating hyperplanes for which the training error is zero. The question is: among all these possible solutions, which one is best? Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training data points of any class (the so-called functional margin), since in general the larger the margin, the lower the generalization error of the classifier. This makes sense because at test time new points will be observed that may lie closer to the other class, so the safest choice of hyperplane is the one farthest from both classes.<br />
<br />
One of the great things about SVM is that not only does it have solid theoretical guarantees, it also works very well in practice. <br />
<br />
'''To summarize'''<br />
<br />
[[Image:Margin.png|right|thumb|What we mean by margin is the distance between the hyperplane and the closest point in a class.]]<br />
<br />
If the data are linearly separable, then there exist infinitely many separating hyperplanes. Among them, the best choice is the hyperplane that is furthest from both classes, i.e. the one with maximum margin. If such a hyperplane exists, it is known as the maximum-margin hyperplane, and the linear classifier it defines is known as a maximum margin classifier, or equivalently, the perceptron of optimal stability.<br />
<br />
What we mean by margin is the distance between the hyperplane and the closest point in a class.<br />
<br />
<!--<br />
If the mean value were to be used instead of the closest point, then an outlier may pull the hyperplane into the data which would incorrectly classify the known data points<br />
<gallery><br />
Image:NotMean.png|This is the reason why we use the closest point instead of the expected value.<br />
</gallery><br />
--><br />
[[Image:NotMean.png|right|thumb|If the mean value were to be used instead of the closest point, then an outlier may pull the hyperplane into the data which would incorrectly classify the known data points. This is the reason why we use the closest point instead of the expected value.]]<br />
<br />
===== Setting=====<br />
<br />
[[Image:Thedis.png|right|thumb|What is <math> d_i </math>]]<br />
<br />
* We assume that the data is linearly separable<br />
* Our classifier will be of the form <math> \boldsymbol\beta^T\mathbf{x} + \beta_0 </math><br />
* We will assume that our labels are <math> y_i \in \{-1,1\} </math><br />
<br />
<br />
<br />
The goal is to classify the point <math> \mathbf{x_i} </math> based on <math>sign(d_i)</math>, where <math>d_i</math> is the signed distance between <math> \mathbf{x_i}</math> and the hyperplane.<br />
<br />
<!-- Comments --><br />
<!--<br />
<gallery><br />
Image:Thedis.png|What is <math> d_i </math><br />
</gallery><br />
--><br />
<br />
Now we check how far this point is from the hyperplane: points on one side of the hyperplane get a negative value, and points on the other side get a positive value. Each point is classified by the sign of this signed distance, so <math>\mathbf{x_i}</math> is classified using <math>d_i</math>.<br />
<br />
===Side Note: A memory from the past of Dr. Ali Ghodsi===<br />
When the aforementioned professor was a small child in grade 2, he was often careless with the accuracy of certain curly brackets when writing what one can only assume were math proofs. One day, his teacher grew impatient and demanded that a page of perfect curly brackets be produced by the young Dr. (he may or may not have been a doctor at the time). And now, whenever Dr. Ghodsi writes a tidy curly bracket, he is reminded of this, and it always brings a smile to his face. <br />
<br />
From memories of the past.<br />
<br />
(the number 20 was involved in the story, either the number of pages or the number of lines)<br />
<br />
===== Case 1: Linearly Separable (Hard Margin) =====<br />
<br />
In this case, the classifier will be <math>\boldsymbol {\beta^T} \boldsymbol {x} + \beta_0 </math> and <math>\ y \in \{-1, 1\} </math>.<br />
The point <math>\boldsymbol {x_i}</math> is classified based on the sign of <math>\ d_i</math>, where <math>\ d_i </math> is the signed distance between <math>\boldsymbol {x_i}</math> and the hyperplane.<br />
<br />
===== Objective Function =====<br />
[[Image:X1X2perpBeta.png|right|thumb|Look at it being perpendicular]]<br />
'''Observation 1:''' <math>\boldsymbol\beta</math> is orthogonal to hyper-plane. Because, for any two arbitrary points <math>\mathbf{x_1, x_2}</math> on the plane we have:<br />
<br />
<math> \boldsymbol\beta^T\mathbf{x_1} + \beta_0 = 0 </math><br />
<br />
<math> \boldsymbol\beta^T\mathbf{x_2} + \beta_0 = 0 </math><br />
<br />
So <math>\boldsymbol\beta^T (\boldsymbol{x_1}-\boldsymbol{x_2}) = 0</math>. Thus, <math> \boldsymbol\beta \perp (\boldsymbol{x_1} - \boldsymbol{x_2}) </math>, which implies that <math>\boldsymbol \beta</math> is a normal vector to the hyper-plane.<br />
<br />
<br />
'''Observation 2:''' If <math>\boldsymbol x_0</math> is a point on the hyper-plane, then there exists a <math>\ \beta_0 </math> such that, <math>\boldsymbol\beta^T\boldsymbol{x_0}+\beta_0 = 0</math>. So <math>\boldsymbol\beta^T\boldsymbol{x_0} = - \beta_0</math>. This along with observation 1 imply there exists a <math>\ \beta_0 </math> such that, <math>\boldsymbol\beta^T\boldsymbol{x} = - \beta_0</math> for all <math> \boldsymbol{x} </math> on the hyperplane.<br />
<br />
<br />
'''Observation 3:''' Let <math>\ d_i</math> be the signed distance of point <math>\boldsymbol{x_i}</math> from the plane. Then <math>\ d_i</math> is the projection of <math>(\boldsymbol{x_i} - \boldsymbol{x_0})</math> onto the direction of <math>\boldsymbol\beta</math>. In other words, <math> d_i \propto \boldsymbol\beta^T(\mathbf{x_i - x_0}) </math>, and after normalizing by <math>\vert\boldsymbol\beta\vert</math>:<br />
<br />
<math><br />
\begin{align}<br />
\displaystyle d_i &= \frac{\boldsymbol\beta^T(\boldsymbol{x_i} - \boldsymbol{x_0})}{\vert \boldsymbol\beta\vert}\\ <br />
& = \frac{\boldsymbol{\beta^Tx_i}- \boldsymbol{\beta^Tx_0}}{\vert \boldsymbol\beta\vert}\\<br />
& = \frac{\boldsymbol{\beta^Tx_i}+ \beta_0}{\vert \boldsymbol\beta\vert}<br />
\end{align}<br />
</math><br />
<br />
<br />
'''Observation 4:''' Let margin be the distance between the hyper-plane and the closest point. Since <math> d_i </math> is the signed distance between the hyperplane and point <math>\boldsymbol{x_i} </math>, we can define the positive distance of point <math>\boldsymbol{x_i} </math> from the hyper-plane as <math>(y_id_i)</math>.<br />
<br />
<math><br />
\begin{align}<br />
\displaystyle \text{Margin} &= \min\{y_i d_i\}\\<br />
&= \min\{ \frac{y_i(\boldsymbol\beta^T\mathbf{x_i} + \beta_0)}{|\boldsymbol\beta|} \}<br />
\end{align}<br />
</math><br />
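A quick numerical check of the signed-distance and margin formulas above (the data and the choice of <math>\boldsymbol\beta</math>, <math>\beta_0</math> are toy values for illustration):

```python
# Signed distance d_i = (beta^T x_i + beta0) / |beta| and the margin
# min_i { y_i d_i }, on illustrative toy data.
import math

def signed_distance(x, beta, beta0):
    norm = math.sqrt(sum(b * b for b in beta))
    return (sum(b * v for b, v in zip(beta, x)) + beta0) / norm

X = [(3.0, 0.0), (0.0, 3.0), (-1.0, -1.0), (-2.0, -2.0)]
y = [1, 1, -1, -1]
beta, beta0 = (1.0, 1.0), 0.0

distances = [signed_distance(x, beta, beta0) for x in X]
margin = min(y_i * d_i for y_i, d_i in zip(y, distances))
print(distances)  # positive on one side of the hyperplane, negative on the other
print(margin)     # distance from the hyperplane to the closest point, here sqrt(2)
```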
<br />
Our goal is to maximize the margin; this is a max-min problem in optimization. When defining the hyperplane, what matters is the direction of <math>\boldsymbol\beta</math>; the value of <math>\beta_0</math> does not change the direction of the hyper-plane, only its distance from the origin. Note that if we assume the points do not lie on the hyper-plane, then the margin is strictly positive:<br />
<br />
<math><br />
\begin{align}<br />
\displaystyle &y_i(\boldsymbol\beta^T\mathbf{x_i} + \beta_0) > 0 &&\\<br />
&y_i(\boldsymbol\beta^T\mathbf{x_i} + \beta_0) \geq C &&\mbox{ for some positive C } \\<br />
&y_i(\frac{\boldsymbol\beta^T}{C}\mathbf{x_i} + \frac{\beta_0}{C}) \geq 1 &&\mbox{ Divide by C}\\<br />
&y_i(\boldsymbol\beta^{*T}\mathbf{x_i} + \beta^*_0) \geq 1 && \mbox{ By setting }\boldsymbol\beta^* = \frac{\boldsymbol\beta}{C}, \boldsymbol\beta_0^* = \frac{\boldsymbol\beta_0}{C}\\<br />
&y_i(\boldsymbol\beta^{T}\mathbf{x_i} + \beta_0) \geq 1 && \mbox{ By setting }\boldsymbol\beta\gets\boldsymbol\beta^*, \boldsymbol\beta_0\gets\boldsymbol\beta_0^*\\<br />
\end{align}<br />
</math><br />
<br />
<br />
So with a bit of abuse of notation we can assume that<br />
<br />
<math> y_i(\boldsymbol\beta^T\mathbf{x_i} + \beta_0) \geq 1 </math><br />
<br />
Therefore, the problem translates to:<br />
: <math>\, \max\{\frac{1}{\vert\boldsymbol\beta\vert}\} \quad</math> s.t. <math>\quad \,y_i (\boldsymbol\beta^{T} \boldsymbol{x_i}+ \beta_0) \geq 1 </math><br />
<br />
So, it is possible to re-interpret the problem as:<br />
<br />
: <math>\, \min \frac 12 \vert \boldsymbol\beta \vert^2 \quad</math> s.t. <math>\quad \,y_i (\boldsymbol\beta^{T} \boldsymbol{x_i}+ \beta_0) \geq 1 </math><br />
<br />
<math>\, \vert \boldsymbol\beta \vert </math> could be any norm, but for simplicity we use the L2 norm. We use <math>\frac 12 \vert \boldsymbol\beta \vert^2</math> instead of <math>\vert\boldsymbol\beta\vert</math> to make the objective differentiable. To solve the above optimization problem we can use '''Lagrange multipliers''', as discussed below.<br />
<br />
=====Support Vectors=====<br />
<br />
Support vectors are the training points that determine the optimal separating hyperplane that we seek. Also, they are the most difficult points to classify and at the same time the most informative for classification.<br />
<br />
=====Visualizing the Cost Function=====<br />
Recall the cost function for a single example in the logistic regression model:<br />
<br />
<math>-\left( y \log \frac{1}{1+e^{-\beta^T \boldsymbol{x}}} + (1-y)\log \frac{e^{-\beta^T\boldsymbol{x}}}{1+e^{-\beta^T \boldsymbol{x}}} \right)</math><br />
<br />
where <math>y \in \{0,1\}</math>. Looking at the plot of the cost term (for y=1), if <math>y=1</math> (i.e. the target class is 1), then we want our <math>\beta</math> to be such that <math>\beta^T \boldsymbol{x} \gg 0</math>. This will ensure very accurate classification.<br />
<br />
[[Image:logreg_cost.jpg|450px]]<br />
<br />
Now for SVM, consider the generic cost function as follows:<br />
<br />
<math>y \cdot \text{cost}_1(\beta^T \boldsymbol{x}) + (1-y)\cdot \text{cost}_0(\beta^T \boldsymbol{x})</math><br />
<br />
We can visualize <math>\text{cost}_1</math> compared with the sigmoid cost term in logistic regression as follows:<br />
<br />
[[Image:svm_cost.jpg|450px]]<br />
<br />
What you should take away from this is that for y=1, we want <math>\beta^T \boldsymbol{x}\ge 1</math>. In our notes we have <math>y \in \{-1, 1\}</math>, which is why we write <math>y_i (\beta^T \boldsymbol{x} + \beta_0) \ge 1</math>.<br />
<br />
The same rationale applies for y=0, using the cost term <math>-\left((1-y)\log \frac{e^{-\beta^T\boldsymbol{x}}}{1+e^{-\beta^T \boldsymbol{x}}}\right)</math><br />
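The replacement of the smooth logistic cost by a piecewise-linear one can be sketched numerically. Here <math>\text{cost}_1</math> is taken to be the standard hinge form <math>\max(0, 1-z)</math> with <math>z = \beta^T\boldsymbol{x}</math>; this particular form is the usual SVM surrogate and is an assumption of the example, since the plots above only show the shape:

```python
# Compare the logistic cost for y = 1, -log(1/(1+e^{-z})), with the
# hinge-style cost_1(z) = max(0, 1 - z) sketched in the SVM figure.
import math

def logistic_cost_y1(z):
    return -math.log(1.0 / (1.0 + math.exp(-z)))

def hinge_cost_y1(z):
    return max(0.0, 1.0 - z)

for z in (-2.0, 0.0, 1.0, 3.0):
    print(z, logistic_cost_y1(z), hinge_cost_y1(z))
# Both costs vanish (or nearly vanish) once z is comfortably >= 1,
# which is why we ask for y_i (beta^T x_i + beta_0) >= 1.
```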
<br />
=====Writing Lagrangian Form of Support Vector Machine =====<br />
<br />
The Lagrangian form, using [http://en.wikipedia.org/wiki/Lagrange_multipliers Lagrange multipliers] and the constraints discussed below, is introduced to ensure that the optimization conditions are satisfied, as well as to find an optimal solution (the optimal saddle point of the Lagrangian for the [http://en.wikipedia.org/wiki/Quadratic_programming classic quadratic optimization]). The problem will be solved in dual space by introducing the <math>\,\alpha_i</math> as dual variables; this is in contrast to solving the problem in primal space as a function of the betas. A [http://www.cs.wisc.edu/dmi/lsvm/ simple algorithm] for iteratively solving the Lagrangian has been found to run well on very large data sets, making SVM more usable. Note that this algorithm is intended to solve Support Vector Machines with some tolerance for errors - not all points are necessarily classified correctly. Several papers by Mangasarian explore different algorithms for solving SVM.<br />
<br />
The Lagrangian function of the above optimization problem is:<br />
<br />
<math><br />
\begin{align}<br />
\displaystyle L(\boldsymbol\beta, \beta_0, \boldsymbol\alpha) &= \frac 12 \vert \boldsymbol\beta \vert^2 - \sum_{i=1}^n \alpha_i \left[ y_i (\boldsymbol{\beta^T x_i}+\beta_0) -1 \right]\\<br />
&= \frac 12 \vert \boldsymbol\beta \vert^2 - \boldsymbol\beta^T \sum_{i=1}^n \alpha_i y_i \boldsymbol{x_i} - \sum_{i=1}^n \alpha_i y_i \beta_0 - \sum_{i=1}^n \alpha_i<br />
\end{align}<br />
</math><br />
<br />
where <math>\boldsymbol\alpha = (\alpha_1 ,\dots ,\alpha_n) </math> are the Lagrange multipliers, with <math> \alpha_{i} \ge 0, \; i=1,\dots,n </math>.<br />
<br />
To find the optimal value, we set the derivatives equal to zero: <math>\,\frac{\partial L}{\partial \boldsymbol{\beta}} = 0</math> and <math>\,\frac{\partial L}{\partial \beta_0} = 0</math>.<br />
<br />
<math><br />
\begin{align}<br />
\displaystyle &\frac{\partial L}{\partial \boldsymbol{\beta}} = \boldsymbol\beta - \sum_{i=1}^n \alpha_i y_i \boldsymbol{x_i} = 0 &\Longrightarrow& \boldsymbol\beta = \sum_{i=1}^n \alpha_i y_i\boldsymbol{x_i}\\<br />
&\frac{\partial L}{\partial \beta_0} = - \sum_{i=1}^n \alpha_i y_i = 0 &\Longrightarrow& \sum_{i=1}^n \alpha_i y_i = 0 <br />
\end{align}<br />
</math><br />
<br />
To get the dual form of the optimization problem we replace the above two equations in definition of <math>L(\boldsymbol\beta, \beta_0, \boldsymbol\alpha)</math>. <br />
<br />
We have:<br />
<math><br />
\begin{align}<br />
\displaystyle L(\boldsymbol\beta, \beta_0, \boldsymbol\alpha) &= \frac 12 \boldsymbol\beta^T\boldsymbol\beta - \boldsymbol\beta^T \sum_{i=1}^n \alpha_i y_i \boldsymbol{x_i} - \sum_{i=1}^n \alpha_i y_i \beta_0 - \sum_{i=1}^n \alpha_i\\<br />
&= \frac 12 \boldsymbol\beta^T \sum_{i=1}^n \alpha_i y_i\boldsymbol{x_i} - \boldsymbol\beta^T \sum_{i=1}^n \alpha_i y_i\boldsymbol{x_i} - 0 + \sum_{i=1}^n \alpha_i\\<br />
&= - \frac 12 \boldsymbol\beta^T \sum_{i=1}^n \alpha_i y_i\boldsymbol{x_i} + \sum_{i=1}^n \alpha_i\\<br />
&= - \frac 12 \sum_{i=1}^n \alpha_i y_i\boldsymbol{x_i}^T \sum_{j=1}^n \alpha_j y_j\boldsymbol{x_j} + \sum_{i=1}^n \alpha_i\\<br />
&= \sum_{i=1}^n \alpha_i - \frac 12 \sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_jy_iy_j\boldsymbol{x_i}^T\boldsymbol{x_j}<br />
\end{align}<br />
</math><br />
<br />
The above function is the dual objective function, which we maximize:<br />
<br />
<math><br />
\begin{align}<br />
\displaystyle \max_\alpha &\sum_{i=1}^n \alpha_i - \frac 12 \sum_{i=1}^n\sum_{j=1}^n \alpha_i \alpha_j y_i y_j \boldsymbol{x_i}^T \boldsymbol{x_j}\\<br />
s.t.\; & \alpha_i \geq 0\\<br />
& \sum_{i=1}^n \alpha_i y_i = 0<br />
\end{align}<br />
</math><br />
<br />
The dual is a quadratic function of several variables subject to linear constraints; this kind of optimization problem is called quadratic programming, and it is much easier to solve than the primal. It is also possible to write the dual form using matrices:<br />
<br />
<math><br />
\begin{align}<br />
\displaystyle \max_\alpha \,& \boldsymbol\alpha^T\boldsymbol{1} - \frac 12 \boldsymbol\alpha^T S \boldsymbol\alpha\\<br />
s.t.\; & \boldsymbol\alpha \geq 0\\<br />
& \boldsymbol\alpha^Ty = 0\\<br />
& S = ([y_1,\dots, y_n]\odot X)^T ([y_1,\dots, y_n]\odot X)<br />
\end{align}<br />
</math><br />
<br />
<br />
Since <math> S = ([y_1,\dots, y_n]\odot X)^T ([y_1,\dots, y_n]\odot X) </math>, S is a positive semi-definite matrix, so the dual problem is [http://en.wikipedia.org/wiki/Convex_function convex]. It therefore has no local optimum that is not global, which makes the global optimum relatively easy to find.<br />
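As a sanity check of this claim on toy data (the points below are illustrative), we can build <math>S_{ij} = y_i y_j \mathbf{x}_i^T\mathbf{x}_j</math> directly and verify that the quadratic form <math>\boldsymbol\alpha^T S \boldsymbol\alpha</math> is never negative, which is exactly positive semi-definiteness:

```python
# Build S[i][j] = y_i * y_j * x_i^T x_j for toy data and check that
# alpha^T S alpha >= 0 for random alphas (S = M^T M is always PSD).
import random

X = [(3.0, 0.0), (0.0, 3.0), (-1.0, -1.0), (-2.0, -2.0)]
y = [1, 1, -1, -1]
n = len(X)

S = [[y[i] * y[j] * sum(a * b for a, b in zip(X[i], X[j]))
      for j in range(n)] for i in range(n)]

random.seed(0)
for _ in range(100):
    alpha = [random.uniform(-1, 1) for _ in range(n)]
    quad = sum(alpha[i] * S[i][j] * alpha[j]
               for i in range(n) for j in range(n))
    assert quad >= -1e-9  # alpha^T S alpha = |sum_i alpha_i y_i x_i|^2 >= 0
print("all quadratic forms nonnegative")
```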
<br />
This is a much simpler optimization problem and we can solve it by [http://en.wikipedia.org/wiki/Quadratic_programming Quadratic programming]. Quadratic programming (QP) is a special type of mathematical optimization problem. It is the problem of optimizing (minimizing or maximizing) a quadratic function of several variables subject to linear constraints on these variables.<br />
The general form of such a problem is to minimize, with respect to <math>\,x</math>,<br />
: <math>f(x) = \frac{1}{2}x^TQx + c^Tx</math><br />
subject to one or more constraints of the form:<br />
<br />
<math>\,Ax\le b</math>, <math>\,Ex=d</math>.<br />
<br />
A good description of the general QP problem formulation and solution can be found [http://www.me.utexas.edu/~jensen/ORMM/supplements/methods/nlpmethod/S2_quadratic.pdf here].<br />
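To make the generic form concrete, here is a tiny unconstrained instance (Q and c are illustrative values): with no constraints, setting the gradient <math>Qx + c</math> to zero shows the minimizer solves <math>Qx = -c</math>.

```python
# Tiny instance of the generic QP objective f(x) = 1/2 x^T Q x + c^T x
# with no constraints: the minimizer solves Q x = -c.

Q = [[2.0, 0.0],
     [0.0, 4.0]]          # symmetric positive definite
c = [-2.0, -8.0]

# For this diagonal Q the solution is componentwise: x_k = -c_k / Q_kk
x_star = [-c[k] / Q[k][k] for k in range(2)]
print(x_star)  # [1.0, 2.0]

def f(x):
    quad = sum(x[i] * Q[i][j] * x[j] for i in range(2) for j in range(2))
    lin = sum(c[k] * x[k] for k in range(2))
    return 0.5 * quad + lin

# f is larger at nearby points than at the minimizer
assert f(x_star) <= f([1.1, 2.0]) and f(x_star) <= f([1.0, 1.9])
```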
<br />
===== Discussion on the Dual of the Lagrangian =====<br />
As mentioned in the previous section, solving the dual form of the Lagrangian requires quadratic programming. Quadratic programming can be used to minimize a quadratic function subject to a set of constraints. In general, for a problem with N variables, the quadratic programming solution has a computational complexity of <math>\ O(N^3) </math> <br />
<ref name="CMBishop" />. The original problem formulation only has (d+1) variables that need to be found (i.e. the values of <math>\ \beta </math> and <math>\ \beta_0 </math>), where d is the dimensionality of the data points. However, the dual form of the Lagrangian has n variables that need to be found (i.e. all the <math>\ \alpha </math> values), where n is the number of data points. It is likely that n is larger than (d+1) (i.e. the number of data points is larger than the dimensionality of the data plus 1), which makes the dual form of the Lagrangian seem computationally inefficient <ref name="CMBishop" />. However, the dual of the Lagrangian allows the inner product <math>\ x_i^T x_j </math> to be expressed using a kernel formulation which allows the data to be transformed into higher feature spaces and thus allowing seemingly non-linearly separable data points to be separated, which is a highly useful feature described in more detail in the next class <ref name="CMBishop" />.<br />
<br />
===== Support Vector Method Packages=====<br />
<br />
One of the popular Matlab toolboxes for SVM is [http://www.csie.ntu.edu.tw/~cjlin/libsvm/ LIBSVM], developed in the Department of Computer Science and Information Engineering, National Taiwan University, under the supervision of Chih-Chung Chang and Chih-Jen Lin. On this page they provide many different interfaces to LIBSVM (Matlab, C++, Python, Perl, and many other languages), each developed at different institutes by a variety of engineers and mathematicians. The page also includes a thorough introduction to the package and its various parameters.<br />
<br />
A very helpful tool on the [http://www.csie.ntu.edu.tw/~cjlin/libsvm/ LIBSVM] page is a graphical interface for SVM: an applet in which we can draw points for each of the two classes of a classification problem and, by adjusting the SVM parameters, observe the resulting solution.<br />
<br />
If you found LIBSVM helpful and wanted to use it for your research, [http://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html#f203 please cite the toolbox].<br />
<br />
A fairly long list of other SVM packages, compared in terms of language, execution platform, and multiclass and regression capabilities, can be found [http://www.cs.ubc.ca/~murphyk/Software/svm.htm here].<br />
<br />
Three of the most popular SVM packages are:<br />
<br />
1. LIBSVM<br />
<br />
2. SVMlight<br />
<br />
3. SVMTorch<br />
<br />
More information introducing SVM software, with comparisons, can be found [http://www.svms.org/software.html here] and [http://www.support-vector-machines.org/SVM_soft.html here].<br />
<br />
== Support Vector Machine Continued (Lecture: Nov. 1, 2011) ==<br />
<br />
In the previous lecture we considered the case when data is linearly separable. The goal of the Support Vector Machine classifier is to find the hyperplane that maximizes the margin distance from the hyperplane to each of the two classes. We derived the following optimization problem based on the SVM methodology:<br />
<br />
<math>\, \min_{\beta} \frac{1}{2}{|\boldsymbol{\beta}|}^2</math><br />
<br />
Subject to the constraint: <br />
<br />
<math>\,y_i(\boldsymbol{\beta}^T\mathbf{x}_i+\beta_0)\geq1, \quad y_i \in \{-1,1\} \quad \forall{i} =1, \ldots , n</math><br /><br />
<br />
Notice that the basic SVM handles only two-class problems; extending it to more than two classes requires additional work (e.g. one-vs-rest or one-vs-one schemes). <br />
<br />
This is the primal form of the optimization problem. Then we derived the dual of this problem:<br />
<br />
<math>\, \max_\alpha \quad \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j </math><br />
<br />
Subject to constraints: <br />
<br />
<math>\,\alpha_i\geq 0 </math><br />
<br />
<math>\,\sum_i \alpha_i y_i =0</math><br />
<br />
<br />
This is a quadratic programming problem. QP problems have been thoroughly studied and can be solved efficiently. This particular problem has a convex objective function as well as convex constraints, which guarantees a global optimum even if we use local search algorithms (e.g. gradient descent). These properties are of significant importance for classifiers and are among the most important strengths of the SVM classifier. <br />
<br />
For an easy implementation of SVM that solves the above quadratic optimization problem in R, see<ref><br />
http://cbio.ensmp.fr/~thocking/mines-course/2011-04-01-svm/svm-qp.pdf<br />
</ref><br />
<br />
We can find <math>\,\boldsymbol{\beta}</math> once <math>\,\boldsymbol{\alpha}</math> is found: <br />
<br />
<math>\, \boldsymbol{\beta} = \sum_i \alpha_i y_i \mathbf{x}_i </math><br />
<br />
But in order to find the hyper-plane uniquely we also need to find <math>\,\beta_0</math>. <br />
<br />
When deriving the dual objective function, there is a set of conditions, called the '''KKT''' conditions, that must be satisfied.<br />
<br />
=== Examining KKT Conditions ===<br />
KKT stands for [http://en.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions Karush-Kuhn-Tucker] (initially named after Kuhn and Tucker's work in the 1950's, however, it was later discovered that Karush had stated the conditions back in the late 1930's) <ref name="CMBishop" /><br />
<br />
The K.K.T. conditions are as follows: stationarity, primal feasibility, dual feasibility, and complementary slackness.<br />
<br />
They give us a closer look into the Lagrangian equation and the associated conditions. <br />
<br />
Suppose we want to find <math>\, \min_x f(x)</math> subject to the constraints <math>\, g_i(x)\geq 0 \; \forall{i} </math>. The Lagrangian is then computed as:<br />
<br />
<math>\, \mathcal{L} (x,\alpha_i)=f(x)-\sum_i \alpha_i g_i(x) </math><br />
<br />
If <math> \, x^* </math> is a local minimum of this constrained problem, then the following necessary conditions hold at <math> \, x^* </math>:<br />
<br />
1) '''Stationarity''': <math>\, \frac{\partial \mathcal{L}}{\partial x} (x^*) = 0 </math> that is <math>\, f'(x^*) - \Sigma_i{\alpha_ig'_i(x^*)}=0</math><br />
<br />
2) '''Dual Feasibility''': <math>\, \alpha_i\geq 0 , </math><br />
<br />
3) '''Complementary Slackness''': <math>\, \alpha_i g_i(x^*)=0 , </math><br />
<br />
4) '''Primal Feasibility''': <math>\, g_i(x^*)\geq 0 , </math><br />
<br />
<br />
If any of the above four conditions fails at a candidate point, that point cannot be an optimal solution. <br />
<br />
=====Support Vectors=====<br />
Support vectors are the training points that determine the optimal separating hyperplane that we seek i.e. the margin is calculated as the distance from the hyperplane to the support vectors. Also, they are the most difficult points to classify and at the same time the most informative for classification.<br />
<br />
In our case, the <math>g_i({x})</math> function is:<br />
:<math>\,g_i(x) = y_i(\beta^Tx_i+\beta_0)-1</math><br />
<br />
Substituting <math>\,g_i</math> into KKT condition 3, we get <math>\,\alpha_i[y_i(\beta^Tx_i+\beta_0)-1] = 0</math>. <br />In order for this condition to be satisfied, either <br /><math>\,\alpha_i= 0</math> or <br /><math>\,y_i(\beta^Tx_i+\beta_0)=1</math><br />
<br />
All points <math>\,x_i</math> satisfy <math>y_i(\beta^T \boldsymbol{x_i} + \beta_0) \ge 1</math>, so in these scaled units every point lies at least 1 unit from the hyperplane; <math>y_i(\beta^T \boldsymbol{x_i} + \beta_0)</math> is the (scaled) projected distance of <math>\,x_i</math> in the direction of its target class.<br />
<br />
'''Case 1: a point away from the margin'''<br />
<br />
If <math>\,y_i(\beta^Tx_i+\beta_0) > 1 \Rightarrow \alpha_i = 0</math>.<br />
<br />
In other words, if point <math>\, x_i</math> is not on the margin (i.e. <math>\boldsymbol{x_i}</math> is not a support vector), then the corresponding <math>\,\alpha_i=0</math>.<br />
<br />
'''Case 2: a point on the margin'''<br />
<br />
If <math>\,y_i(\beta^Tx_i+\beta_0) = 1 \Rightarrow \alpha_i > 0 </math>.<br />
<br />If point <math>\, x_i</math> is on the margin (i.e. <math>\boldsymbol{x_i}</math> is a support vector), then the corresponding <math>\,\alpha_i>0</math>.<br />
<br />
<br />
Points on the margin, with corresponding <math>\,\alpha_i > 0</math>, are called '''''support vectors'''''.<br />
<br />
Since it is impossible to know a priori which of the training data points will end up as support vectors, it is necessary to work with the entire training set to find the optimal hyperplane. Usually only a small number of points end up as support vectors, which makes the SVM model sparse and robust to new data.<br />
<br />
<br />
To compute <math>\ \beta_0</math>, we choose any <math>i</math> with <math>\,\alpha_i > 0</math>; such a point satisfies:<br />
<br />
<math>\,y_i(\beta^Tx_i+\beta_0) = 1</math>.<br />
<br />
We can compute <math>\,\beta = \sum_i \alpha_i y_i x_i </math>, substitute <math>\ \beta</math> into <math>\,y_i(\beta^Tx_i+\beta_0) = 1</math> and solve for <math>\ \beta_0</math> (since <math>\,y_i^2 = 1</math>, this gives <math>\,\beta_0 = y_i - \beta^Tx_i</math>).<br />
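Putting the pieces together, here is a minimal end-to-end sketch on toy separable data: a crude pairwise (SMO-style) coordinate ascent on the dual, followed by recovering <math>\beta</math> and <math>\beta_0</math> from a support vector. The data, the restriction to opposite-label pairs, and the iteration count are all illustrative assumptions; a real implementation would use a proper QP solver such as LIBSVM.

```python
# Solve the hard-margin dual on toy data with pairwise updates that
# preserve sum(alpha_k y_k) = 0, then recover beta and beta_0.
# Illustrative only -- not a production QP solver.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

X = [(1.0, 1.0), (2.0, 2.5), (-1.0, -1.0), (-2.0, -1.5)]
y = [1, 1, -1, -1]
n = len(X)
K = [[dot(X[i], X[j]) for j in range(n)] for i in range(n)]

alpha = [0.0] * n
for _ in range(200):                       # sweep over pairs of points
    for i in range(n):
        for j in range(n):
            if y[i] == y[j]:               # opposite labels keep the
                continue                   # equality constraint intact
            # raise alpha_i and alpha_j together by t; the dual is
            # quadratic in t with gradient g_i + g_j and curvature curv
            g_i = 1 - y[i] * sum(alpha[k] * y[k] * K[i][k] for k in range(n))
            g_j = 1 - y[j] * sum(alpha[k] * y[k] * K[j][k] for k in range(n))
            curv = K[i][i] + K[j][j] + 2 * y[i] * y[j] * K[i][j]
            if curv <= 0:
                continue
            t = (g_i + g_j) / curv
            t = max(t, -min(alpha[i], alpha[j]))   # keep both alphas >= 0
            alpha[i] += t
            alpha[j] += t

beta = [sum(alpha[k] * y[k] * X[k][d] for k in range(n)) for d in range(2)]
sv = max(range(n), key=lambda k: alpha[k])     # any point with alpha_k > 0
beta0 = y[sv] - dot(beta, X[sv])               # from y_sv(beta^T x_sv + b0) = 1
print(alpha, beta, beta0)
```

On this data only the two closest points, (1, 1) and (−1, −1), end up with positive multipliers (the support vectors), and the recovered hyperplane satisfies <math>y_i(\beta^Tx_i+\beta_0) \ge 1</math> for every training point.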
<br />
Everything we derived so far was based on the assumption that the data is linearly separable (termed '''Hard Margin SVM'''), but there are many cases in practical applications that the data is not linearly separable.<br />
<br />
=== Kernel Trick ===<br />
<br />
[[File:Kerneltrick.JPG|500px|thumb|right|An example of mapping 2D space into 3D such that the inseparable red o's and the blue +'s in 2D space can be separated when mapped into 3D space <ref><br />
Jordan (2004). ''The Kernel Trick.'' [Lecture]. Available: [http://www.cs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec3.pdf]</ref>]]<br />
<br />
We talked about the curse of dimensionality at the beginning of this course. However, we now turn to the power of high dimensions in order to find a hyperplane between two classes of data points that can linearly separate the transformed (mapped) data in a space that has a higher dimension than the space in which the training data points reside. <br />
<br />
To understand this, imagine a two-dimensional prison in which a two-dimensional person is confined. Suppose we magically give the person a third dimension; then he can escape from the prison. In other words, the prison and the person are now linearly separable with respect to the third dimension. The intuition behind the [http://www.cs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec3.pdf kernel trick] is to map the data to a higher dimension in which the mapped data are linearly separable by a hyperplane, even if the original data are not linearly separable.<br />
<br />
The original optimal hyperplane algorithm proposed by [http://en.wikipedia.org/wiki/Vladimir_Vapnik Vladimir Vapnik] in 1963 was a linear classifier. However, in 1992, Bernhard Boser, Isabelle Guyon and Vapnik suggested a way to create non-linear classifiers by applying the kernel trick to maximum-margin hyperplanes. The algorithm is very similar, except that every dot product is replaced by a non-linear kernel function as below. This allows the algorithm to fit the maximum-margin hyperplane in a transformed feature space. We have seen SVM as a linear classification problem that finds the maximum margin hyperplane in the given input space. However, for many real world problems a more complex decision boundary is required. The following simple method was devised in order to solve the same linear classification problem but in a higher dimensional space, a [http://en.wikipedia.org/wiki/Feature_space feature space], under which the maximum margin hyperplane is better suited.<br />
<br />
In machine learning, the kernel trick is a way of mapping points into an inner product space, hoping that the new space is more suitable for classification. <br />
<math>\phi</math> is a function that maps <math>m</math>-dimensional data to a higher-dimensional space, so that data which are not linearly separable in the original space may become linearly separable after the mapping.<br />
Example:<br />
<br />
<math> \left[\begin{matrix}<br />
\,x \\<br />
\,y \\<br />
\end{matrix}\right] \rightarrow\ \left[\begin{matrix}<br />
\,x^2 \\<br />
\,y^2 \\<br />
\, \sqrt{2}xy \\<br />
\end{matrix}\right]</math><br />
<br />
<math>k(x,y)=\phi^{T}(x)\phi(y)</math><br />
<br />
<math> \left[\begin{matrix}<br />
\,x_1 \\<br />
\,y_1 \\<br />
\end{matrix}\right] \rightarrow\ \left[\begin{matrix}<br />
\,x_1^2 \\<br />
\,y_1^2 \\<br />
\, \sqrt{2}x_1y_1 \\<br />
\end{matrix}\right]</math><br />
<br />
<math> \left[\begin{matrix}<br />
\,x_2 \\<br />
\,y_2 \\<br />
\end{matrix}\right] \rightarrow\ \left[\begin{matrix}<br />
\,x_2^2 \\<br />
\,y_2^2 \\<br />
\, \sqrt{2}x_2y_2 \\<br />
\end{matrix}\right]</math><br />
<br />
<br />
<br />
<math> \left[\begin{matrix}<br />
\,x_1^2 \\<br />
\,y_1^2 \\<br />
\, \sqrt{2}x_1y_1 \\<br />
\end{matrix}\right] ^{T} * \left[\begin{matrix}<br />
\,x_2^2 \\<br />
\,y_2^2 \\<br />
\, \sqrt{2}x_2y_2 \\<br />
\end{matrix}\right] = K(\left[\begin{matrix}<br />
\,x_1 \\<br />
\,y_1 \\<br />
\end{matrix}\right],\left[\begin{matrix}<br />
\,x_2 \\<br />
\,y_2 \\<br />
\end{matrix}\right] ) </math><br />
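A quick numeric sanity check of the mapping above (an illustrative sketch, not part of the lecture): with <math>\phi(x,y)=(x^2, y^2, \sqrt{2}xy)</math>, the explicit inner product <math>\phi^T(a)\phi(b)</math> equals <math>(a^Tb)^2</math>, so the kernel can be evaluated in the original 2D space without ever forming <math>\phi</math>.<br />

```python
import math

# phi maps a 2D point to the 3D feature space used in the example above.
def phi(p):
    x, y = p
    return (x * x, y * y, math.sqrt(2) * x * y)

# k evaluates the same inner product directly in the original 2D space.
def k(a, b):
    return (a[0] * b[0] + a[1] * b[1]) ** 2   # (a^T b)^2

a, b = (1.0, 2.0), (3.0, -1.0)
explicit = sum(u * v for u, v in zip(phi(a), phi(b)))
implicit = k(a, b)
# both equal (1*3 + 2*(-1))^2 = 1.0
```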
<br />
Recall our objective function: <math>\sum_i \alpha_i - \frac{1}{2} \sum_{ij} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j</math><br />
We can replace <math> \mathbf{x}_i^T\mathbf{x}_j </math> by <math> \phi^{T}(x_i)\phi(x_j)= k(x_i,x_j) </math>, so the optimization only requires the <math>n \times n</math> matrix of pairwise kernel evaluations:<br />
<br />
<br />
<math> \left[\begin{matrix}<br />
\,k(x_1, x_1)& \,k(x_1, x_2)& \cdots &\,k(x_1, x_n) \\<br />
\vdots& \vdots& \vdots& \vdots\\<br />
\,k(x_n, x_1)& \,k(x_n, x_2)& \cdots &\,k(x_n, x_n) \\<br />
\end{matrix}\right] </math><br />
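Building the <math>n \times n</math> Gram (kernel) matrix shown above is straightforward; here is a small sketch for a toy dataset, reusing the quadratic kernel <math>k(a,b)=(a^Tb)^2</math> from the earlier example (the data points are arbitrary illustrative values).<br />

```python
# Quadratic kernel k(a, b) = (a^T b)^2, evaluated in the original space.
def k(a, b):
    return sum(u * v for u, v in zip(a, b)) ** 2

# Gram matrix: entry (i, j) is kernel(points[i], points[j]).
def gram_matrix(points, kernel):
    n = len(points)
    return [[kernel(points[i], points[j]) for j in range(n)] for i in range(n)]

X = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
K = gram_matrix(X, k)
# K is symmetric: K[i][j] == K[j][i] for all i, j
```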
<br />
<br />
In most real world cases the data points are not linearly separable. How can the above methods be generalized to the case where the decision function is not a linear function of the data? Boser, Guyon and Vapnik (1992) showed that a rather old trick (Aizerman, 1964) can be used to accomplish this in an astonishingly straightforward way. First notice that the only way in which the data appear in the dual-form optimization problem is in the form of dot products: <math>\mathbf{x}_i^T\mathbf{x}_j</math>. Now suppose we first use a non-linear operator <math> \Phi(\mathbf{x}) </math> to map the data points to some other higher dimensional space (possibly infinite dimensional) <math> \mathcal{H} </math> (called Hilbert space or feature space), where they can be classified linearly. The figure below illustrates this concept:<br />
<br />
<br />
[[File:kernell trick.jpg|500px|thumb|centre|Mapping of not-linearly separable data points in a two-dimensional space to a three-dimensional space where they can be linearly separable by means of a kernel function.]]<br />
<br />
<br />
In other words, a linear learning machine can be employed in the higher dimensional feature space to solve the original non-linear problem. Then of course the training algorithm would only depend on the data through dot products in <math> \mathcal{H} </math>, i.e. on functions of the form <math><\Phi (\mathbf{x}_i),\Phi (\mathbf{x}_j)> </math>. Note that the actual mapping <math> \Phi(\mathbf{x}) </math> does not need to be known, only the inner product of the mapping is needed for modifying the support vector machine such that it can separate non-linearly separable data. Avoiding the actual mapping to the higher dimensional space is preferable, because higher dimensional spaces may have problems due to the ''curse of dimensionality''.<br />
<br />
So the hypothesis in this case would be<br />
<br />
<math>f(\mathbf{x}) = \boldsymbol{\beta}^T \Phi (\mathbf{x}) + \beta_0</math><br />
<br />
which is linear in terms of the new space that <math> \Phi (\mathbf{x}) </math> maps the data to, but non-linear in the original space. Now we can extend all the presented optimization problems for the linear case, for the transformed data in the feature space. If we define the kernel function as<br />
<br />
<math> K (\mathbf{x}_i,\mathbf{x}_j) = <\Phi (\mathbf{x}_i),\Phi (\mathbf{x}_j)> = \Phi(\mathbf{x}_i)^T \Phi (\mathbf{x}_j)</math><br />
<br />
where <math>\ \Phi </math> is a mapping from input space to an (inner product) feature space. Then the corresponding dual form is<br />
<br />
<br />
<math>L(\boldsymbol{\alpha}) =\sum_{i=1}^n \alpha_i - \frac 12 \sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_jy_iy_j K (\mathbf{x}_i,\mathbf{x}_j)</math><br />
<br />
subject to <math>\sum_{i=1}^n \alpha_i y_i=0 \quad \quad \alpha_i \geq 0,\quad i=1, \cdots, n</math><br />
<br />
<br />
The cost function <math> L(\boldsymbol{\alpha}) </math> is convex and quadratic in terms of the unknown parameters. This problem is solved through quadratic programming. The [http://en.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions KKT] conditions for this equation lead to the following final decision rule:<br />
<br />
<math> L(\mathbf{x}, \boldsymbol{\alpha}^{\ast}, \beta_0) =\sum_{i=1}^{N_{sv}} y_i \alpha_i^{\ast} K (\mathbf{x}_i,\mathbf{x}) + \beta_0</math><br />
<br />
<br />
where <math>\ N_{sv} </math> and <math>\ \alpha_i^{\ast}</math> denote the number of support vectors and the non-zero Lagrange multipliers corresponding to the support vectors, respectively. <br />
<br />
Several typical choices of kernels are linear, polynomial, Sigmoid or Multi-Layer Perceptron (MLP) and Gaussian or Radial Basis Function (RBF) kernel. Their expressions are as following:<br />
<br />
Linear kernel: <math> K (\mathbf{x}_i,\mathbf{x}_j) = \mathbf{x}_i^T\mathbf{x}_j</math> <br />
<br />
Polynomial kernel: <math> K (\mathbf{x}_i,\mathbf{x}_j) = (1 + \mathbf{x}_i^T\mathbf{x}_j)^p</math> <br />
<br />
Sigmoid (MLP) kernel: <math> K (\mathbf{x}_i,\mathbf{x}_j) = \tanh (k_1\mathbf{x}_i^T\mathbf{x}_j +k_2)</math> <br />
<br />
Gaussian (RBF) kernel: <math>\ K(\mathbf{x}_i,\mathbf{x}_j) = \exp\left[\frac{-(\mathbf{x}_i - \mathbf{x}_j)^T (\mathbf{x}_i - \mathbf{x}_j)}{2\sigma^2 }\right]</math> <br />
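The four kernels listed above can be written directly as functions of the dot product or squared distance; the sketch below implements them for plain Python tuples (the parameters <code>p</code>, <code>k1</code>, <code>k2</code> and <code>sigma</code> are free choices, not values fixed by the lecture).<br />

```python
import math

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def linear(a, b):
    return dot(a, b)                          # x_i^T x_j

def polynomial(a, b, p=2):
    return (1 + dot(a, b)) ** p               # (1 + x_i^T x_j)^p

def sigmoid(a, b, k1=1.0, k2=0.0):
    return math.tanh(k1 * dot(a, b) + k2)     # tanh(k1 x_i^T x_j + k2)

def rbf(a, b, sigma=1.0):
    sq = sum((u - v) ** 2 for u, v in zip(a, b))
    return math.exp(-sq / (2 * sigma ** 2))   # exp(-|x_i - x_j|^2 / 2 sigma^2)

a, b = (1.0, 2.0), (0.0, 1.0)
```

Note that <code>rbf(a, a)</code> is always 1, since the squared distance of a point to itself is zero.<br />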
<br />
<br />
Kernel functions satisfying [http://en.wikipedia.org/wiki/Mercer%27s_condition Mercer's conditions] not only enable implicit mapping of the data from input space to feature space, but also ensure the convexity of the cost function, which leads to a unique optimum. Mercer's condition states that a continuous symmetric function <math> K(\mathbf{x},\mathbf{y}) </math> must be positive semi-definite to be a kernel function that can be written as an inner product between data pairs. Note that we only need to use <math>\ K </math> in the training algorithm, and never need to explicitly know what <math>\ \Phi </math> is. <br />
<br />
Furthermore, one can construct new kernels from previously defined kernels.[http://www.cc.gatech.edu/~ninamf/ML10/lect0309.pdf] Given two kernels <math>K_1 (\mathbf{x}_i,\mathbf{x}_j)</math> and <math>K_2 (\mathbf{x}_i,\mathbf{x}_j)</math>, properties include:<br />
<br />
1. <math>K (\mathbf{x}_i,\mathbf{x}_j) = \alpha K_1 (\mathbf{x}_i,\mathbf{x}_j) + \beta K_2 (\mathbf{x}_i,\mathbf{x}_j) </math> for <math> \alpha , \beta \geq 0 </math><br />
<br />
2. <math>K (\mathbf{x}_i,\mathbf{x}_j) = K_1 (\mathbf{x}_i,\mathbf{x}_j) K_2 (\mathbf{x}_i,\mathbf{x}_j) </math> <br />
<br />
3. <math>K (\mathbf{x}_i,\mathbf{x}_j) = K_1 (f ( \mathbf{x}_i ) ,f ( \mathbf{x}_j ) ) </math> where <math>\, f \colon X \rightarrow X </math><br />
<br />
4. <math>K (\mathbf{x}_i,\mathbf{x}_j) = f ( K_1 ( \mathbf{x}_i , \mathbf{x}_j ) ) </math> where <math>\, f </math> is a polynomial with positive coefficients.<br />
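As a small sketch of closure rules 1 and 2 above: combining two valid kernels by a nonnegative-weighted sum, or by a product, yields another kernel. The weights and the particular kernels below are arbitrary illustrative choices.<br />

```python
def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def k1(a, b):                      # linear kernel
    return dot(a, b)

def k2(a, b):                      # quadratic polynomial kernel
    return (1 + dot(a, b)) ** 2

def weighted_sum_kernel(a, b, w1=0.5, w2=2.0):   # rule 1, with w1, w2 >= 0
    return w1 * k1(a, b) + w2 * k2(a, b)

def product_kernel(a, b):                        # rule 2
    return k1(a, b) * k2(a, b)
```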
<br />
<br />
In the case of the Gaussian or RBF kernel, for example, <math> \mathcal{H} </math> is infinite dimensional, so it would not be very easy to work with <math> \Phi </math> explicitly. However, if one replaces <math> <\Phi(\mathbf{x}_i), \Phi(\mathbf{x}_j)> </math> by <math> K (\mathbf{x}_i,\mathbf{x}_j) </math> everywhere in the training algorithm, the algorithm will happily produce a support vector machine which lives in an infinite dimensional space, and furthermore do so in roughly the same amount of time it would take to train on the un-mapped data. All the considerations of the previous sections hold, since we are still doing a linear separation, but in a different space.<br />
<br />
<br />
The choice of which kernel is best for a particular application has to be determined through trial and error. In practice, the Gaussian (RBF) kernel is often a good default choice for SVM classification tasks.<br />
<br />
<br />
The video below gives a graphical illustration of how a polynomial kernel works, to get a better sense of the kernel concept:<br />
<br />
[http://www.youtube.com/watch?v=3liCbRZPrZA Mapping data points to a higher dimensional space using a polynomial kernel]<br />
<br />
====Kernel Properties====<br />
Kernel functions must be continuous, symmetric, and most preferably should have a positive (semi-)definite Gram matrix. The Gram matrix is the matrix whose elements are <math>\ g_{ij} = K(x_i,x_j) </math>. Kernels which satisfy Mercer's theorem are positive semi-definite, meaning their kernel matrices have no negative eigenvalues. The use of a positive definite kernel ensures that the optimization problem will be convex and the solution will be unique. <ref> Reference:http://crsouza.blogspot.com/2010/03/kernel-functions-for-machine-learning.html#kernel_properties</ref><br />
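Positive semi-definiteness of a Gram matrix <math>G</math> means <math>z^T G z \geq 0</math> for every vector <math>z</math>. The sketch below spot-checks this for the Gaussian kernel on random data; sampling random <math>z</math>'s is only a necessary-condition sanity check, not a proof of Mercer's condition.<br />

```python
import math, random

def rbf(a, b, sigma=1.0):
    sq = sum((u - v) ** 2 for u, v in zip(a, b))
    return math.exp(-sq / (2 * sigma ** 2))

# quadratic form z^T G z for a square matrix G
def quad_form(G, z):
    n = len(G)
    return sum(z[i] * G[i][j] * z[j] for i in range(n) for j in range(n))

random.seed(0)
X = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(5)]
G = [[rbf(a, b) for b in X] for a in X]
# check z^T G z >= 0 for 100 random vectors z (up to round-off)
ok = all(quad_form(G, [random.gauss(0, 1) for _ in range(5)]) >= -1e-9
         for _ in range(100))
# ok is True for the RBF Gram matrix
```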
<br />
<br />
Furthermore, kernels can be categorized into classes based on their properties <ref name="Genton"> M. G. Genton, "Classes of Kernels for Machine Learning: A Statistics Perspective," ''Journal of Machine Learning Research 2'', 2001</ref>:<br />
* ''Nonstationary kernels'' are explicitly dependent on both inputs (e.g., the polynomial kernel).<br />
* ''Stationary kernels'' are invariant to translation (e.g., the Gaussian kernel which only looks at the distance between the inputs).<br />
* ''Reducible kernels'' are nonstationary kernels that can be reduced to stationary kernels via a bijective deformation (for more detailed information see <ref name = "Genton" />).<br />
<br />
====Further Information of Kernel Functions====<br />
<br />
In class we have studied three kernel functions: the linear, polynomial and Gaussian kernels. Some properties of each:<br />
# '''Linear Kernel''' is the simplest kernel. Algorithms using this kernel are often equivalent to non-kernel algorithms such as standard PCA<br />
# '''Polynomial Kernel''' is a non-stationary kernel, well suited when training data is normalized.<br />
# '''Gaussian Kernel''' is an example of radial basis function kernel.<br />
<br />
When choosing a kernel we need to take into account the data we are trying to model. For example, data that clusters in circles (or hyperspheres) is better classified by Gaussian Kernel.<br />
<br />
Beyond the kernel functions we discussed in class, such as Linear Kernel, Polynomial Kernel and Gaussian Kernel functions, many more kernel functions can be used in the application of kernel methods for machine learning. <br />
<br />
Some examples are: Exponential Kernel, Laplacian Kernel, ANOVA Kernel, Hyperbolic Tangent (Sigmoid) Kernel, Rational Quadratic Kernel, Multiquadric Kernel, Inverse Multiquadric Kernel, Circular Kernel, Spherical Kernel, Wave Kernel, Power Kernel, Log Kernel, Spline Kernel, B-Spline Kernel, Bessel Kernel, Cauchy Kernel, Chi-Square Kernel, Histogram Intersection Kernel, Generalized Histogram Intersection Kernel, Generalized T-Student Kernel, Bayesian Kernel, Wavelet Kernel, etc. <br />
<br />
You may visit http://crsouza.blogspot.com/2010/03/kernel-functions-for-machine-learning.html#kernel_functions for more information.<br />
<br />
=== Case 2: Linearly Non-Separable Data (Soft Margin) ===<br />
<br />
The original SVM was designed specifically for separable data. Since this is a very strong requirement, Vladimir Vapnik and Corinna Cortes later suggested removing it; the result is called the Soft Margin Support Vector Machine. One of the advantages of SVM is that it is relatively easy to generalize to the case where the data is not linearly separable.<br />
<br />
When the two classes are not linearly separable, it is impossible to find a hyperplane that completely separates them. In this case the idea is to minimize the number of points that cross the margin and are misclassified, i.e. the points that violate the constraint:<br />
<br />
<math>\, y_i(\beta^T x_i + \beta_0) \geq 1</math><br />
<br />
Hence we allow some of the points to cross the margin (or equivalently violate our constraint) but on the other hand we penalize our objective function (so that the violations of the original constraint remains low): <br />
<br />
<math>\, \min \left(\frac{1}{2} |\beta|^2 +\gamma \sum_i \zeta_i\right) </math><br />
<br />
And now our constraint is as follows: <br />
<br />
<math>\, y_i(\beta^T x_i + \beta_0) \geq 1-\zeta_i</math><br />
<br />
<math>\, \zeta_i \geq 0</math><br />
<br />
We have to check that all '''KKT''' conditions are satisfied: <br />
<br />
<math>\, \mathcal{L}(\beta,\beta_0,\zeta_i,\alpha_i,\lambda_i)=\frac{1}{2}|\beta|^2+\gamma \sum_i \zeta_i -\sum_i \alpha_i[y_i(\beta^T x_i +\beta_0)-(1-\zeta_i)] - \sum_i \lambda_i \zeta_i</math><br />
<br />
<math>\, 1) \frac{\partial\mathcal{L}}{\partial \beta}=\beta-\sum_i \alpha_i y_i x_i \rightarrow \beta=\sum_i \alpha_i y_i x_i</math><br />
<br />
<math>\, 2) \frac{\partial\mathcal{L}}{\partial \beta_0}=\sum_i \alpha_i y_i =0</math><br />
<br />
<br />
<math>\, 3) \frac{\partial\mathcal{L}}{\partial \zeta_i}=\gamma - \alpha_i - \lambda_i = 0 \rightarrow \gamma = \alpha_i + \lambda_i </math><br />
<br />
Now we substitute these conditions back into the Lagrangian to obtain the dual form.<br />
<br />
== Support Vector Machine Continued (Lecture: Nov. 3, 2011) ==<br />
<br />
=== Case 2: Linearly Non-Separable Data (Soft Margin [http://fourier.eng.hmc.edu/e161/lectures/svm/node5.html]) Continued ===<br />
<br />
Recall from last time that soft margins are used instead of hard margins when we are using SVM to classify data points that are '''not''' linearly separable. <br />
<br />
===== Soft Margin SVM Derivation of Dual =====<br />
<br />
The soft-margin SVM optimization problem is defined as:<br />
<br />
<math>\min \{\frac{1}{2}|\boldsymbol{\beta}|^2 + \gamma\sum_i \zeta_i\}</math> <br />
<br />
subject to the constraints<br />
<math>y_i(\boldsymbol{\beta}^T \boldsymbol{x_i} + \beta_0) \ge 1-\zeta_i \quad ,\quad \zeta_i \ge 0</math>,<br />
<br />
where <math>\gamma \sum_i \zeta_i</math> is the penalty function that penalizes the slack variables (a point with <math>\,\zeta_i > 0</math> crosses the margin). Note that setting all <math>\,\zeta_i=0</math> recovers the Hard Margin SVM classifier.<br />
<br />
<br />
In other words, we have relaxed the constraint for each <math>\boldsymbol{x_i}</math> so that it can violate the margin by an amount <math>\zeta_i</math>.<br />
As such, we want to make sure that all <math>\zeta_i</math> values are as small as possible. So, we penalize them in the objective function by a factor of some chosen <math>\gamma</math>.<br />
<br />
=====Forming the Lagrangian=====<br />
<br />
In this case we have two constraints in the Lagrangian primal form (on <math>\beta</math> and <math>\zeta</math>) and therefore we optimize with respect to two sets of dual variables <math>\, \alpha</math> and <math>\,\lambda</math>,<br />
<br />
<math><br />
L(\boldsymbol{\beta},\beta_0,\zeta_i,\alpha_i,\lambda_i) = \frac{1}{2} |\boldsymbol{\beta}|^2 + \gamma \sum_i \zeta_i - \sum_i \alpha_i [y_i(\boldsymbol{\beta}^T \boldsymbol{x_i} + \beta_0)-1+\zeta_i] - \sum_i \lambda_i \zeta_i<br />
</math> <br />
<br />
Note the following simplification:<br />
<br />
<math>- \sum_i \alpha_i [y_i(\boldsymbol{\beta}^T \boldsymbol{x_i} + \beta_0)-1+\zeta_i] = -\boldsymbol{\beta}^T\sum_i\alpha_i y_i x_i-\beta_0\sum_i\alpha_iy_i+\sum_i\alpha_i-\sum_i\alpha_i\zeta_i</math><br />
<br />
=====Apply KKT conditions=====<br />
<br />
<math><br />
\begin{align}<br />
1) &\frac{\partial \mathcal{L}}{\partial \boldsymbol{\beta}} = \boldsymbol{\beta}-\sum_i \alpha_i y_i \boldsymbol{x_i} = 0 \\<br />
& \rightarrow \boldsymbol{\beta} = \sum_i \alpha_i y_i \boldsymbol{x_i} \\<br />
&\frac{\partial \mathcal{L}}{\partial \beta_0} = \sum_i \alpha_i y_i = 0 \\<br />
&\frac{\partial \mathcal{L}}{\partial \zeta_i} = \gamma - \alpha_i - \lambda_i = 0 \\<br />
& \rightarrow \gamma = \alpha_i + \lambda_i \\<br />
2) &\text{dual feasibility: } \alpha_i \ge 0, \lambda_i \ge 0 \\<br />
3) &\alpha_i [y_i(\boldsymbol{\beta}^T \boldsymbol{x_i} + \beta_0)-1+\zeta_i] = 0, \text{ and } \lambda_i \zeta_i = 0 \\<br />
4) &y_i(\boldsymbol{\beta}^T \boldsymbol{x_i} + \beta_0) \ge 1-\zeta_i \quad,\quad \zeta_i \ge 0 \\<br />
\end{align}<br />
</math><br />
<br />
=====Objective Function=====<br />
Simplifying the Lagrangian the same way we did with the hard margin case, we get the following:<br />
<br />
<math><br />
\begin{align}<br />
L &= \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \boldsymbol{x_i}^T \boldsymbol{x_j} + \gamma \sum_i \zeta_i - \sum_{i,j} \alpha_i \alpha_j y_i y_j \boldsymbol{x_i}^T \boldsymbol{x_j} - \beta_0 \sum_i \alpha_i y_i + \sum_i \alpha_i - \sum_i \alpha_i \zeta_i - \sum_i \lambda_i \zeta_i \\<br />
&= -\frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \boldsymbol{x_i}^T \boldsymbol{x_j} + \sum_i \alpha_i - 0 + (\sum_i \gamma \zeta_i - \sum_i \alpha_i \zeta_i - \sum_i \lambda_i \zeta_i) \\<br />
&= -\frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \boldsymbol{x_i}^T \boldsymbol{x_j} + \sum_i \alpha_i + \sum_i (\gamma - \alpha_i - \lambda_i) \zeta_i \\<br />
&= -\frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \boldsymbol{x_i}^T \boldsymbol{x_j} + \sum_i \alpha_i<br />
\end{align}<br />
</math><br />
<br />
subject to the constraints:<br />
<br />
<math><br />
\begin{align}<br />
\alpha_i &\ge 0 \\<br />
\sum_i \alpha_i y_i &= 0 \\<br />
\lambda_i &\ge 0<br />
\end{align}<br />
</math><br />
<br />
Notice that the simplified Lagrangian is exactly the same as in the hard margin case. The only difference in the soft margin case is the additional constraint <math>\lambda_i \ge 0</math>. Although <math>\gamma</math> does not appear directly in the objective function, we can discern the following:<br />
<br />
<math>\lambda_i = 0 \implies \alpha_i = \gamma</math><br />
<br />
<math>\lambda_i > 0 \implies \alpha_i < \gamma</math><br />
<br />
Thus, we can derive that the only difference with the soft margin case is the constraint <math>0 \le \alpha_i \le \gamma</math>. This problem can be solved with quadratic programming.<br />
<br />
===== Soft Margin SVM Formulation Summary =====<br />
<br />
In summary, the primal form of the soft-margin SVM is given by:<br />
<br />
<math><br />
\begin{align}<br />
\min_{\boldsymbol{\beta}, \boldsymbol{\zeta}} \quad & \frac{1}{2}|\boldsymbol{\beta}|^2 + \gamma\sum_i \zeta_i \\<br />
\text{s.t. } & y_i(\boldsymbol{\beta}^T \boldsymbol{x_i} + \beta_0) \ge 1-\zeta_i \quad, \quad \zeta_i \ge 0 \qquad i=1,...,M<br />
\end{align}<br />
</math><br />
<br />
<br />
The corresponding dual form which we derived above is:<br />
<br />
<math><br />
\begin{align}<br />
\max_{\boldsymbol{\alpha}} \quad & \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \boldsymbol{x_i}^T \boldsymbol{x_j} \\<br />
\text{s.t. } & \sum_i \alpha_i y_i = 0 \\<br />
& 0 \le \alpha_i \le \gamma, \qquad i=1,...,M<br />
\end{align}<br />
</math><br />
<br />
Note, the soft-margin dual objective is identical to the hard-margin dual objective! The only difference is that now the <math>\,\alpha_i</math> variables cannot be unbounded and are restricted to be at most <math>\,\gamma</math>. This restriction makes the optimization problem feasible when the data is non-separable. In the hard-margin case, when <math>\,\alpha_i</math> is unbounded there may be no finite maximum for the objective and we would not be able to converge to a solution. <br />
<br />
Also note, <math>\,\gamma</math> is a model parameter and must be chosen as a fixed constant. It controls the trade-off between the size of the margin and the amount of violation. In a data set with a lot of noise (or non-separability) you may want to choose a smaller <math>\,\gamma</math> to ensure a large margin. In practice, <math>\,\gamma</math> is chosen by cross-validation, which tests the model on a held-out sample to determine which <math>\,\gamma</math> gives the best result. However, it may be troublesome to work with <math>\,\gamma</math> since <math>\,\gamma \in (0, \infty)</math>. So a variant formulation, known as <math>\,\nu</math>-SVM, is often used, which replaces <math>\,\gamma</math> with a better scaled parameter <math>\,\nu \in (0,1)</math> to balance margin versus separability. <br />
<br />
Finally note that as <math>\,\gamma \rightarrow \infty</math>, the soft-margin SVM converges to hard-margin, as we do not allow any violation.<br />
<br />
=====Soft Margin SVM Problem Interpretation =====<br />
<br />
Like in the case of hard-margin the dual formulation for soft-margin given above allows us to interpret the role of certain points as support vectors. <br />
<br />
We consider three cases:<br />
<br />
'''Case 1:''' <math>\,\alpha_i=\gamma</math><br />
<br />
From KKT condition 1 (third part), <math>\,\gamma - \alpha_i - \lambda_i = 0</math> implies <math>\,\lambda_i = 0</math>.<br />
<br />
From KKT condition 3 (second part), <math>\,\lambda_i \zeta_i = 0</math> no longer forces <math>\,\zeta_i = 0</math>, so we can have <math>\,\zeta_i > 0</math>. <br />
<br />
Thus this is a point that violates the margin, and we say <math>\,x_i</math> is inside the margin.<br />
<br />
'''Case 2:''' <math>\,\alpha_i=0</math><br />
<br />
From KKT condition 1 (third part), <math>\,\gamma - \alpha_i - \lambda_i = 0</math> implies <math>\,\lambda_i > 0</math>.<br />
<br />
From KKT condition 3 (second part) <math>\,\lambda_i \zeta_i = 0</math> this now implies <math>\,\zeta_i = 0</math>. <br />
<br />
Finally, from KKT condition 3 (first part), <math>y_i(\boldsymbol{\beta}^T \boldsymbol{x_i} + \beta_0) > 1-\zeta_i</math>, and since <math>\,\zeta_i = 0</math>, the point is classified correctly and we say <math>\,x_i</math> is outside the margin. In particular, <math>\,x_i</math> does not play a role in determining the classifier and if we ignored it, we would get the same result.<br />
<br />
'''Case 3:''' <math>\,0 < \alpha_i < \gamma</math><br />
<br />
From KKT condition 1 (third part), <math>\,\gamma - \alpha_i - \lambda_i = 0</math> implies <math>\,\lambda_i > 0</math>.<br />
<br />
From KKT condition 3 (second part) <math>\,\lambda_i \zeta_i = 0</math> this now implies <math>\,\zeta_i = 0</math>. <br />
<br />
Finally, from KKT condition 3 (first part), <math>y_i(\boldsymbol{\beta}^T \boldsymbol{x_i} + \beta_0) = 1-\zeta_i</math>, and since <math>\,\zeta_i = 0</math>, the point is on the margin and we call it a support vector.<br />
<br />
These three scenarios are depicted in Fig.<br />
<br />
'''Case 4:''' If <math>\,\zeta_i > 0</math>, then <math>\,\lambda_i \zeta_i = 0</math> implies <math>\,\lambda_i=0</math>, which in turn implies <math>\,\alpha_i=\gamma </math>. Since <math>y_i(\boldsymbol{\beta}^T \boldsymbol{x_i}+\beta_0)\ge 1-\zeta_i </math> with <math>\,\zeta_i > 0</math>, the point can be closer to the decision boundary than the margin, so <math>\,x_i</math> is inside the margin (the same situation as Case 1).<br />
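The case analysis above can be turned into a small bookkeeping routine: given the optimal dual values, each training point is labelled by where it sits relative to the margin. This is an illustrative sketch; the tolerance <code>tol</code> is an assumed parameter that absorbs numerical round-off from the QP solver.<br />

```python
# Label each point from its dual value: alpha_i == gamma -> inside the margin,
# alpha_i == 0 -> outside the margin (irrelevant to the classifier),
# 0 < alpha_i < gamma -> on the margin (a support vector).

def classify_points(alpha, gamma, tol=1e-8):
    labels = []
    for a in alpha:
        if a <= tol:
            labels.append("outside margin")
        elif a >= gamma - tol:
            labels.append("inside margin")
        else:
            labels.append("on margin (support vector)")
    return labels

labels = classify_points([0.0, 0.3, 1.0], gamma=1.0)
# -> ['outside margin', 'on margin (support vector)', 'inside margin']
```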
<br />
=====Soft Margin SVM with Kernel =====<br />
<br />
Like hard-margin SVM, we can use the kernel trick to find a non-linear classifier using the dual formulation.<br />
<br />
In particular, we define a non-linear mapping for <math> \boldsymbol{x_i} </math> as <math> \Phi(\boldsymbol{x_i}) </math>, then in dual objective we compute <math> \Phi^T(\boldsymbol{x_i}) \Phi(\boldsymbol{x_j}) </math> instead of <math> \boldsymbol{x_i}^T \boldsymbol{x_j} </math>. Using a kernel function <math> K(\boldsymbol{x_i}, \boldsymbol{x_j}) = \Phi^T(\boldsymbol{x_i}) \Phi(\boldsymbol{x_j}) </math> from the list provided in the previous lecture notes, we then do not need to explicitly map <math> \Phi(\boldsymbol{x_i}) </math>.<br />
<br />
The dual problem we solve is:<br />
<br />
<math><br />
\begin{align}<br />
\max_{\boldsymbol{\alpha}} \quad & \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(\boldsymbol{x_i}, \boldsymbol{x_j}) \\<br />
\text{s.t. } & \sum_i \alpha_i y_i = 0 \\<br />
& 0 \le \alpha_i \le \gamma, \qquad i=1,...,M<br />
\end{align}<br />
</math><br />
<br />
where <math>\, K(\boldsymbol{x_i}, \boldsymbol{x_j}) </math> is an appropriate kernel function specification.<br />
<br />
To make it clear why we do not need to explicitly map <math> \Phi(\boldsymbol{x_i}) </math>: If we use the kernel trick, both hard- and soft-margin SVMs find the following value for the optimum <math> \boldsymbol{\beta} </math>:<br />
<br />
<math> \boldsymbol{\beta} = \sum_i \alpha_i y_i \Phi(\boldsymbol{x_i}) </math><br />
<br />
From the definition of the classifier, the class label for a point is given by the sign of:<br />
<br />
<math> \boldsymbol{\beta}^T \Phi(\boldsymbol{x}) + \beta_0 </math><br />
<br />
Plugging the formula for <math> \boldsymbol{\beta} </math> in the expression above we get:<br />
<br />
<math> \sum_i \alpha_i y_i \Phi^T(\boldsymbol{x_i}) \Phi(\boldsymbol{x}) + \beta_0 </math><br />
<br />
which, from the properties of kernel functions, is equal to:<br />
<br />
<math> \sum_i \alpha_i y_i K(\boldsymbol{x_i}, \boldsymbol{x}) + \beta_0 </math><br />
<br />
Thus, we do not need to explicitly map <math> \boldsymbol{x_i} </math> to a higher dimension.<br />
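The kernelized decision rule above only needs kernel evaluations between the training points and the query point, never the explicit mapping <math>\Phi</math>. A minimal Python sketch (the toy data and dual values are made-up illustrative numbers, reusing the symmetric two-point example):<br />

```python
import math

def linear(a, b):
    return sum(u * v for u, v in zip(a, b))

def rbf(a, b, sigma=1.0):
    sq = sum((u - v) ** 2 for u, v in zip(a, b))
    return math.exp(-sq / (2 * sigma ** 2))

# decision value: sum_i alpha_i y_i K(x_i, x) + beta_0
def decision_value(x, X, y, alpha, beta_0, kernel=rbf):
    return sum(alpha[i] * y[i] * kernel(X[i], x) for i in range(len(X))) + beta_0

def predict(x, X, y, alpha, beta_0, kernel=rbf):
    return 1 if decision_value(x, X, y, alpha, beta_0, kernel) > 0 else -1

X = [(1.0, 0.0), (-1.0, 0.0)]
y = [1, -1]
alpha = [0.5, 0.5]        # optimal duals for this symmetric toy problem
label = predict((2.0, 0.0), X, y, alpha, 0.0, kernel=linear)
# label == 1
```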
<br />
=====Soft Margin SVM Implementation =====<br />
<br />
The SVM optimization problem is a quadratic program and we can use any quadratic solver to accomplish this. For example, MATLAB's optimization toolbox provides <code>quadprog</code>. Alternatively, CVX (by Stephen Boyd) is an excellent optimization toolbox that integrates with MATLAB and allows one to enter convex optimization problems as though they are written on paper (and it is free). <br />
<br />
We prefer to solve the dual since it is an easier problem (and it also allows us to use a kernel). Using CVX this would be coded as<br />
<br />
<pre><br />
K = X*X'; % Linear kernel<br />
H = (y*y') .* K;<br />
cvx_begin <br />
variable alpha(M,1);<br />
maximize (sum(alpha) - 0.5*quad_form(alpha, H))  % CVX requires quad_form for quadratics<br />
subject to<br />
y'*alpha == 0; <br />
alpha >= 0;<br />
alpha <= gamma;<br />
cvx_end<br />
</pre><br />
<br />
which provides us with optimal <math>\,\boldsymbol{\alpha}</math>. <br />
<br />
Now we can obtain <math>\,\beta_0</math> by using any point on the margin (i.e. <math>\,0 < \alpha_i < \gamma</math>), and solving<br />
<br />
<math><br />
y_i \left(\sum_j y_j \alpha_j K(\boldsymbol{x_j}, \boldsymbol{x_i}) + \beta_0 \right) = 1<br />
</math><br />
<br />
Note, the linear kernel <math>\,K(\boldsymbol{x_i}, \boldsymbol{x_j}) = \boldsymbol{x_i}^T \boldsymbol{x_j}</math> can also be used here. <br />
<br />
Finally, we can classify a new data point <math>\,\boldsymbol{x}</math>, according to<br />
<br />
<math>h(\boldsymbol{x}) = <br />
\begin{cases} <br />
+1, \ \ \text{if } \sum_j y_j \alpha_j K(\boldsymbol{x_j}, \boldsymbol{x}) + \beta_0 > 0\\<br />
-1, \ \ \text{if } \sum_j y_j \alpha_j K(\boldsymbol{x_j}, \boldsymbol{x}) + \beta_0 < 0<br />
\end{cases}<br />
</math><br />
<br />
Alternatively, using plain MATLAB, the following code finds <math>\,\beta</math> (as <code>b</code>) and <math>\,\beta_0</math> (as <code>b0</code>). <br />
<br />
<pre><br />
ell = size(X, 1);<br />
H = (y * y') .* (X * X' + (1/gamma) * eye(ell));<br />
f = -ones(1, ell);<br />
LB = zeros(ell, 1);<br />
UB = gamma * ones(ell, 1);<br />
alpha = quadprog(H, f, [], [], y', 0, LB, UB);<br />
b = X' * (alpha .* y);<br />
% Select a point close to the margin (alpha bounded away from 0) to solve for b0<br />
i = min(find((alpha > 0.1) & (y == 1)));<br />
XX = X * X';<br />
b0 = 1 - XX(i, :) * (alpha .* y);<br />
</pre><br />
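For readers without MATLAB, here is a rough pure-Python analogue that solves the same dual by SMO-style pairwise coordinate ascent with a linear kernel. This is a teaching sketch, not the solver used in the lecture: it holds <math>\beta_0</math> at 0 while iterating, which is exact for the symmetric toy data below, and the pairwise updates preserve both the box constraint and <math>\sum_i \alpha_i y_i = 0</math>.<br />

```python
import random

def smo_sketch(X, y, C, iters=2000, seed=0):
    """Pairwise coordinate ascent on the soft-margin dual, linear kernel."""
    n = len(X)
    K = [[sum(u * v for u, v in zip(X[i], X[j])) for j in range(n)]
         for i in range(n)]
    alpha = [0.0] * n
    rng = random.Random(seed)
    for _ in range(iters):
        i, j = rng.randrange(n), rng.randrange(n)
        if i == j:
            continue
        # prediction errors E_k = f(x_k) - y_k, with beta_0 fixed at 0
        Ei = sum(alpha[k] * y[k] * K[k][i] for k in range(n)) - y[i]
        Ej = sum(alpha[k] * y[k] * K[k][j] for k in range(n)) - y[j]
        eta = K[i][i] + K[j][j] - 2.0 * K[i][j]
        if eta <= 1e-12:
            continue
        # clip bounds keeping 0 <= alpha <= C and sum_k alpha_k y_k = 0
        if y[i] != y[j]:
            L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
        else:
            L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
        aj = min(H, max(L, alpha[j] + y[j] * (Ei - Ej) / eta))
        alpha[i] += y[i] * y[j] * (alpha[j] - aj)
        alpha[j] = aj
    return alpha

X = [(2.0, 0.0), (1.0, 0.0), (-1.0, 0.0), (-2.0, 0.0)]
y = [1, 1, -1, -1]
alpha = smo_sketch(X, y, C=10.0)
beta = [sum(alpha[i] * y[i] * X[i][d] for i in range(4)) for d in range(2)]
```

On this toy set the inner points become the support vectors and the outer points end up with zero dual weight, mirroring the case analysis earlier in the notes.<br />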
<br />
===== Intuitive Connection to Hard Margin Case =====<br />
The forms of the dual in the Hard Margin and Soft Margin cases are exceedingly similar; the only difference is a further restriction (<math>\ \alpha_i \le \gamma</math>) on the dual variables. You could even use the soft margin formulation to solve a case where the hard margin problem is feasible. This is not typically done, but doing so can give considerable insight into how the soft margin problem reacts to changes in <math>\ \gamma </math>. If we let <math>\ \gamma \to +\infty</math> we see that the soft margin problem approaches the hard margin problem. If we examine the primal problem this matches our intuitive expectation. As <math>\ \gamma \to +\infty</math> the penalty for being inside the margin increases to infinity and thus the optimal solution will place paramount importance on having a hard margin. <br />
<br />
When choosing <math>\ \gamma </math> one needs to be careful and understand the implications. Values of <math>\ \gamma </math> that are too large result in slavish dedication to getting as close to a hard margin as possible; this can lead to poor decisions, especially if there are outliers involved. Values of <math>\ \gamma </math> that are too small do not adequately penalize misclassified points. It is important both to test different values of <math>\ \gamma </math> and to exercise discretion when selecting the values to test. It is also important to examine the impact of outliers, as they can be extremely destructive to the usefulness of the SVM classifier.<br />
<br />
<br />
===Multiclass Support Vector Machines===<br />
<br />
Support vector machines were originally designed for binary classification; therefore we need a methodology to adapt binary SVMs to a multi-class problem. How to effectively extend SVMs to multi-class classification is still an ongoing research issue. Currently the most popular approach for multi-category SVM is to construct and combine several binary classifiers. Different coding and decoding strategies can be used for this purpose, among which one-against-all and one-against-one (pairwise) are the most popular <ref name="CMBishop" />.<br />
<br />
====One-Against-All method====<br />
Assume that we have <math>\ k </math> discrete classes. For a one-against-all SVM, we determine <math>\ k </math> decision functions that separate one class from the remaining classes. Let the <math>\ i^{th} </math> decision function, with the maximum margin, that separates class <math>\ i </math> from the remaining classes be:<br />
<br />
<br />
<math>D_i(\mathbf{x})=\mathbf{w}_i^Tf(\mathbf{x})+b_i</math><br />
<br />
<br />
The hyperplane <math>\ D_i(\mathbf{x})=0 </math> forms the optimal separating hyperplane and, if the classification problem is separable, the training data <math>\mathbf{x}</math> satisfy<br />
<br />
<math>\begin{cases}<br />
D_i(\mathbf{x})\geq1, &\mathbf{x}\text{ belongs to class }i\\<br />
D_i(\mathbf{x})\leq-1, &\mathbf{x}\text{ belongs to the remaining classes}\\<br />
\end{cases}<br />
</math><br />
<br />
In other words, the decision function is the sign of <math>\ D_i(\mathbf{x})</math> and therefore it is a discrete function. If <math>\ D_i(\mathbf{x})>0 </math> holds for multiple values of <math>\ i </math>, or for no <math>\ i </math>, then <math>\mathbf{x}</math> is unclassifiable. The figure below demonstrates the one-vs-all multi-class scheme, where the pink area is the unclassifiable region.<br />
<br />
[[File:one-vs-all multiclass.jpg|400px|thumb|centre|one-against-all multi-class scheme]]<br />
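The one-against-all rule above can be sketched as follows (a minimal Python sketch with NumPy; the weight vectors and biases are hand-picked toy values rather than trained SVM solutions, and the feature map <math>f</math> is taken to be the identity):

```python
import numpy as np

def one_vs_all_predict(x, W, b):
    """One-against-all rule: x is classifiable only if exactly one
    decision function D_i(x) = w_i^T x + b_i is positive.
    W: (k, d) array of weight vectors, b: (k,) array of biases.
    Returns the class index, or None if x is unclassifiable."""
    D = W @ x + b                      # all k decision values
    positive = np.flatnonzero(D > 0)   # classes claiming x
    if len(positive) == 1:
        return int(positive[0])
    return None                        # zero or several classes claim x

# Toy 2-class example in 2D (hand-picked hyperplanes, not trained):
W = np.array([[1.0, 0.0], [-1.0, 0.0]])
b = np.array([-1.0, -1.0])
print(one_vs_all_predict(np.array([2.0, 0.0]), W, b))   # 0
print(one_vs_all_predict(np.array([0.0, 0.0]), W, b))   # None (unclassifiable)
```

The `None` return corresponds exactly to the pink unclassifiable region in the figure above.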
<br />
====One-Against-One (Pairwise) method====<br />
<br />
In this method we construct a binary classifier for each possible pair of classes, so for <math>\ k </math> classes we have <math>\frac{k(k-1)}{2} </math> decision functions. The decision function for the pair of classes <math>i</math> and <math>j</math> is given by<br />
<br />
<math>D_{ij}(\mathbf{x})=\mathbf{w}_{ij}^Tf(\mathbf{x})+b_{ij}</math><br />
<br />
<br />
where <math>D_{ji}(\mathbf{x})=-D_{ij}(\mathbf{x})</math>.<br />
<br />
<br />
The final decision is reached by a maximum-voting scheme. That is, for the datum <math>\mathbf{x}</math> we calculate<br />
<br />
<br />
<math>D_i(\mathbf{x})=\sum_{j=1,\, j\neq i}^{k}\operatorname{sign}(D_{ij}(\mathbf{x}))</math><br />
<br />
<br />
And <math>\mathbf{x}</math> is classified into the class <math>\arg\max_i D_i(\mathbf{x})</math>.<br />
<br />
<br />
The figure below demonstrates the one-vs-one multi-class scheme, where the pink area is the unclassifiable region.<br />
<br />
<br />
<br />
[[File:one-vs-one multiclass.jpg|400px|thumb|centre|one-vs-one multi-class scheme]]<br />
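The pairwise voting rule can be sketched as follows (a minimal Python sketch; the pairwise decision functions are hand-picked toy thresholds on the real line, not trained SVMs):

```python
import numpy as np

def one_vs_one_predict(x, pairwise):
    """Max-voting over pairwise decision functions.
    pairwise: dict mapping (i, j) with i < j to a function D_ij(x);
    sign(D_ij(x)) > 0 votes for class i, < 0 votes for class j."""
    k = max(max(p) for p in pairwise) + 1
    votes = np.zeros(k)
    for (i, j), D_ij in pairwise.items():
        s = np.sign(D_ij(x))
        votes[i] += (s > 0)
        votes[j] += (s < 0)
    return int(np.argmax(votes))

# Toy 3-class example on the real line (hand-picked thresholds):
pairwise = {
    (0, 1): lambda x: 1.0 - x,   # class 0 below 1, class 1 above
    (0, 2): lambda x: 2.0 - x,   # class 0 below 2, class 2 above
    (1, 2): lambda x: 2.0 - x,   # class 1 below 2, class 2 above
}
print(one_vs_one_predict(0.5, pairwise))  # 0 (wins both of its matches)
print(one_vs_one_predict(3.0, pairwise))  # 2
```

Ties in the vote (the unclassifiable region in the figure) would surface here as equal vote counts, which `argmax` breaks arbitrarily.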
<br />
===Advantages of Support Vector Machines===<br />
<br />
* SVMs provide good out-of-sample generalization. This means that, by choosing an appropriate generalization grade, SVMs can be robust even when the training sample has some bias. This is mainly due to the selection of the optimal separating hyperplane.<br />
* SVMs deliver a unique solution, since the optimization problem is convex. This is an advantage compared to neural networks, which have multiple solutions associated with local minima and for this reason may not be robust over different samples.<br />
*State-of-the-art accuracy on many problems.<br />
*SVMs can handle many different data types through an appropriate choice of kernel.<br />
<br />
===Disadvantages of Support Vector Machines===<br />
<br />
*Difficulty in choosing the kernel (which we will study later).<br />
<br />
* Limitations in speed and size, both in training and testing.<br />
<br />
*Discrete data presents another problem, although with suitable rescaling excellent results have nevertheless been obtained.<br />
<br />
*The optimal design for multiclass SVM classifiers is a further area for research.<br />
<br />
*A problem with SVMs is the high algorithmic complexity and extensive memory requirements of the required quadratic programming in large-scale tasks.<br />
<br />
===Comparison with Neural Networks <ref>www.cs.toronto.edu/~ruiyan/csc411/Tutorial11.ppt</ref>===<br />
<br />
#Neural Networks:<br />
##Hidden Layers map to lower dimensional spaces<br />
##Search space has multiple local minima<br />
##Training is expensive<br />
##Classification extremely efficient<br />
##Requires number of hidden units and layers<br />
##Very good accuracy in typical domains<br />
#SVMs<br />
##Kernel maps to a very-high dimensional space<br />
##Search space has a unique minimum<br />
##Training is extremely efficient<br />
##Classification extremely efficient<br />
##Kernel and cost the two parameters to select<br />
##Very good accuracy in typical domains<br />
##Extremely robust<br />
<br />
=== The Naive Bayes Classifier ===<br />
<br />
The naive Bayes classifier is a very simple (and often effective) classifier based on Bayes rule. <br />
For further reading check [http://www.saylor.org/site/wp-content/uploads/2011/02/Wikipedia-Naive-Bayes-Classifier.pdf]<br />
<br />
The naive Bayes assumption is that all the features are conditionally independent given the class label. Even though this is usually false (features are often dependent), the resulting model is easy to fit and works surprisingly well.<br />
<br />
That is, the features <math>\,x_{ij}</math>, <math>\,j = 1, ..., d</math>, of <math>\, \mathbf{x}_i \in \mathbb{R}^d</math> are conditionally independent given the class label.<br />
<br />
Thus the Bayes classifier is<br />
<math> h(\mathbf{x}) = \arg\max_k \quad \pi_k f_k(\mathbf{x})</math><br />
<br />
where <math>\hat{f}_k(\mathbf{x}) = \hat{f}_k(x_1 x_2 ... x_d)= \prod_{j=1}^d \hat{f}_{kj}(x_j)</math>.<br />
<br />
We can see this is a direct application of Bayes' rule<br />
<math> P(Y=k|X=\mathbf{x}) =\frac{P(X=\mathbf{x}|Y=k) P(Y=k)} {P(X=\mathbf{x})} = \frac{f_k(\mathbf{x}) \pi_k} {\sum_k f_k(\mathbf{x}) \pi_k}</math>,<br />
<br />
with the naive factorization <math>\, f_k(\mathbf{x})=\prod_{j=1}^d f_{kj}(x_j)</math> and <math>\ \mathbf{x} \in \mathbb{R}^d</math>.<br />
<br />
Note that earlier we assumed class-conditional densities that were multivariate normal with a dense covariance matrix. Here we are forcing the covariance matrix to be diagonal. This simplification, while not always realistic, can provide a more robust model.<br />
<br />
As another example, consider the 'iris' dataset in R. We would like to use known data (sepal length, sepal width, petal length, and petal width) to predict species of iris. As is typically done, we will use the maximum a posteriori (MAP) rule to decide the class to which each observation belongs. The code for using a built-in function in R to classify is:<br />
<br />
<pre style="align:left; width: 75%; padding: 2% 2%"><br />
#Naive Bayes classification of the iris data using a built-in function:<br />
<br />
library(e1071) #provides naiveBayes()<br />
<br />
attach(iris)<br />
model <- Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width<br />
m <- naiveBayes(model, data = iris)<br />
p <- predict(m, iris) #predicted class labels<br />
<br />
#Misclassification rate: proportion of predictions disagreeing with the truth<br />
misclass <- mean(p != Species)<br />
misclass<br />
#So we get that 4% of the points are misclassified.<br />
</pre><br />
<br />
In this particular dataset, we would not expect naive Bayes to be the best approach for classification, since the assumption of independent predictor variables is violated (sepal length and sepal width are correlated, for example). However, the misclassification rate is low (about 4%), which indicates that naive Bayes does a good job of classifying these data.<br />
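To make the diagonal-covariance model concrete, the same classifier can be sketched from scratch. The following Python sketch fits one univariate normal per class per feature and applies the MAP rule; it uses toy synthetic data, not the iris set:

```python
import numpy as np

def fit_gnb(X, y):
    """Estimate per-class priors, feature means, and feature variances."""
    params = {}
    for k in np.unique(y):
        Xk = X[y == k]
        params[k] = (len(Xk) / len(X),       # prior pi_k
                     Xk.mean(axis=0),        # mean of each feature
                     Xk.var(axis=0) + 1e-9)  # variance (small floor for stability)
    return params

def predict_gnb(X, params):
    """MAP rule: argmax_k log pi_k + sum_j log N(x_j; mu_kj, var_kj)."""
    out = []
    for x in X:
        scores = {k: np.log(pi)
                     - 0.5 * np.sum(np.log(2 * np.pi * var)
                                    + (x - mu) ** 2 / var)
                  for k, (pi, mu, var) in params.items()}
        out.append(max(scores, key=scores.get))
    return np.array(out)

# Two well-separated Gaussian blobs:
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
params = fit_gnb(X, y)
print((predict_gnb(X, params) == y).mean())  # training accuracy, near 1.0
```

Summing log-densities over features is exactly the product <math>\prod_j \hat{f}_{kj}(x_j)</math> from the factorization above, computed in log space for numerical stability.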
<br />
=== [http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm K-Nearest-Neighbors(k-NN)] ===<br />
<br />
[[File:KNN.jpg|250px|thumb|right|Classifying x by assigning it the label most frequently represented among k nearest samples and use a voting scheme.]]<br />
<br />
Given a data point x, find the k nearest data points to x and classify x using the majority vote of these k neighbors (k is a positive<br />
integer, typically small.) If k=1, then the object is simply assigned to the class of its nearest neighbor.<br />
<br />
<br />
# Ties can be broken randomly.<br />
# k can be chosen by cross-validation<br />
# k-nearest neighbor algorithm is sensitive to the local structure of the data<ref><br />
http://www.saylor.org/site/wp-content/uploads/2011/02/Wikipedia-k-Nearest-Neighbor-Algorithm.pdf</ref>.<br />
# Nearest neighbor rules in effect compute the decision boundary in an implicit manner.<br />
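The majority-vote rule can be sketched as follows (a minimal Python sketch; Euclidean metric, hand-made toy data):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points
    (Euclidean metric; ties are broken arbitrarily, as the notes allow)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Two tiny clusters around (0, 0) and (1, 1):
X_train = np.array([[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [1.1, 0.9]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.05, 0.0]), k=3))  # 0
print(knn_predict(X_train, y_train, np.array([1.0, 0.95]), k=3))  # 1
```

With k=1 this reduces to assigning the label of the single nearest neighbour.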
<br />
=====Requirements of k-NN:=====<br />
<ref>http://courses.cs.tamu.edu/rgutier/cs790_w02/l8.pdf</ref><br />
# An integer k<br />
# A set of labeled examples (training data)<br />
# A metric to measure “closeness”<br />
<br />
=====Advantages:=====<br />
# Able to obtain optimal solution in large sample.<br />
# Simple implementation<br />
# There are noise-reduction techniques designed specifically for k-NN that improve the efficiency and accuracy of the classifier.<br />
<br />
=====Disadvantages:=====<br />
# If the training set is too large, it may have poor run-time performance.<br />
# k-NN is very sensitive to irrelevant features since all features contribute to the similarity and thus to classification.<ref><br />
http://www.google.ca/url?sa=t&rct=j&q=k%20nearest%20neighbors%20disadvantages&source=web&cd=1&ved=0CCIQFjAA&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.100.1131%26rep%3Drep1%26type%3Dpdf&ei=3feyToHMG8Xj0QGOoMDKBA&usg=AFQjCNFF1XsYgZy1W2YLQMNTq_7s07mfqg&sig2=qflY4MffEHwP9n-WpnWMdg</ref><br />
# A small training set can lead to a high misclassification rate.<br />
# k-NN suffers from the curse of dimensionality. As the number of dimensions of the feature space increases, points become farther apart from each other, making it harder to classify new points. In 10 dimensions, a neighbourhood must extend over approximately 80% of the range of each coordinate to capture just 10% of the data (see textbook page 23). Algorithms that address this problem include approximate nearest neighbours.<ref>P. Indyk and R. Motwani, Approximate nearest neighbors: towards removing the curse of dimensionality. STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing. pg 604-613.</ref><br />
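The 80% figure can be checked directly: for data uniform on the unit hypercube <math>[0,1]^d</math>, a subcube capturing a fraction <math>p</math> of the data must have edge length <math>p^{1/d}</math>. A quick Python check:

```python
# Edge length of a subcube capturing fraction p of uniform data in [0,1]^d:
def edge_length(p, d):
    return p ** (1.0 / d)

print(edge_length(0.10, 10))  # about 0.79: ~80% of each coordinate's range
print(edge_length(0.01, 10))  # about 0.63: even 1% of the data needs ~63%
```

In one dimension, by contrast, capturing 10% of the data needs only 10% of the range, which is why local neighbourhoods stop being "local" as <math>d</math> grows.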
<br />
=====Extensions and Applications=====<br />
<br />
In order to improve the obtained results, we can do the following:<br />
# Preprocessing: smoothing the training data (remove any outliers and isolated points)<br />
# Adapt metric to data<br />
<br />
Besides classification, k-nearest-neighbours is useful for other tasks as well. For example, k-NN has been used in regression and in product recommendation systems<ref><br />
http://www.cs.ucc.ie/~dgb/courses/tai/notes/handout4.pdf</ref>.<br />
<br />
In 1996 Support Vector Regression (SVR) <ref>"Support Vector Regression Machines". Advances in Neural Information Processing Systems 9, NIPS 1996, 155–161, MIT Press.</ref> was proposed. SVR depends only on a subset of the training data, since the cost function ignores training points whose predictions lie within a threshold of their targets.<br />
<br />
SVM is commonly used in bioinformatics. Common uses include classification of DNA sequences, promoter recognition, and identifying disease-related microRNAs. Promoters are short sequences of DNA that act as a signal for gene expression. In one paper, Robertas Damaševičius tries using a power series kernel function and 11 classification rules for data projection to classify these sequences, to aid active gene location.<ref>Damaševičius, Robertas. "Analysis of Binary Feature Mapping Rules for Promoter Recognition in Imbalanced DNA Sequence Datasets using Support Vector Machine". Proceedings from 4th International IEEE Conference "Intelligent Systems". 2008.</ref> MicroRNAs are non-coding RNAs that target mRNAs for cleavage during protein synthesis. There is growing evidence suggesting that microRNAs "play important roles in human disease development, progression, prognosis, diagnosis and evaluation of treatment response". Therefore, there is increasing research into the role of microRNAs underlying human diseases. SVM has been proposed as a method of separating positive microRNA-disease associations from negative ones.<ref>Jiang, Qinghua; Wang, Guohua; Zhang, Tianjiao; Wang, Yadong. "Predicting Human microRNA-disease Associations Based on Support Vector Machine". Proceedings from IEEE International Conference on Bioinformatics and Biomedicine. 2010.</ref><br />
<br />
=====Selecting k=====<br />
Generally speaking, a larger k classifies data more reliably than a smaller k, as it reduces the overall effect of noise, but as k increases so does the cost of computation. To determine an optimal k, cross-validation can be used.<ref>http://chem-eng.utoronto.ca/~datamining/dmc/k_nearest_neighbors_reg.htm</ref> Traditionally, k is fixed for every test example. Another approach, the adaptive k-nearest neighbour algorithm, was proposed to improve the selection of k. In this algorithm, k is not a fixed number but depends on the nearest neighbour of the data point. In the training phase, the algorithm calculates the optimal k for each training data point: the minimum number of neighbours required to get the correct class label. In the testing phase, it finds the nearest neighbour of the testing data point and its corresponding optimal k, and then performs the k-NN algorithm with that k to classify the point.<ref>Shiliang Sun, Rongqing Huang, "An adaptive k-nearest neighbor algorithm", 2010 Seventh International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), 2010.</ref><br />
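Choosing k by cross-validation can be sketched as follows (a minimal Python sketch; the fold splitting, the toy two-blob data, and the candidate list are all illustrative choices):

```python
import numpy as np

def cv_choose_k(X, y, candidate_ks, n_folds=5, seed=0):
    """Choose k for k-NN by n-fold cross-validated misclassification rate."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    errors = {}
    for k in candidate_ks:
        fold_errs = []
        for f in range(n_folds):
            test = folds[f]
            train = np.concatenate([folds[g] for g in range(n_folds) if g != f])
            wrong = 0
            for i in test:
                d = np.linalg.norm(X[train] - X[i], axis=1)
                nearest = train[np.argsort(d)[:k]]
                pred = np.bincount(y[nearest]).argmax()  # majority vote
                wrong += int(pred != y[i])
            fold_errs.append(wrong / len(test))
        errors[k] = float(np.mean(fold_errs))
    return min(errors, key=errors.get), errors

# Two well-separated classes: any small odd k should do well here.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(4, 1, (30, 2))])
y = np.array([0] * 30 + [1] * 30)
best_k, errs = cv_choose_k(X, y, candidate_ks=[1, 3, 5, 7])
print(best_k, errs[best_k])  # low cross-validated error expected
```

On harder data, the dictionary `errs` typically shows the bias-variance trade-off the paragraph describes: very small k is noisy, very large k over-smooths.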
<br />
=====Further Readings=====<br />
1- SVM-KNN: Discriminative Nearest Neighbor Classification for Visual Category Recognition [http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1641014 here]<br />
<br />
2- SVM application list[http://www.clopinet.com/isabelle/Projects/SVM/applist.html here]<br />
<br />
3- The kernel trick for distances [http://74.125.155.132/scholar?q=cache:AfKdFY6a1cMJ:scholar.google.com/&hl=en&as_sdt=2000 here]<br />
<br />
4- Exploiting the kernel trick to correlate fragment ions for peptide identification via tandem mass spectrometry [http://bioinformatics.oxfordjournals.org/content/20/12/1948.short here]<br />
<br />
5- General overview of SVM and Kernel Methods. Easy to understand presentation. [http://www.support-vector.net/icml-tutorial.pdf here]<br />
<br />
== Supervised Principal Component Analysis (Lecture: Nov. 8, 2011) ==<br />
<br />
Recall that '''PCA''' finds the directions of maximum variation of <math>d</math>-dimensional data and may be used as a dimensionality-reduction pre-processing step for classification. '''FDA''' is a form of supervised dimensionality reduction or feature extraction that finds the best direction onto which to project the data so that the data points are easily separated into their respective classes, by considering inter- and intra-class distances (i.e. minimize intra-class distance and variance, maximize inter-class distance and variance). PCA differs from FDA in that PCA is unsupervised, whereas FDA is supervised. Thus, FDA is better at finding directions that separate the data points for classification in a supervised problem. <br />
<br />
'''Supervised PCA (SPCA)''' is a generalization of PCA. SPCA can use label information for classification tasks and has some advantages over FDA. For example, FDA can only project onto a space of at most <math>\ k-1 </math> dimensions, regardless of the dimensionality of the data, where <math>\ k </math> is the number of classes. This is not always desirable for dimensionality reduction.<br />
<br />
SPCA estimates the sequence of principal components having maximum dependency on the response variable. It can be solved in closed form, has a dual formulation that reduces the computational complexity when the dimension of the data is significantly greater than the number of data points, and it can be kernelized. <ref>Elnaz Barshan, Ali Ghodsi, Zohreh Azimifar, and Mansoor Zolghadri. Supervised Principal Component Analysis: Visualization, Classification and Regression on Subspaces and Submanifolds , Journal of Pattern Recognition, to appear 2011</ref><br />
<br />
===SPCA Problem Statement===<br />
Suppose we are given a set of data <math>\ \{x_i, y_i\}_{i=1}^n , x_i \in \mathbb{R}^{p}, y_i \in \mathbb{R}^{l}</math>. Note that <math>\ y_i</math> is not restricted to binary classes. So the assumption of having only discrete values for labels is relaxed here, which means this model can be used for regression as well. Target values (<math>\ y </math>) do not have to lie in a one-dimensional space. Just as for PCA, we are looking for a lower-dimensional subspace <math>\ S = U^T X </math>, where <math>\ U </math> is an orthogonal projection. However, instead of finding the direction of maximum variation (as in regular PCA), we are looking for the subspace that contains as much predictive information about <math>\ Y </math> as the original covariate <math>\ X </math>, i.e. we are trying to determine a projection matrix <math>\ U</math> such that <math>\ P(Y|X)=P(Y|U^TX) </math>. We assume the pairs <math>\ (x_i, y_i) </math> are drawn i.i.d. from a joint distribution, and that predictive information exists between <math>\ X </math> and <math>\ Y </math>: if they were completely independent, there would be no way of doing classification or regression.<br />
<br />
===Warning===<br />
If we project our data into a high enough dimension, we can fit any data, even noise. In his book "The God Gene: How Faith is Hardwired into our Genes", Dean H. Hamer discusses how factor analysis (a model which "uses regression modelling techniques to test hypotheses producing error terms") was used to find a correlation between a gene (VMAT2) and a person's belief in God. The full book is available at: <ref>http://books.google.ca/books?id=TmR6uAAHEssC&pg=PA33&lpg=PA33&dq=god+gene+statistics&source=bl&ots=8q-jSwKZ8O&sig=O8OBe2YaPbE0vMp9A6PxEC9DwL0&hl=en&ei=lWO8Tp_nN4H40gGA2uXjBA&sa=X&oi=book_result&ct=result&resnum=2&ved=0CCEQ6AEwAQ#v=onepage&q&f=false </ref>. <br />
<br />
It appears as though finding a correlation between seemingly uncorrelated data is sometimes statistically trivial. One study found correlations between people's shopping habits and their genetics. Family members were shown to have far more similar consumer habits than those who did not share DNA. This was then used to explain "fondness for specific products such as chocolate, science-fiction movies, jazz, hybrid cars and mustard." <ref>http://www.businessnewsdaily.com/genetics-incluence-shopping-habits-0593/</ref>.<br />
<br />
The main idea is that when we are in a highly dimensional space <math>\ \mathbb{R}^d</math>, if we do not have enough data (i.e. <math>n \approx d</math>), then it is easy to find a classifier that separates the data across its many dimensions.<br />
<br />
===Different Techniques for Dimensionality Reduction===<br />
* Classical '''Fisher's Discriminant Analysis (FDA)'''<br />
<br />
The goal of FDA is to reduce the dimensionality of data in <math>\ \mathbb{R}^d</math> in order to have separable data points in a new space <math>\ \mathbb{R}^{d-1}</math>.<br />
<br />
* '''Metric Learning (ML)'''<br />
<br />
This is a large family of methods.<br />
<br />
* '''Sufficient Dimensionality Reduction (SDR)'''<br />
<br />
This is also a family of methods. In recent years SDR has been used to denote a body of new ideas and methods for dimension reduction. Like Fisher's classical notion of a sufficient statistic, SDR strives for reduction without loss of information. But unlike sufficient statistics, sufficient reductions may contain unknown parameters and thus need to be estimated.<br />
<br />
* '''Supervised Principal Components (BSPC)'''<br />
<br />
A method proposed by Bair et al. This is a different method from the SPCA method discussed in class despite having a similar name.<br />
<br />
===Metric Learning ===<br />
First define a new metric as:<br />
<br />
<math>\ d_A(\mathbf{x}_i, \mathbf{x}_j)=||\mathbf{x}_i -\mathbf{x}_j||_A = \sqrt{(\mathbf{x}_i - \mathbf{x}_j)^TA(\mathbf{x}_i - \mathbf{x}_j)}</math> <br />
<br />
This metric satisfies the requisite properties of a metric only if <math>\ A </math> is a positive definite matrix. <br />
This restriction is often relaxed to positive semi-definite. Relaxing the condition can be useful if we wish to disregard uninformative covariates. <br />
<br />
''Note 1:'' <math>\ A </math> being positive semi-definite ensures that this metric respects non-negativity and the triangle inequality, but allows <math>\ d_A(\mathbf{x}_i,\mathbf{x}_j)=0</math> to not imply <math>\ \mathbf{x}_i=\mathbf{x}_j</math> <ref name="Xing">Xing, EP. Distance metric learning with application to clustering with side-information. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.67.7952&rep=rep1&type=pdf]</ref>. <br />
<br />
''Common choices for A'' <br />
<br />
1) <math>\ A=I</math>. This gives the Euclidean distance. <br />
<br />
2) <math>\ A=D</math>, where <math>\ D</math> is a diagonal matrix. The diagonal values can be thought of as reweighting the importance of each covariate, and these weights can be learned from training data.<br />
<br />
3) <math>\ A=D</math>, where <math>\ D</math> is a diagonal matrix with <math>\ D_{ii} = Var(i^{th}\text{ covariate})^{-1} </math>. This scales each covariate so that they all have equal variance and thus equal impact on the distance. This metric is consistent with, and works very well for, covariates that are independent and normally distributed.<br />
<br />
4) <math>\ A=\Sigma^{-1} </math>, where <math>\ \Sigma </math> is the covariance matrix of the covariates. This metric is consistent with, and works very well for, covariates that are normally distributed. The corresponding metric is called the Mahalanobis distance.<br />
<br />
When dealing with data that are on different measurement scales, choices 3 and 4 are vastly preferable to the Euclidean distance, as they prevent covariates with large measurement scales from dominating the metric. <br />
<br />
<br />
For metric learning, construct the Mahalanobis distance over the input space and use it instead of the Euclidean distance. This is equivalent to transforming the data points with a linear transformation and then computing the Euclidean distance in the transformed space. To see that this is true, suppose we project each data point onto a subspace <math>\ S </math> using <math>\ \mathbf{x}' = U^T\mathbf{x}</math> and calculate the Euclidean distance: <br />
<br />
<math>\ ||\mathbf{x}_i' - \mathbf{x}_j'||_2^2= (U^T\mathbf{x}_i -U^T\mathbf{x}_j)^T(U^T\mathbf{x}_i -U^T\mathbf{x}_j) = (\mathbf{x}_i -\mathbf{x}_j)^TUU^T(\mathbf{x}_i -\mathbf{x}_j)</math> <br />
<br />
This is the same as Mahalanobis distance in the new space for <math>\ A=UU^T</math>.<br />
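This equivalence is easy to verify numerically (a Python sketch; <math>U</math> is an arbitrary linear map, so <math>\ A=UU^T</math> is positive semi-definite, and the vectors are random toy points):

```python
import numpy as np

def d_A(xi, xj, A):
    """Induced distance d_A(x_i, x_j) = sqrt((x_i - x_j)^T A (x_i - x_j))."""
    diff = xi - xj
    return float(np.sqrt(diff @ A @ diff))

rng = np.random.default_rng(0)
xi, xj = rng.normal(size=3), rng.normal(size=3)
U = rng.normal(size=(3, 2))   # any linear map; A = U U^T is PSD
A = U @ U.T
# Same number two ways: the metric with A, or Euclidean distance after projecting by U^T.
print(np.isclose(d_A(xi, xj, A), np.linalg.norm(U.T @ xi - U.T @ xj)))  # True
```

Because <math>U</math> here maps into only 2 dimensions, <math>\ A</math> is rank-deficient: distinct points can have <math>\ d_A=0</math>, which is exactly the positive semi-definite relaxation discussed above.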
<br />
One way to find <math>\ A</math> is to consider the set of similar pairs <math>\ (\mathbf{x}_i,\mathbf{x}_j) \in S</math> and the set of dissimilar pairs <math>\ (\mathbf{x}_i,\mathbf{x}_j) \in D</math>. Then we can solve the convex optimization problem below <ref name="Xing" />.<br />
<br />
<math> \min_A \sum_{(\mathbf{x}_i,\mathbf{x}_j)\in S} (\mathbf{x}_i - \mathbf{x}_j)^TA(\mathbf{x}_i - \mathbf{x}_j) </math><br />
<br />
s.t. <math> \sum_{(\mathbf{x}_i,\mathbf{x}_j)\in D} (\mathbf{x}_i - \mathbf{x}_j)^TA(\mathbf{x}_i - \mathbf{x}_j)\ge 1 </math> and <math>\ A</math> positive semi-definite.<br />
<br />
<br />
Overall, the metric learning technique will attempt to minimize the squared induced distance between similar points while maximizing the squared induced distance between dissimilar points and search for a metric which allows points from the same class to be near one another and points from different classes to be far from one another.<br />
<br />
===Sufficient Dimensionality Reduction (SDR)===<br />
<br />
The goal of dimensionality reduction is to find a function <math>\ S(\mathbf{x}) </math> that maps <math>\ \mathbf{x} </math> from <math>\ \mathbb{R}^n </math> to a proper subspace, which means that the dimension of <math>\ \mathbf{x} </math> is being reduced. An example of <math>\ S(\mathbf{x}) </math> would be a function that uses several linear combinations of <math>\ \mathbf{x} </math>.<br />
<br />
For a dimensionality reduction to be sufficient the following condition must hold:<br />
<br />
::<math>\ P_{Y|X}(y|x) = P_{Y|S(X)}(y|S(x)) </math><br />
<br />
This is equivalent to saying that the distribution of <math>\ y|S(\mathbf{x})</math> is the same as that of <math>\ y |\mathbf{x} </math> [http://rsta.royalsocietypublishing.org/content/367/1906/4385.full]<br />
<br />
This method aims to find a linear subspace <math>\ R </math> such that the projection onto this subspace preserves <math>\ P_{Y|X}(y|x) </math>.<br />
<br />
Suppose that <math>\ S(\mathbf{x}) = U^T\mathbf{x} </math> is a sufficient dimensional reduction, then<br />
<br />
<math>\ P_{Y|X}(y|x) = P_{Y|U^TX}(y|U^T x) </math><br />
<br />
for all <math>\ x \in X </math>, and <math>\ y \in Y </math>, where <math>\ U^T X </math> is the orthogonal projection of <math>\ X </math> onto <math>\ R </math>.<br />
<br />
====Graphical Motivation====<br />
In a regression setting, it is often useful to summarize the distribution of <math>y|\textbf{x}</math> graphically. For instance, one may consider a scatter plot of <math>y</math> versus one or more of the predictors. A scatter plot that contains all available regression information is called a sufficient summary plot.<br />
<br />
When <math>\textbf{x}</math> is high-dimensional, particularly when the number of features of <math>\ X </math> exceeds 3, it becomes increasingly challenging to construct and visually interpret sufficient summary plots without reducing the data. Even three-dimensional scatter plots must be viewed via a computer program, and the third dimension can only be visualized by rotating the coordinate axes. However, if there exists a sufficient dimension reduction <math>R(\textbf{x})</math> with small enough dimension, a sufficient summary plot of <math>y</math> versus <math>R(\textbf{x})</math> may be constructed and visually interpreted with relative ease.<br />
<br />
Hence sufficient dimension reduction allows for graphical intuition about the distribution of <math>y|\textbf{x}</math>, which might not have otherwise been available for high-dimensional data.<br />
<br />
Most graphical methodology focuses primarily on dimension reduction involving linear combinations of <math>\textbf{x}</math>. The rest of this article deals only with such reductions.[http://en.wikipedia.org/wiki/Sufficient_dimension_reduction#Graphical_motivation]<br />
<br />
====Other Methods for Reduction====<br />
Two very common examples of SDR are Sliced Inverse Regression (SIR) and Sliced Average Variance Estimation (SAVE). More information on SIR can be found here [http://en.wikipedia.org/wiki/Sliced_inverse_regression]. In addition [http://mars.wiwi.hu-berlin.de/mediawiki/teachwiki/index.php/Sliced_Inverse_Regression] also provides some examples for SIR.<br />
<br />
===Supervised Principal Components (BSPC)===<br />
<br />
BSPC algorithm:<br />
<br />
1. Compute (univariate) standard regression coefficients for each feature j using the following formula:<br />
<br />
<math>\ s_j=\frac{{X_j}^TY}{\sqrt{X_j^T X_j}} </math><br />
<br />
2. Form the reduced data matrix <math>X_\theta </math> consisting of the columns of <math>X</math> for which <math>\ |s_j|>\theta</math>. Find <math>\ \theta</math> by cross-validation. <br />
<br />
3. Compute the first principal component of the reduced data matrix <math>X_\theta </math><br />
<br />
4. Use the principal component calculated in step (3) in a regression model or a classification algorithm to produce the outcome<br />
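Steps 1–3 can be sketched as follows (a Python sketch; the toy data, the fixed threshold <math>\ \theta</math>, and the use of a linear kernel of ideas are illustrative — in practice <math>\ \theta</math> is chosen by cross-validation as step 2 says):

```python
import numpy as np

def bspc_first_component(X, Y, theta):
    """Bair-style supervised principal components (sketch).
    Step 1: univariate scores s_j = X_j^T Y / sqrt(X_j^T X_j).
    Step 2: keep columns with |s_j| > theta (theta fixed here for illustration).
    Step 3: first principal component of the reduced, centred matrix."""
    s = (X.T @ Y) / np.sqrt(np.sum(X ** 2, axis=0))
    keep = np.abs(s) > theta
    X_red = X[:, keep] - X[:, keep].mean(axis=0)
    _, _, Vt = np.linalg.svd(X_red, full_matrices=False)
    return X_red @ Vt[0], keep   # step 4 would feed these scores to a model

# Toy data: only the first two of five features carry signal about Y.
rng = np.random.default_rng(0)
n = 100
signal = rng.normal(size=n)
cols = [signal + 0.1 * rng.normal(size=n),
        signal + 0.1 * rng.normal(size=n)] + [rng.normal(size=n) for _ in range(3)]
X = np.column_stack(cols)
Y = signal + 0.1 * rng.normal(size=n)
pc, keep = bspc_first_component(X, Y, theta=3.0)
print(keep[:2])  # the two signal features should survive the screen
```

The screening step is what distinguishes this from ordinary PCA: noise features never enter the principal-component computation.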
<br />
<br />
Bair's supervised principal components method is consistent: as the number of data points increases, the direction of its first component converges, whereas in ordinary PCA the components can keep taking different directions as more points are added <ref>Bair E., Prediction by supervised principal components. [http://stat.stanford.edu/~tibs/ftp/spca.pdf]</ref>.<br />
<br />
===Hilbert-Schmidt Independence Criterion (HSIC)===<br />
"Hilbert-Schmidt Norm of the Cross-Covariance operator" is proposed as an independence criterion in reproducing kernel Hilbert spaces (RKHSs).<br />
<br />
The measure is referred to as the '''Hilbert-Schmidt Independence Criterion (HSIC)'''.<br />
<br />
Let <math>\ z=\{(x_1,y_1),...,(x_n,y_n)\} \subseteq \mathcal{X} \times \mathcal{Y}</math> be a series of <math>\ n</math> independent observations drawn from <math>\ P_{(X,Y)}(x,y)</math>. An estimator of HSIC is given by <br />
<br />
<math>HSIC=\frac{1}{(n-1)^2}Tr(KHBH)</math> <br />
<br />
where <math>H, K, B \in\mathbb{R}^{n \times n}</math> and<br />
<br />
<math>K_{ij} =k(x_i,x_j),B_{ij}=b(y_i,y_j), H=I-\frac{1}{n}\boldsymbol{e} \boldsymbol{e}^{T} </math>, where <math>\ k</math> and <math>\ b</math> are positive semidefinite kernel functions, and <math>\ \boldsymbol{e} = [1 1 \ldots 1]^T</math>.<br />
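The estimator can be sketched directly from this formula (a Python sketch; linear kernels are used for <math>k</math> and <math>b</math> here purely for illustration — any positive semi-definite kernels could be substituted):

```python
import numpy as np

def hsic(X, Y):
    """Empirical HSIC = Tr(KHBH)/(n-1)^2, here with linear kernels
    K = X X^T and B = Y Y^T. X: (n, d), Y: (n, l), observations as rows."""
    n = len(X)
    K = X @ X.T
    B = Y @ Y.T
    H = np.eye(n) - np.ones((n, n)) / n   # centring matrix H = I - (1/n) e e^T
    return float(np.trace(K @ H @ B @ H)) / (n - 1) ** 2

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
noise = rng.normal(size=(200, 1))
# A dependent pair scores much higher than an independent pair:
print(hsic(x, 2 * x) > hsic(x, noise))  # True
```

With linear kernels this reduces to a squared empirical cross-covariance, which is why the dependent pair produces a large value while the independent pair's value shrinks toward zero as <math>n</math> grows.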
<br />
<math>XH</math> is the centred version of <math>X</math> (the mean of each row is subtracted):<br />
<br />
<math>XH=X(I- \frac{1}{n}\boldsymbol{e} \boldsymbol{e}^T)=X -\frac{1}{n}X\boldsymbol{e} \boldsymbol{e}^T</math>, where each entry in row <math>i</math> of <math>\frac{1}{n}X\boldsymbol{e}\boldsymbol{e}^T</math> is the mean of the <math>i^{th}</math> row of <math>X</math>.<br />
<br />
<math>HBH</math> is the doubly centred version of <math>B</math> (both the row means and the column means are subtracted).<br />
<br />
We have introduced a way of measuring independence between two distributions. The key idea is that good features should maximize such dependence. Feature selection for various supervised learning problems is unified under HSIC, and the solutions can be approximated using a backward-elimination algorithm. To explain this, we start by asking how to tell whether two distributions are the same. If two distributions have different means, we can say right away that they are different. However, if they share the same mean, we need to look at the second moments, from which we can derive the variance; in general, we need to compare higher-order moments to tell whether two distributions are equal.<br />
<br />
It can be shown mathematically (although not done in class) that if we define a mapping <math>\ \phi </math> that sends the random variable <math>X</math> to a higher-dimensional space, then there is a unique correspondence between <math>\ \mu_x</math>, the mean of <math>\ \phi(X)</math> in that space, and the distribution of <math>X</math>. This suggests that <math>\ \mu_x</math> can reproduce the distribution of <math>X</math>.<br />
<br />
Hence, to determine whether two random variables <math>X</math> and <math>Y</math> have the same distribution, we can take the difference between <math>\ E[\phi(x)]</math> and <math>\ E[\phi(y)]</math> and examine its norm:<br />
<math>|| E[\phi(x)] - E[\phi(y)] ||^2</math><br />
If this value equals 0, then the two distributions are the same.<br />
<br />
Now, to test the independence of <math>\ P_x</math> and <math>\ P_y</math>, we can apply the previous formula to <math>\ P_{xy}</math> and <math>\ P_x P_y</math>: if it equals 0, then <math>\ P_x</math> and <math>\ P_y</math> are independent. The larger the difference, the stronger the dependence between <math>X</math> and <math>Y</math>.<br />
<br />
Utilizing this, we can find the <math>\ U^TX </math> in <math>\ P(Y|X)=P(Y|U^TX) </math> that maximizes the HSIC between <math>\ U^TX </math> and <math>\ Y</math>, which corresponds to the maximum dependence between <math>\ U^TX </math> and <math>\ Y</math>.<br />
<br />
<br />
This gives the HSIC index<br />
<br />
<math>\ Tr(KHBH) </math><br />
<br />
where <math>X</math> and <math>Y</math> are random variables, <math>K</math> is a kernel matrix over <math>X</math>, and <math>B</math> is a kernel matrix over <math>Y</math>.<br />
<br />
==='''Kernel Function'''===<br />
A positive definite kernel can always be written as inner products of a feature mapping.<br /><br />
To prove that a kernel function is valid:<br /><br />
1. Define a feature mapping <math> \phi(x) </math> into some vector space.<br /><br />
2. Define a dot product (a strictly positive definite bilinear form) on that space.<br /><br />
3. Show that <math>\ k(x, x') = \langle\phi(x),\phi(x')\rangle</math>.<br /><br />
[http://www.public.asu.edu/~ltang9/presentation/kernel.pdf]<br />
<br />
The kernel function is used when calculating <math>|| E\phi(x) - E\phi(y) ||^2</math>.<br />
The possible kernel functions we can choose are:<br />
<br />
* Linear kernel: <math>\,k(x,y)=x \cdot y</math><br />
* Polynomial kernel: <math>\,k(x,y)=(x \cdot y)^d</math><br />
* Gaussian kernel: <math>\,k(x,y)=e^{-\frac{||x-y||^2}{2\sigma^2}}</math><br />
* Delta Kernel: <math>\,k(x_i,x_j) =<br />
\begin{cases}<br />
1 & \text{if }x_i=x_j \\ 0 & \text{if }x_i\ne x_j<br />
\end{cases}<br />
</math><br />
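As an illustration, the kernel matrices for the kernels listed above can be computed on a toy data set; the points, labels, and parameter values below are made up for the example:<br />

```python
import numpy as np

X = np.array([[0.0, 1.0],
              [1.0, 0.0],
              [1.0, 1.0]])                 # three points in R^2

linear = X @ X.T                           # k(x, y) = x . y
poly = (X @ X.T) ** 2                      # k(x, y) = (x . y)^d with d = 2
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
gaussian = np.exp(-sq_dists / (2 * 1.0 ** 2))               # sigma = 1
labels = np.array([0, 0, 1])
delta = (labels[:, None] == labels[None, :]).astype(float)  # delta kernel on labels
```

In the HSIC objective, K would typically be one of the first three kernels applied to X and B the delta (or linear) kernel applied to the labels Y.<br />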
<br />
H is a constant matrix of the form <math>\ H = I - \frac{1}{n}ee^T </math>,<br />
<br />
where <math>\ e = \left( \begin{array}{c}1 \\ \vdots \\ 1 \end{array} \right) </math> is the n-dimensional column vector of ones.<br />
<br />
H centres any matrix that it is multiplied with.<br />
So HBH makes B double-centred (both its row means and column means become zero).<br />
<br />
<br />
We wanted the transformation <math>\ U^TX </math> that has the maximum dependence on Y. So we use the HSIC index to measure the dependence between <math>\ U^TX</math> and <math>\ Y</math> and maximize it.<br />
<br />
'''H''' centres X: multiplying by H, as in XH, subtracts the mean from each sample, giving <math>X-\mu</math>. The larger the resulting HSIC value, the more dependent <math>\ U^TX</math> and <math>\ Y</math> are.<br />
<br />
So basically we want to maximize <math>\ Tr(KHBH)</math>:<br />
<br />
<math>\ \max_U Tr(KHBH)</math><br />
<br />
<math>\ = \max_U Tr(X^TUU^TXHBH)</math><br />
<br />
<math>\ = \max_U Tr(U^TXHBHX^TU)</math><br />
<br />
To solve this problem we add the constraint<br />
<br />
<math>\ U^TU=I</math><br />
<br />
Then this is identical to PCA if <math>\ B=I</math><br />
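This reduction to PCA when <math>\ B=I</math> can be checked numerically. The sketch below (synthetic data; all dimensions are illustrative) verifies that the top eigenvectors of <math>\ XHBHX^T</math> with <math>\ B=I</math> span the same subspace as the ordinary PCA directions:<br />

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, d = 200, 5, 2
X = rng.normal(size=(p, n))               # p x n data matrix, columns = samples
H = np.eye(n) - np.ones((n, n)) / n       # centering matrix H = I - (1/n) e e^T
B = np.eye(n)                             # B = I: no label information

Q = X @ H @ B @ H @ X.T                   # X H B H X^T
U = np.linalg.eigh(Q)[1][:, ::-1][:, :d]  # top-d eigenvectors maximize Tr(U^T Q U)

# Ordinary PCA directions: top eigenvectors of the centred scatter matrix X H X^T
U_pca = np.linalg.eigh(X @ H @ X.T)[1][:, ::-1][:, :d]
```

Because H is idempotent (HH = H), setting B = I gives Q = XHX^T exactly, so the two d-dimensional subspaces coincide.<br />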
<br />
===SPCA: Supervised Principal Component Analysis===<br />
<br />
We need to find <math>\ U </math> to maximize <math>\ Tr(HKHB) </math> <br />
where K is a Kernel of <math>\ U^T X </math> (eg: <math>\ X^T UU^T X </math>) and <math>\ B </math> is a Kernel of <math>\ Y </math>(eg: <math>\ Y^T Y </math>):<br />
<br />
{| class="wikitable" cellpadding="5"<br />
|- align="center" <br />
! <math>\ X </math><br />
! <math>\ Y </math><br />
|- align="center"<br />
| <math>\ U^T X </math><br />
| <math>\ Y </math><br />
|-<br />
| <math>\ (U^T X)^T (U^T X) = X^T UU^T X </math><br />
| <math>\ B </math><br />
|}<br />
<br />
<math>\max \; Tr(HKHB) </math><br />
<math>\ \; \; = \; \max Tr(HX^T UU^T XHB) </math><br />
<math>\ \; \; = \; \max Tr(U^T XHBHX^T U) </math><br />
<math>\ \text{subject to } U^T U = I </math><br />
<br />
===Supervised Principal Component Analysis and Conventional PCA===<br />
<br />
[[File:012DR-PCA.jpg|300px|thumb|right|Dimensionality Reduction of the 0-1-2 Data, Using PCA]]<br />
[[File:012DR-SPCA.jpg|300px|thumb|right|Dimensionality Reduction of the 0-1-2 Data, Using Supervised PCA]]<br />
<br />
<br />
This is identical to PCA if B = I: in that case<br />
<br />
<math>XHBHX^T = XHX^T = (X-\mu)(X-\mu)^T \propto \operatorname{cov}(X)</math><br />
<br />
===SPCA===<br />
Algorithm 1 <br /><br />
- Recover basis: Calculate <math>Q=XHBHX^T</math> and let U = the eigenvectors of Q corresponding to the top d eigenvalues.<br />
- Encode training data: <math>Y=U^TXH</math> where Y is the d×n matrix of encoded data <br />
- Reconstruct training data: <math>\hat{X}=UY=UU^TXH</math> <br />
- Encode test example: <math>y=U^T(x-\mu)</math> where y is a d-dimensional encoding of x. <br />
- Reconstruct test example: <math>\hat{x}=Uy=UU^T(x-\mu)</math> <br />
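A minimal numpy sketch of Algorithm 1 on synthetic data (the one-hot label kernel <math>B=Y^TY</math>, the dimensions, and all variable names are illustrative choices):<br />

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, d = 100, 4, 2
X = rng.normal(size=(p, n))                   # p x n training data, columns = samples
labels = rng.integers(0, 2, size=n)
Ymat = np.eye(2)[labels].T                    # 2 x n one-hot label matrix
B = Ymat.T @ Ymat                             # label kernel B = Y^T Y
H = np.eye(n) - np.ones((n, n)) / n           # centering matrix

# Recover basis: top-d eigenvectors of Q = X H B H X^T
Q = X @ H @ B @ H @ X.T
eigvals, eigvecs = np.linalg.eigh(Q)          # eigh returns ascending eigenvalues
U = eigvecs[:, ::-1][:, :d]

Z = U.T @ X @ H                               # encode training data (d x n)
X_hat = U @ Z                                 # reconstruct (centred) training data

mu = X.mean(axis=1, keepdims=True)
x_new = rng.normal(size=(p, 1))               # one test example
y_new = U.T @ (x_new - mu)                    # encode test example
x_new_hat = U @ y_new                         # reconstruct test example
```
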
<br />
Find U that maximizes <math>Tr(HKHB)</math> where K is a kernel of <math>U^TX</math> (e.g. <math>K=X^TUU^TX</math>) and B is a kernel of Y (e.g. <math>B=Y^TY</math>):<br />
<br />
<math>
\max_U Tr(KHBH) 
= \max_U Tr(X^TUU^TXHBH) 
= \max_U Tr(U^TXHBHX^TU) </math>, since the trace is invariant under cyclic permutations<br />
<br />
===Dual Supervised Principal Component Analysis===<br />
<br />
<br />
Let <math>Q = XHBHX^T</math>. Both Q and B are PSD, so they can be factored as<br />
<br />
<math>Q = \psi\psi^T</math><br />
<math>B = \Delta^T\Delta</math><br />
<math>\psi = XH\Delta^T</math><br />
<br />
The solution for U can be obtained from the singular value decomposition (SVD) of <math>\psi</math>:<br />
<br />
<math>\psi = U \Sigma V^T</math><br />
<math>\rightarrow \psi V = U \Sigma</math><br />
<math>\rightarrow \psi V \Sigma^{-1} = U</math><br />
<br />
so the encoding of the training data can be written in terms of V alone:<br />
<br />
<math>U^T XH = \Sigma^{-1} V^T \psi^T XH = \Sigma^{-1} V^T \Delta H X^T XH </math><br />
<br />
This gives a relationship between V and U. You can substitute it into the algorithm above and define everything in terms of V instead of U. By doing this you do not need to find eigenvectors of Q, which has high dimensionality.<br />
<br />
<br />
Algorithm 2 <br /><br />
Recover basis: calculate <math>\psi^T \psi</math> and let V=eigenvector of <math>\psi^T \psi</math> corresponding to the top d eigenvalues. Let <math>\Sigma</math>=diagonal matrix of square roots of the top d eigenvalues. <br /><br />
<br />
Reconstruct training data:<br />
<math>\hat{X}=UZ=XH\Delta^T V \Sigma^{-2}V^T\Delta H(X^T X)H </math> <br /><br />
<br />
Encode test examples: <math>y=U^T(x-\mu)=\Sigma^{-1}V^T \Delta H[X^T(x-\mu)] </math> where y is a d dimensional encoding of x.<br />
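A sketch of the dual algorithm on synthetic high-dimensional data, writing <math>B=\Delta^T\Delta</math> so that <math>Q=\psi\psi^T</math> with <math>\psi=XH\Delta^T</math>; the random <math>\Delta</math> merely stands in for a label-kernel factor. The point checked at the end is that the V-based encoding agrees with <math>U^T(x-\mu)</math> while only a small eigenproblem is solved:<br />

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, d = 50, 200, 2                     # p >> n: dual form avoids a p x p eigenproblem
X = rng.normal(size=(p, n))
Delta = rng.normal(size=(3, n))          # any factor with B = Delta^T Delta
H = np.eye(n) - np.ones((n, n)) / n

Psi = X @ H @ Delta.T                    # psi = X H Delta^T  (p x 3)
G = Psi.T @ Psi                          # small 3 x 3 problem instead of p x p
w, Vfull = np.linalg.eigh(G)
V = Vfull[:, ::-1][:, :d]                # eigenvectors for the top-d eigenvalues
Sigma = np.diag(np.sqrt(w[::-1][:d]))    # square roots of the top-d eigenvalues

U = Psi @ V @ np.linalg.inv(Sigma)       # U = psi V Sigma^{-1}

mu = X.mean(axis=1, keepdims=True)
x = rng.normal(size=(p, 1))
# Encode a test example using V only: y = Sigma^{-1} V^T Delta H X^T (x - mu)
y = np.linalg.inv(Sigma) @ V.T @ Delta @ H @ X.T @ (x - mu)
```
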
<br />
===Towards a Unified Network===<br />
<br />
{| class="wikitable"<br />
|-<br />
! <br />
! B<br />
! Constraint<br />
! Component<br />
|-<br />
| PCA<br />
| I<br />
| <math>\omega^T \omega = I</math><br />
| <br />
|-<br />
| FDA<math>^{(1)}</math><br />
| <math>B_0</math><br />
| <math>\omega^T S_\omega \omega = I</math><br />
| <math>S_\omega = X B_s X^T</math><br />
|-<br />
| CFML I<math>^{(2)}</math><br />
| <math>B_0 - B_s</math><br />
| <math>\omega^T \omega = I</math><br />
| <br />
|-<br />
| CFML II<math>^{(2)}</math><br />
| <math>B_0</math><br />
| <math>\omega^T S_\omega \omega = I</math><br />
| <math>S_\omega = X B_s X^T</math><br />
|}<br />
(1) <math>B_s=F(F^{T}F)^{-1}F^T</math>; (2) <math>B_s=\tfrac{1}{n}FF^{T}</math>, <math>B_D=H-B_s</math>; <math>n</math> is the number of data points,<br />
<math>F</math> is the indicator matrix of cluster membership, and <math>H</math> is the centering matrix.<br />
<br />
===Dual Supervised PCA===<br />
{| class="wikitable"<br />
|-<br />
! <br />
! B<br />
! Constraint<br />
! Component<br />
|-<br />
| KPCA<br />
| I<br />
| <math>UU^T = I</math><br />
| Arbitrary<br />
|-<br />
| K-means<br />
| I<br />
| <math>UU^T = I, U\ge 0</math><br />
| Linear<br />
|}<br />
<br />
== Boosting (Lecture: Nov. 10, 2011) ==<br />
<br />
Boosting is a meta-algorithm that starts with a simple classifier and improves it by refitting the data, giving higher weight to misclassified samples. <br />
<br />
<br />
Suppose that <math>\mathcal{H}</math> is a collection of classifiers. Assume that <br />
<math>\ y_i \in \{-1, 1\} </math> and that each <math>\ h(x)\in \{-1, 1\} </math>. Start with <math>\ h_1(x) </math>. Based on how well <math>\ h_1 (x) </math> classifies points, adjust the weights of each input and reclassify. Misclassified points are given higher weight to ensure the classifier "pays more attention" to them, to fit better in the next iteration. The idea behind boosting is to obtain a classification rule from each classifer <math> h_i(x)\in\mathcal{H}</math>, regardless of how well it classifies the data on its own (with the proviso that its performance be better than chance), and combine all of these rules to obtain a final classifier that performs well. <br />
<br />
[[File:boosting1.jpg]]<br />
<br />
<br />
An intuitive way to look at boosting and the concept of weighting is to consider extreme weightings. Suppose you are classifying a data set and some points are misclassified. Remove the correctly classified points from the data and refit; the next weak classifier then concentrates entirely on the previously misclassified points. This is how early versions of boosting worked: by removing points instead of re-weighting them. <br />
<br />
=== AdaBoost ===<br />
'''Adaptive Boosting (AdaBoost)''' was formulated by Yoav Freund and Robert Schapire. AdaBoost is defined as an algorithm for constructing a “strong” classifier as linear combination <math>f(\mathbf{x}) = \sum_{t=1}^T \alpha_t h_t(\mathbf{x}) </math> of simple “weak” classifiers <math>\ h_t(\mathbf{x})</math>. It is very popular and widely known as the first algorithm that could adapt to weak learners <ref>http://www.cs.ubbcluj.ro/~csatol/mach_learn/bemutato/BenkKelemen_Boosting.pdf </ref>. <br />
<br />
It has the following properties:<br />
<br />
* It is a linear classifier with all its desirable properties<br />
* It has good generalization properties<br />
* It is a feature selector with a principled strategy (minimisation of upper bound on empirical error)<br />
* It is close to sequential decision making<br />
<br />
====Algorithm Version 1====<br />
The AdaBoost algorithm presented in the lecture is as follows (for more info see [http://www.site.uottawa.ca/~stan/csi5387/boost-tut-ppr.pdf]):<br />
<br />
1 Set the weights <math>\ w_i=\frac{1}{n}, i = 1,...,n. </math> <br /><br />
<br />
2 For <math>\ j =1,...,J </math>, do the following steps:<br />
<br />
:a) Find the classifier <math>\ h_j: \mathbf{x} \rightarrow \{-1,1\} </math> that minimizes the weighted error <math>\ L_j </math>:<br />
<br />
:<math>\ h_j= \arg \underset{h\in \mathcal{H}}{\min}\, L_j</math><br />
<br />
:where <math>\ L_j = \frac{\sum_{i=1}^{n}w_iI[y_i\ne h_j(x_i)]}{\sum_{i=1}^{n} w_i}</math><br />
<br />
:<math>\ \mathcal{H} </math> is the set of candidate weak classifiers and <math>\ I</math> is the indicator<br />
::<math>\, I= \left\{\begin{matrix} <br />
1 & for \quad y_i\neq h_j(\mathbf{x}_i) \\ <br />
0 & for \quad y_i = h_j(\mathbf{x}_i) \end{matrix}\right.</math><br /><br />
<br />
:b) Let <math>\alpha_j= log(\frac{1-L_j}{L_j})</math><br />
<br />
::Note that <math>\ \alpha</math> indicates the "goodness" of the classifier, where a larger <math>\ \alpha</math> value indicates a better classifier. Also, <math>\ \alpha</math> is always 0 or positive as long as the classification accuracy is 0.5 or higher. For example, if working with coin flips, then <math>\ L_j=0.5 </math> and <math>\ \alpha=0</math>.<br />
<br />
:c) Update the weights:<br />
<br />
::<math>\ w_i \leftarrow w_i e^{\alpha_j I[y_i\ne h_j(\mathbf{x}_i)]}</math><br />
::Note that the weights are only increased for points that have been misclassified by a good classifier.<br /> <br />
<br />
3 The final classifier is: <math>\ h(\mathbf{x}) = sign (\sum_{j=1}^{J}\alpha_j h_j(\mathbf{x}))</math>. <br />
<br />
:Note that this is basically an aggregation of all the classifiers found and the classification outcomes of better classifiers are weighted more using <math>\ \alpha</math>.<br />
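The three steps above can be sketched in Python, with decision stumps standing in for the weak classifiers in <math>\mathcal{H}</math> (the stump learner, the toy 1-D data, and the tie-breaking rule are illustrative choices, not part of the lecture):<br />

```python
import numpy as np

def adaboost_train(X, y, J):
    """AdaBoost (Algorithm Version 1) with decision stumps as the weak
    classifiers. X: n x p feature matrix, y: labels in {-1, +1}."""
    n, p = X.shape
    w = np.full(n, 1.0 / n)                          # step 1: uniform weights
    stumps = []
    for _ in range(J):
        best = None
        for j in range(p):                           # step 2a: minimize weighted error
            for t in np.unique(X[:, j]):
                for s in (1, -1):
                    pred = s * np.sign(X[:, j] - t + 1e-12)
                    L = np.sum(w * (pred != y)) / np.sum(w)
                    if best is None or L < best[0]:
                        best = (L, j, t, s)
        L, j, t, s = best
        L = min(max(L, 1e-10), 1 - 1e-10)            # guard the log
        alpha = np.log((1 - L) / L)                  # step 2b: classifier "goodness"
        pred = s * np.sign(X[:, j] - t + 1e-12)
        w = w * np.exp(alpha * (pred != y))          # step 2c: up-weight mistakes
        stumps.append((alpha, j, t, s))
    return stumps

def adaboost_predict(stumps, X):
    """Step 3: sign of the alpha-weighted vote of the weak classifiers."""
    f = sum(a * s * np.sign(X[:, j] - t + 1e-12) for a, j, t, s in stumps)
    return np.sign(f)

# Toy 1-D data: the positive class lies in an interval, so no single
# stump separates it, but a few boosted stumps together do.
X = np.array([[0.1], [0.2], [0.4], [0.5], [0.6], [0.8], [0.9]])
y = np.array([-1, -1, 1, 1, 1, -1, -1])
stumps = adaboost_train(X, y, J=3)
train_pred = adaboost_predict(stumps, X)
```

Each <math>\alpha_j</math> here is positive, as the note above predicts for classifiers better than chance.<br />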
<br />
====Algorithm Version 2 <ref>http://www.cs.ubbcluj.ro/~csatol/mach_learn/bemutato/BenkKelemen_Boosting.pdf</ref>====<br />
One of the main ideas of this algorithm is to maintain a distribution or set of weights over the training set. Initially, all weights are set equally, but on each round, the weights of incorrectly classified examples are increased so that the weak learner is forced to focus on the hard examples in the training set.<br />
<br />
* Given <math>\left(\mathbf{x}_1,y_1\right),\dots,\left(\mathbf{x}_m,y_m\right)</math> where <math>{\mathbf{x}_i \in X}</math>, <math>{y_i \in \{-1,+1\}}</math>.<br />
* Initialize weights <math>D_1(i) = \frac{1}{m}</math><br />
* Iterate <math>t=1,\dots, T</math><br />
** Train weak learner using distribution <math>\ D_t</math><br />
** Get weak classifier: <math>h_t:X\rightarrow R</math><br />
** Choose <math>{\alpha_t \in R}</math><br />
** Update the weights: <math>D_{t+1}(i) = \frac {D_t(i) e^{-\alpha_t y_i h_t(\mathbf{x}_i)}} {Z_t}</math><br />
:: where <math>\ Z_t</math> is a normalization factor (chosen so that <math>\ D_{t+1}</math> will be a distribution)<br />
* The final classifier is:<br />
:: <math>H(\mathbf{x})=\mbox{sign}\left(\sum_{t=1}^T \alpha_t h_t(\mathbf{x})\right)</math><br />
<br />
====Example====<br />
<br />
In R, we can do boosting on a simulated classifer. Suppose we are working with the built-in R dataset "iris". These data consist of petal length, sepal length, petal width, and sepal width of three different species of iris. This is an adaptive boosting algorithm as applied to these data.<br />
<br />
<pre style = "align:left; width:100%; padding: 2% 2%"><br />
> library(rpart) #"ada" builds its weak learners with rpart<br />
> library(ada)<br />
> crop1 <- iris[1:100,1] #the function "ada" will only handle two classes<br />
> crop2 <- iris[1:100,2] #and the iris dataset has 3. So crop the third off.<br />
> crop3 <- iris[1:100,3]<br />
> crop4 <- iris[1:100,4]<br />
> crop5 <- iris[1:100,5] #This is the response variable, indicating species of iris<br />
> x <- cbind(crop1, crop2, crop3, crop4, crop5) #combine all the columns<br />
> fr1 <- as.data.frame(x, row.names=NULL) #and coerce into a data frame<br />
> <br />
> a = 2 #number of iterations<br />
> AdaBoostDiscrete <- ada(crop5~., data=fr1, iter=a, loss="e", type = "discrete", control = rpart.control())<br />
> AdaBoostDiscrete <br />
Call:<br />
ada(crop5 ~ ., data = fr1, iter = a, loss = "e", type = "discrete", <br />
control = rpart.control())<br />
<br />
Loss: exponential Method: discrete Iteration: 2 <br />
<br />
Final Confusion Matrix for Data:<br />
Final Prediction<br />
True value 1 2<br />
1 50 0<br />
2 0 50<br />
<br />
Train Error: 0 <br />
<br />
Out-Of-Bag Error: 0 iteration= 1 <br />
<br />
Additional Estimates of number of iterations:<br />
<br />
train.err1 train.kap1 <br />
1 1 <br />
<br />
> #Since this yields "perfect" results, we may not need boosting here after all.<br />
> #This was just an illustration of the ada function in R.<br />
</pre><br />
<br />
====Advantages and Disadvantages====<br />
The advantages and disadvantages of AdaBoost are listed below.<br />
<br />
Advantages :<br />
* Very simple to implement <br />
* Fairly good generalization<br />
* The prior error need not be known ahead of time<br />
<br />
Disadvantages:<br />
* Suboptimal solution<br />
* Can overfit in the presence of noise<br />
<br />
===Other boosters===<br />
There are many other more recent boosters such as LPBoost, TotalBoost, BrownBoost, MadaBoost, LogitBoost, stochastic boosting, etc. The main difference between many of them is the way they weigh the points in the training data set at each iteration. Some of these boosters, such as AdaBoost, MadaBoost and LogitBoost, can be interpreted as performing gradient descent to minimize a convex cost function (they fit into the AnyBoost framework). However, a recent research study showed that this class of boosters is prone to random classification noise, thereby questioning their applicability to real-world noisy classification problems. <ref>Philip M. Long, Rocco A. Servedio, "Random Classification Noise Defeats All Convex Potential Boosters", 2008</ref><br />
<br />
=== Relation to SVM ===<br />
SVM and Boosting are very similar except for the way they measure the margin and the way they optimize their weight vector. SVMs use the <math>l_2</math> norm for both the instance vector and the weight vector, while Boosting uses the <math>l_1</math> norm for the weight vector. That is, SVMs need the <math>l_2</math> norm to implicitly compute scalar products in feature space with the help of the kernel trick; no other norm can be expressed in terms of scalar products.<br />
<br />
Although SVM and AdaBoost share some similarities, there are several important differences:<br />
* Different norms can result in very different margins: in boosting and in SVM the dimension is usually very high, so the difference between the <math>l_1</math> and <math>l_2</math> norms can be significant in the margin values.<br />
<br />
For example, suppose the weak hypotheses all have range {-1,1} and that the label y on all examples can be computed by a majority vote of k of the weak hypotheses. In this case, it can be shown that if the number of relevant weak hypotheses is a small fraction of the total number of weak hypotheses, then the margin associated with AdaBoost will be much larger than the one associated with support vector machines.<br />
<br />
* The computation requirements are different: SVM corresponds to quadratic programming, while AdaBoost corresponds only to linear programming.<br />
<br />
* A different approach is used to search efficiently in high-dimensional space: SVM deals with the overfitting problem through the method of kernels, which allows algorithms to perform low-dimensional calculations that are mathematically equivalent to inner products in a high-dimensional "virtual" space, while boosting approaches often employ a greedy search method.<ref>http://www.iuma.ulpgc.es/camellia/components/com_docman/dl2.php?archive=0&file=c3ZtX2FuZF9ib29zdGluZ19vbmUucGRm</ref><br />
<br />
== Bagging ==<br />
<br />
[[File: Bagging.jpg|250px|thumb|When bagging, we split up the data, train separate classifiers and then recreate a final classifier]]<br />
<br />
'''Bagging (Bootstrap aggregating)''' was proposed by Leo Breiman in 1994. Bagging is another meta-algorithm for improving classification results by combining the classification of randomly generated training sets. [http://www.wikicoursenote.com/wiki/Stat841f10.htm#Bagging][http://en.wikipedia.org/wiki/Bootstrap_aggregating]<br />
<br />
<br />
<br />
The idea behind bagging is very similar to that behind boosting. However, instead of using multiple classifiers on essentially the same dataset (but with adaptive weights), we sample from the original dataset containing m items B times with replacement, obtaining B samples each with m items. This is called bootstrapping. Then, we train the classifier on each of the bootstrapped samples. Taking a majority vote of a combination of all the classifiers, we arrive at a final classifier for the original dataset. [http://www.cs.princeton.edu/courses/archive/spr07/cos424/assignments/boostbag/index.html]<br />
<br />
Bagging is an effective, computationally intensive procedure that can improve unstable classifiers. It is most useful for highly nonlinear classifiers, such as trees. <br />
<br />
As we know, the idea of boosting is to incorporate unequal weights when learning h, giving higher weight to misclassified points. Bagging, by contrast, is a method for reducing the variability of a classifier. The idea is to train classifiers <math>\ h_{1}(x)</math> to <math>\ h_{B}(x)</math> using B bootstrap samples from the data set. The final classification is obtained using an average or 'plurality vote' of the B classifiers as follows:<br />
<br />
<br />
:<math>\, h(x)= \left\{\begin{matrix} <br />
1 & \frac{1}{B} \sum_{i=1}^{B} h_{b}(x) \geq \frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
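A minimal sketch of this procedure in Python, with a mean-threshold stump standing in for the base classifier (the stump, the synthetic data, and B = 25 are illustrative assumptions):<br />

```python
import numpy as np

def stump_fit(Xb, yb):
    """Weak base learner: threshold the first feature at the sample mean."""
    t = Xb[:, 0].mean()
    above = yb[Xb[:, 0] > t]
    side = int(above.mean() >= 0.5) if len(above) else 1   # majority label above t
    return lambda Xn: np.where(Xn[:, 0] > t, side, 1 - side)

def bagged_classifier(X, y, base_fit, B=25, seed=0):
    """Fit base_fit on B bootstrap samples; predict by plurality vote."""
    rng = np.random.default_rng(seed)
    n = len(y)
    models = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)       # bootstrap: n draws with replacement
        models.append(base_fit(X[idx], y[idx]))
    def predict(Xn):
        votes = np.mean([m(Xn) for m in models], axis=0)
        return (votes >= 0.5).astype(int)      # h(x) = 1 iff average vote >= 1/2
    return predict

rng = np.random.default_rng(42)
X = rng.uniform(0.0, 1.0, size=(200, 1))
y = (X[:, 0] > 0.5).astype(int)
predict = bagged_classifier(X, y, stump_fit, B=25)
accuracy = np.mean(predict(X) == y)
```

The averaged vote is exactly the <math>\frac{1}{B}\sum_b h_b(x) \geq \frac{1}{2}</math> rule from the formula above, written for labels in {0, 1}.<br />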
<br />
=== Boosting vs. Bagging ===<br />
<br />
• Bagging does not work so well with stable models; boosting might still help in that case.<br />
<br />
• Boosting might hurt performance on noisy datasets; bagging doesn't have this problem.<br />
<br />
• Many classifiers, such as trees, already have underlying functions that estimate the class probabilities at x. An alternative strategy is to average these class probabilities instead of the final classifiers. This approach can produce bagged estimates with lower variance and usually better performance.<br />
<br />
• In practice bagging almost always helps. On average, boosting usually helps more than bagging, but it is also more common for boosting to hurt performance.<br />
<br />
• In boosting, the weights of repeatedly misclassified points grow exponentially.<br />
<br />
• Bagging is easier to parallelize.<br />
<br />
== Decision Trees ==<br />
<br />
<br />
[[File: simple_decision_tree.jpg|right|frame|A basic example of a decision tree, iteratively ask questions to navigate the tree until we reach a decision node.]]<br />
<br />
'''Decision tree learning''' is a method commonly used in statistics, data mining and machine learning. The goal is to create a model that predicts the value of a target variable based on several input variables. It is a very flexible classifier, can classify non-linear data and it can be used for classification, regression, or both. A tree is usually used as a visual and analytical decision support tool, where the expected values of competing alternatives are calculated. <br />
<br />
<br />
It uses the principle of divide and conquer for classification. The trees have traditionally been created manually. Trees map features of a decision problem onto a conclusion, or label. We fit a tree model by minimizing some measure of impurity. For a single covariate <math>\ X_1 </math> we choose a point t on the real line that splits the real line into two sets <math>\ R_1 = (-\infty, t] , R_2 = (t, \infty) </math> in a way that minimizes impurity.<br />
<br />
[[File: p.jpg|right|frame|Node impurity for two-class classification, as a function of the proportion p in class 2. Cross-entropy has been scaled to pass through (0.5,0.5).]]<br />
<br />
Let <math>\hat{p_s}(j) </math> be the proportion of observations in <math>\boldsymbol R_s </math> such that <math>\ Y_i = j</math> <br /><br />
<br />
<math>\hat{p_s}(j) = \frac {\sum_{i=1}^n I(Y_i = j, X_i \in \boldsymbol R_s)}{\sum_{i=1}^n I(x_i \in \boldsymbol R_s)}</math><br /><br />
<br />
<br />
Node impurity measures (see figure to the right):<br />
<br />
:Misclassification error: <math>\ 1 - \max_j \hat{p_s}(j) </math><br />
:Gini index: <math>\sum_{i \neq j} \hat{p_s}(i)\hat{p_s}(j)</math><br />
:Cross-entropy: <math>-\sum_{j} \hat{p_s}(j) \log \hat{p_s}(j)</math><br />
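The impurity measures (including the cross-entropy shown in the figure) can be sketched as follows for a two-class region; the data and split point are made up for the example:<br />

```python
import numpy as np

def impurities(labels, region_mask):
    """Node impurity measures for the observations falling in one region R_s."""
    y = labels[region_mask]
    p = np.bincount(y, minlength=2) / len(y)    # class proportions p_hat_s(j)
    misclass = 1.0 - p.max()                    # misclassification error
    gini = 1.0 - np.sum(p ** 2)                 # Gini index
    nz = p[p > 0]
    entropy = -np.sum(nz * np.log2(nz))         # cross-entropy
    return misclass, gini, entropy

y = np.array([0, 0, 0, 1, 1, 1, 1, 1])
x = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])

root = impurities(y, np.ones_like(y, dtype=bool))   # the unsplit node: p = (3/8, 5/8)
left = impurities(y, x <= 0.35)                     # region (-inf, 0.35]: pure
right = impurities(y, x > 0.35)                     # region (0.35, inf): pure
```

All three measures are zero for a pure region and maximal at p = 1/2, so splitting at t = 0.35 here reduces the impurity to zero on both sides.<br />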
<br />
'''Limitations of Decision Trees'''<br />
<br />
1. Overfitting problem:<br />
Decision trees are extremely flexible models; this flexibility means that they can easily match any training set perfectly. This makes overfitting a prime consideration when training a decision tree. There is no robust way to avoid fitting noise in the data, but two common approaches include:<br />
<br />
* stop growing the tree early, before it fits the training set perfectly<br />
* fully grow the tree and then prune the resulting tree. Pruning algorithms include cost complexity pruning, minimum description length pruning and pessimistic pruning. This results in a tree with less branches, which can generalize better. <ref>J. R. Quinlan, Decision Trees and Decision Making, IEEE Transactions on Systems, Man and Cybernetics, vol 20, no 2, March/April 1990, pg 339-346.</ref><br />
<br />
<br />
2. Time-consuming and complex:<br />
Compared to other decision-making models, a decision tree is a relatively easy tool to use; however, if the tree contains a large number of branches, it becomes complex in nature and takes time to solve the problem.<br />
Moreover, decision trees only examine a single field at a time, which leads to rectangular classification boxes. The complexity also adds the cost of training people to have the extensive knowledge needed to complete the decision tree analysis. <ref><br />
http://www.brighthub.com/office/project-management/articles/106005.aspx<br />
</ref><br />
<br />
<br />
Some specific decision-tree algorithms:<br />
* ID3 algorithm [http://en.wikipedia.org/wiki/ID3_algorithm]<br />
* C4.5 algorithm [http://en.wikipedia.org/wiki/C4.5_algorithm]<br />
* C5 algorithm<br />
<br />
A comparison of bagging and boosting methods using the decision trees classifiers: [http://www.doiserbia.nb.rs/img/doi/1820-0214/2006/1820-02140602057M.pdf]<br />
<br />
=== CART (Classification and Regression Tree)===<br />
<br />
The '''Classification and Regression Tree (CART)''' is a non-parametric decision tree learning technique that produces either classification or regression trees, depending on whether the dependent variable is categorical or numeric, respectively. (Wikipedia) CART also handles outliers well: it isolates them in a separate node.<br />
<br />
Advantages<ref>http://www.statsoft.com/textbook/classification-and-regression-trees/</ref>:<br />
* '''Simplicity of results'''. In most cases the results are summarized in a very simple tree. This is important for fast classification and for creating a simple model for explaining the observations.<br />
* '''Tree methods are nonparametric and nonlinear'''. There is no implicit assumption that the underlying relationships between the predictor variables and the dependent variable is linear or monotonic. Thus tree methods are well suited to data mining tasks where there is little a priori knowledge of any related variables.<br />
<br />
===Advantages and Disadvantages===<br />
<br />
Decision Tree Advantages <br />
<br />
1. Easy to understand <br />
<br />
2. Map nicely to a set of business rules <br />
<br />
3. Applied to real problems <br />
<br />
4. Make no prior assumptions about the data <br />
<br />
5. Able to process both numerical and categorical data <br />
<br />
Decision Tree Disadvantages <br />
<br />
1. Output attribute must be categorical <br />
<br />
2. Limited to one output attribute <br />
<br />
3. Decision tree algorithms are unstable <br />
<br />
4. Trees created from numeric datasets can be complex<br />
<br />
Read more: http://wiki.answers.com/Q/List_the_advantages_and_disadvantages_for_both_decision_table_and_decision_tree<br />
<br />
===Ranking Features===<br />
In implementation of a tree model it is important how the features are ranked (i.e. in what order the features appear in the tree). The general way to do this is to choose the features with the highest dependence on Y to be the first feature in the tree and then going down the tree with lower dependence.<br />
<br />
'''Feature ranking strategies'''<br />
<br />
1. Fisher score (F-score)<br />
* simple in nature<br />
* efficient in measuring the discrimination between a feature and the label.<br />
* independent of the classifier.<br />
<br />
2. Linear SVM Weight<br />
<br />
The following is an algorithm based on linear SVM weights:<br />
<br />
* input the training sets: <math>(x_i, y_i), i = 1, \dots l</math> <br />
* obtain the sorted feature ranking list as output:<br />
** Use grid search to find the best parameter C. <br />
** Train an L2-loss linear SVM model using the best available C.<br />
** Sort the features according to the absolute values of their weights.<br />
<br />
3. Change of AUC with/without Removing Each Feature<br />
<br />
4. Change of Accuracy with/without Removing Each Feature<br />
<br />
5. Normalized [http://en.wikipedia.org/wiki/Information_gain Information Gain] (difference in entropy)<br />
<br />
note: for details, please read <ref><br />
http://jmlr.csail.mit.edu/proceedings/papers/v3/chang08a/chang08a.pdf<br />
</ref><br />
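As an illustration of strategy 1, the Fisher score can be computed directly. The two-feature synthetic data and the exact score formula used here (squared difference of class means over the summed within-class variances, one common form of the F-score) are assumptions for the sketch:<br />

```python
import numpy as np

def fisher_scores(X, y):
    """F-score per feature for binary labels y in {0, 1}: squared difference
    of class means divided by the sum of within-class variances."""
    Xp, Xn = X[y == 1], X[y == 0]
    num = (Xp.mean(axis=0) - Xn.mean(axis=0)) ** 2
    den = Xp.var(axis=0, ddof=1) + Xn.var(axis=0, ddof=1)
    return num / den

rng = np.random.default_rng(4)
n = 500
y = rng.integers(0, 2, size=n)
informative = y + 0.3 * rng.normal(size=n)   # strongly tied to the label
noise = rng.normal(size=n)                   # unrelated to the label
X = np.column_stack([noise, informative])

scores = fisher_scores(X, y)
ranking = np.argsort(scores)[::-1]           # features by decreasing score
```

As expected, the informative feature scores far higher than the noise feature, so it would be placed nearer the root of a tree.<br />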
<br />
===Random Forest=== <br />
Decision trees are unstable. One application of bagging is to combine trees into a random forest. A random forest is a classifier consisting of a collection of tree-structured classifiers <math>\left \lbrace \ h(x, \Theta_k ), k = 1, . . . \right \rbrace</math> where the <math>{\Theta_k } </math> are independent identically distributed random vectors and each tree casts a unit vote for the most popular class at input <math>x</math> <ref>Breiman, L., Random Forests ''Machine Learning'' [http://www.springerlink.com/content/u0p06167n6173512/fulltext.pdf]</ref>.<br />
<br />
In a random forest, the trees are grown quite similarly to the standard classification tree. However, no pruning is done in the random forest technique. <br />
<br />
Compared with other methods, random forests have some positive characteristics:<br />
<br />
* runs faster than bagging or boosting<br />
* has similar accuracy as Adaboost, and sometimes even better than Adaboost<br />
* relatively robust to noise<br />
* delivers useful estimates of error, correlation<br />
<br />
For larger data sets, more accuracy can be obtained by combining random features with boosting.<br />
<br />
'''This is how a single tree is grown:'''<br />
<br />
First, suppose the number of elements in the training set is K. We then sample K elements with replacement. <br />
Second, if there are a total of N inputs to the tree, choose an integer n << N such that for each node of the tree, n variables are randomly selected from the N. The best split on these n variables is used to allow the node to make a decision (hence a "decision tree"). <br />
Third, grow the tree as large as possible, without pruning. <br />
<br />
Each tree contributes one classification. That is, each tree gets one "vote" to classify an element. The beauty of random forest is that all of these votes are added up, similar to boosting, and the final decision is the result of the vote. This is an extremely robust algorithm. <br />
<br />
There are two things that can contribute to error in random forest: <br />
<br />
1. correlation between trees <br />
2. the ability of an individual tree to classify well. <br />
<br />
This is seen intuitively, since if many trees are very similar to one another, then it is likely they will all classify the elements in the same way. If a single tree is not a very good classifier, it does not matter in the long run because the other trees will compensate for its error. However, if many trees are bad classifiers, the result will be garbage.<br />
<br />
To avoid both of the above problems, there is an algorithm to optimize n, the number of variables to use in each decision tree. Unfortunately, an optimal value is not found on its own; instead, an optimal range is found. Thus, to properly program a random forest, there is a parameter that must be "tuned". Looking at various types of error rate, this is easily found (we want to minimize error, as characterized by the Gini index, or the misclassification rate, or the entropy). [http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#intro]<br />
<br />
An algorithm for the random forest can be described as follows: first, let <math>N_{trees}</math> be the number of trees to build. For each of the <math>N_{trees}</math> iterations, select a new bootstrap sample from the training set, grow an un-pruned tree on this bootstrap sample, and at each internal node randomly select m predictors and determine the best split using only these predictors. Finally, do not perform cost-complexity pruning; save each tree as is, alongside those built thus far. <ref><br />
Albert A. Montillo, Guest lecture: Statistical Foundations of Data Analysis "Random Forests", April 2009. <http://www.dabi.temple.edu/~hbling/8590.002/Montillo_RandomForests_4-2-2009.pdf><br />
</ref><br />
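A toy sketch of this algorithm in Python, with randomized depth-1 stumps standing in for fully grown trees (a real random forest grows large un-pruned trees; the stump learner, the synthetic data, and all parameter values are illustrative only):<br />

```python
import numpy as np

def grow_stump(Xb, yb, n_feats, rng):
    """One randomized depth-1 tree: n_feats candidate features are chosen at
    random and the best mean-threshold split among them is kept. A stand-in
    for the fully grown, un-pruned trees of a real random forest."""
    feats = rng.choice(Xb.shape[1], size=n_feats, replace=False)
    best = None
    for j in feats:
        t = Xb[:, j].mean()
        above, below = yb[Xb[:, j] > t], yb[Xb[:, j] <= t]
        if len(above) == 0 or len(below) == 0:
            continue
        side = int(above.mean() >= 0.5)                # majority label above t
        acc = (np.sum(above == side) + np.sum(below == 1 - side)) / len(yb)
        if best is None or acc > best[0]:
            best = (acc, j, t, side)
    _, j, t, side = best
    return lambda Xn: np.where(Xn[:, j] > t, side, 1 - side)

def random_forest(X, y, n_trees, n_feats, seed=0):
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(y), size=len(y))     # bootstrap sample
        trees.append(grow_stump(X[idx], y[idx], n_feats, rng))
    def predict(Xn):
        votes = np.mean([tree(Xn) for tree in trees], axis=0)  # one vote per tree
        return (votes >= 0.5).astype(int)
    return predict

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)          # only feature 0 carries signal
predict = random_forest(X, y, n_trees=25, n_feats=1)
accuracy = np.mean(predict(X) == y)
```

Even though roughly half the trees here are forced to split on the noise feature (n = 1 of N = 2 predictors per node), the majority vote recovers the signal, illustrating how the ensemble compensates for individual weak trees.<br />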
<br />
===Further Reading===<br />
<br />
Boosting: <ref>Chunhua Shen; Zhihui Hao. “A direct formulation for totally-corrective multi-class boosting”. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2011.</ref><br />
<br />
Bagging: <ref>Xiaoyuan Su; Khoshgoftarr, T.M.; Xingquan Zhu. “VoB predictors: Voting on bagging classifications”. 19th IEEE International Conference on Pattern Recognition. 2008.</ref><br />
<br />
Decision Tree: <ref> Zhuowen Tu. “Probabilistic boosting-tree: learning discriminative models for classification, recognition, and clustering”. Tenth IEEE International Conference on Computer Vision. 2005.</ref><br />
<br />
== Graphical Models ==<br />
<br />
A graphical model is a probabilistic model for which a graph denotes the conditional independence structure between random variables. They are commonly used in probability theory, statistics—particularly Bayesian statistics—and machine learning.(Wikipedia)<br />
<br />
Graphical models provide a compact representation of the joint distribution, where the vertices (nodes) V represent random variables and the edges E represent dependencies between the variables. There are two forms of graphical models (directed and undirected). Directed graphical models consist of arcs and nodes, where an arc indicates that the parent is an explanatory variable for the child. Undirected graphical models are based on the assumption that two nodes or two sets of nodes are conditionally independent given their neighbours[http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html 1].<br />
<br />
Similar types of analysis predate the area of Probabilistic Graphical Models and its terminology. Bayesian Network and Belief Network are preceding terms used to describe a directed acyclic graphical model, while Markov Random Field (MRF) and Markov Network are preceding terms used to describe an undirected graphical model. Probabilistic Graphical Models have united some of the theory behind these older approaches and allow for more generalized distributions than were possible in the previous methods.<br />
<br />
[[File:directed.png|thumb|right|Fig.1 A directed graph.]]<br />
[[File:undirected.png|thumb|right|Fig.2 An undirected graph.]]<br />
<br />
In the case of directed graphs, the direction of an arrow indicates "causation". This assumption makes these networks useful for cases where we want to model causality, such as applications in computational biology and bioinformatics, where we study the effect of some variables on another variable. For example:<br />
<br /><br />
<math>A \longrightarrow B</math>: <math>A\,\!</math> "causes" <math>B\,\!</math>.<br />
<br />
<br />
<br />
{| class="wikitable"<br />
|-<br />
! Y<br />
! Y<br />
|-<br />
| <math>\downarrow</math><br />
| <math>\uparrow</math><br />
|-<br />
| Generative LDA<br />
| Linear Discrimination<br />
|}<br />
<br />
Probabilistic ''Discriminative'' Models: Model posterior probability P(Y|X) directly (example: LDA).<br />
<br />
Advantages of discriminative models<br />
* Obtain desired posterior probability directly<br />
* Fewer parameters<br />
<br />
''Generative'' Model: Compute posterior probabilities using Bayes' Rule from the class-conditional densities and class priors. <ref>https://liqiangguo.wordpress.com/2011/05/26/discriminative-model-vs-generative-model/</ref><br />
<br />
Advantages of generative models:<br />
*Can generate (sample) new data points from the learned joint distribution<br />
<br />
For an introduction to graphical models, see: [http://www.cs.ubc.ca/~murphyk/Papers/intro_gm.pdf]<br />
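As a minimal worked example of a directed graphical model, consider the two-node network <math>A \longrightarrow B</math> above. The joint distribution factorizes as P(A,B) = P(A)P(B|A), and the posterior P(A|B) then follows from Bayes' rule. A small sketch (the probability values below are made up purely for illustration):

```python
# P(A) and P(B|A) for the two-node directed network A -> B (illustrative values)
P_A = {0: 0.7, 1: 0.3}
P_B_given_A = {0: {0: 0.9, 1: 0.1},
               1: {0: 0.2, 1: 0.8}}

def joint(a, b):
    """Joint distribution factorizes along the graph: P(A,B) = P(A) P(B|A)."""
    return P_A[a] * P_B_given_A[a][b]

# marginalize out A to get P(B=1)
P_B1 = sum(joint(a, 1) for a in P_A)

# "invert the arrow" with Bayes' rule: P(A=1 | B=1)
P_A1_given_B1 = joint(1, 1) / P_B1
```

Here <code>P_B1</code> works out to 0.7·0.1 + 0.3·0.8 = 0.31, and the posterior follows directly from the factorized joint.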
<br />
=Boltzmann Machines=<br />
<br />
==Introduction==<br />
<br />
[[Image:GBMRBM.jpg|thumb|200px|right|Reference: [2]]]<br />
<br />
Boltzmann machines are networks of connected nodes which, using a stochastic decision-making process, decide to be on or off. These connections need not be directed; that is, they can go back and forth between layers. This type of formulation leads the reader to think immediately of a binomial distribution, with some probability p of each node being on or off. In a classification problem, a Boltzmann Machine is presented with a set of binary vectors, each entry of a vector called a “unit”, with the goal of learning to generate these vectors. [1]<br />
<br />
Similar to the neural networks already discussed in class, a Boltzmann Machine must assign weights to inputs, compute some combination of the weights times contributing node values, and optimize the weights such that a certain cost function (such as the relative entropy, as discussed later) is minimized. The cost function depends on the complexity of the model and the “correctness” of the classification. The main idea is to make small updates in the connection weights iteratively.<br />
<br />
Boltzmann Machines are often used in generative models. That is, we start with some process seen in real life and try to reproduce it, with a goal of predicting future behaviour of the system by generating from the probability distribution created by the Boltzmann Machine.<br />
<br />
==How a Boltzmann Machine Works==<br />
<br />
Suppose we start with a pattern <math>\gamma</math> that represents some real-life dynamical system. The true probability distribution function of this system is <math>f_\gamma</math>. For each element in the vectors associated with this system, we create a visible unit in the Boltzmann Machine whose function is directly related to the value of that element. Then, usually, to capture higher-order regularities in the pattern, we create hidden units (similar to feed-forward neural networks). Sometimes researchers choose not to use hidden units, but this leads to an inability to learn higher-order regularities [5]. There are two possible values for each node in the Boltzmann Machine: “on” or “off”. There is a difference in energy between these states, and each node must compute this difference to see which state would be more favourable. This difference is called the “energy gap”. <br />
<br />
Each node of the Boltzmann Machine is presented an opportunity to update its status. When a set of input vectors is shown to the layer, a computation takes place within each node to decide to convert to “on” or to remain “off”. The computation is as follows:<br />
<br />
<math> \Delta E_i = E_{-1} - E_{+1} = \sum_j w_{ij}S_j </math> <br />
<br />
Where <math> w_{ij} </math> represents the weight between nodes i and j, and <math> S_j </math> is the state of the jth component. <br />
<br />
Then the probability that the node will adopt the “on” state is:<br />
<br />
<math> P(+1) = \frac{1}{1 + \exp(-\Delta E_i / T)} </math><br />
<br />
Where T is the temperature of the system. At equilibrium, the probability of the network being in any global state v follows the Boltzmann distribution: it is proportional to the exponential of the negative energy of that state, normalized over all possible states:<br />
<br />
<math> P(v) = \frac{e^{-E(v)}}{\sum_u e^{-E(u)}} </math><br />
<br />
And the energy of a vector is defined as:<br />
<br />
<math> E({v}) = -\sum_i s^{v}_i b_i -\sum_{i<j} s^{v}_i s^{v}_j w_{ij} </math> [1]<br />
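The energy function and the unit-update rule above can be sketched directly in code for binary 0/1 units with a symmetric weight matrix and the bias terms <math>b_i</math> from the energy formula. Everything here is an illustrative sketch, not a reference implementation; the weights and biases in the usage example are arbitrary:

```python
import math
import random

def energy(s, w, b):
    """E(s) = -sum_i s_i b_i - sum_{i<j} s_i s_j w_ij  (w symmetric, s_i in {0,1})."""
    n = len(s)
    return -sum(s[i] * b[i] for i in range(n)) \
           - sum(s[i] * s[j] * w[i][j] for i in range(n) for j in range(i + 1, n))

def energy_gap(s, w, b, i):
    """Delta E_i = E(unit i off) - E(unit i on) = b_i + sum_{j != i} w_ij s_j."""
    return b[i] + sum(w[i][j] * s[j] for j in range(len(s)) if j != i)

def update_unit(s, w, b, i, T, rng):
    """Stochastically turn unit i on with probability 1 / (1 + exp(-Delta E_i / T))."""
    p_on = 1.0 / (1.0 + math.exp(-energy_gap(s, w, b, i) / T))
    s[i] = 1 if rng.random() < p_on else 0
```

Repeatedly calling <code>update_unit</code> on randomly chosen units, while gradually lowering T, is the annealing process described below.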
<br />
Simulated annealing, a method that improves the search for a global minimum by gradually lowering the temperature T, is used here; on its own it may not succeed in finding the global minimum [3]. This may be a foreign concept to statisticians; for more information, consult [6] and [7]. Each unit's state is then set stochastically according to the logistic probability above, so moves that increase the energy are occasionally accepted rather than always choosing the lower-energy state. <br />
<br />
Eventually, through learning, the Boltzmann Machine will reach an equilibrium state, much like a Markov Chain. This equilibrium state will have a low temperature. Once equilibrium has been reached, we can estimate the probability distribution across the nodes of the Boltzmann Machine. Using this information, we can model how the dynamical system will behave in the long run. <br />
<br />
Since the system is in equilibrium, we can use the mean value of each visible unit to build a probability model. We wouldn’t want to do these calculations before reaching equilibrium, because they would not be representative of the long-term behaviour of the system. Let this measured distribution be denoted <math>f_\delta</math>. Then we are interested in measuring the difference between the true distribution <math>f_\gamma</math> and this measured distribution. <br />
<br />
There are several different methods that can be used to compare distributions. One that is commonly used is the relative entropy:<br />
<br />
<math> G(f_\gamma \| f_\delta) = \sum_{v} f_\gamma(v) \ln\frac{f_\gamma(v)}{f_\delta(v)} </math> [5]<br />
<br />
We want to minimize this distance, since we want the measured distribution to be as close as possible to the true distribution.<br />
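The relative entropy translates directly into code over a finite set of states, using the usual convention that terms with zero true probability contribute nothing (a sketch; the distributions in the usage note are illustrative):

```python
import math

def rel_entropy(f_true, f_model):
    """G(f_gamma || f_delta) = sum over states of f_gamma * ln(f_gamma / f_delta).

    Terms with f_gamma = 0 are skipped (0 * ln 0 is taken to be 0)."""
    return sum(p * math.log(p / q) for p, q in zip(f_true, f_model) if p > 0)
```

It is zero exactly when the two distributions agree, and positive otherwise, which is why it serves as the cost function to minimize.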
<br />
==Learning in Boltzmann Machines==<br />
<br />
The Two-Phase Method<br />
<br />
Boltzmann Machines using hidden units are very robust tools. Visible units are coupled, leading to a problem when trying to capture the effects of higher-dimensional regularities. When hidden units are introduced, the system has the ability to define and use these regularities.<br />
<br />
One approach to learning Boltzmann Machines is discussed thoroughly in [5]. To summarize, this approach makes use of two phases. <br />
<br />
Phase 1: Fix all visible units. Allow the hidden units to change as necessary to obtain equilibrium. Then, look at pairs of units. If two elements of a pair are both “on”, then increment the weight associated with them. So this phase consists entirely of “learning”. There is no control for spurious data.<br />
<br />
Phase 2: No units are fixed. Allow all units to change as necessary to obtain equilibrium. Then sample the final equilibrium distribution to find reliable averages of the term <math>s_i s_j</math>. Then as before, look for pairs of units that are both “on”, and decrement the weight associated with them. So this is the phase in which spurious data are eliminated.<br />
<br />
Alternate between these two phases. Eventually, the equilibrium distribution will be reached and we see that <math> \frac{\partial {G}}{\partial {w_{ij}}} = -\frac{1}{T} (\langle s_i s_j\rangle^{+} - \langle s_i s_j\rangle^{-}) </math>, where <math> \langle s_i s_j\rangle^{+} </math> and <math> \langle s_i s_j\rangle^{-} </math> are the probabilities of finding units i and j both “on” when the network is ‘fixed’ and ‘free-running’, respectively [5]. <br />
Another method, for learning Deep Boltzmann Machines, is presented in [2].<br />
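A sketch of the resulting weight update: estimate <math>\langle s_i s_j\rangle</math> from the samples collected in the clamped ('+') phase and the free-running ('−') phase, then move each weight a small step along the negative gradient of G. The function names and the learning rate are illustrative, not from any reference implementation:

```python
def correlations(samples):
    """Average <s_i s_j> over a list of binary state vectors."""
    n = len(samples[0])
    c = [[0.0] * n for _ in range(n)]
    for s in samples:
        for i in range(n):
            for j in range(n):
                c[i][j] += s[i] * s[j]
    return [[v / len(samples) for v in row] for row in c]

def update_weights(w, clamped, free, lr=0.1, T=1.0):
    """w_ij += (lr/T) * (<s_i s_j>^+ - <s_i s_j>^-), i.e. Phase 1 increments,
    Phase 2 decrements (diagonal entries are skipped)."""
    cp, cm = correlations(clamped), correlations(free)
    n = len(w)
    for i in range(n):
        for j in range(n):
            if i != j:
                w[i][j] += lr / T * (cp[i][j] - cm[i][j])
```

When units i and j co-occur more often clamped than free-running, the weight between them grows, exactly as the two-phase description above prescribes.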
<br />
==Pros and Cons of using Boltzmann Machines==<br />
<br />
Pros<br />
<br />
* More accurate than backpropagation [5]<br />
* Bayesian interpretation of how good a model is [5]<br />
<br />
Cons<br />
<br />
* Very slow, because of nested loops necessary to perform phases [5]<br />
<br />
There are many topics on which this discussion could be expanded. For example, we could get into a more in-depth discussion of simulated annealing, or look at Restricted Boltzmann Machines (RBMs) for deep learning, or different methods of learning and different measures of error. Another interesting topic would be a discussion on mean field approximation of Boltzmann Machines, which supposedly runs faster.<br />
<br />
*The time the machine must be run in order to collect equilibrium statistics grows exponentially with the machine's size, and with the magnitude of the connection strengths<br />
*Connection strengths are more plastic when the units being connected have activation probabilities intermediate between zero and one, leading to a so-called variance trap. The net effect is that noise causes the connection strengths to random walk until the activities saturate.<br />
References: <br/><br />
[1] http://www.scholarpedia.org/article/Boltzmann_machine <br/><br />
[2] http://www.mit.edu/~rsalakhu/papers/dbm.pdf <br/><br />
[3] http://mathworld.wolfram.com/SimulatedAnnealing.html <br/><br />
[4] http://waldron.stanford.edu/~jlm/papers/PDP/Volume%201/Chap7_PDP86.pdf <br/><br />
[5] http://cs.nyu.edu/~roweis/notes/boltz.pdf <br/><br />
[6] http://neuron.eng.wayne.edu/tarek/MITbook/chap8/8_3.html <br/><br />
[7] Bertsimas and Tsitsiklis. Simulated Annealing. Statistical Science. 1993. Vol. 8, No. 1, 10 – 15. <br/><br />
<br />
==References==<br />
<references /></div>
<hr />
<div>== [[stat841f14 | Data Visualization (Stat 442 / 842, CM 762 - Fall 2014) ]] ==<br />
== Archive ==<br />
==[[f11Stat841proposal| Proposal for Final Project]]==<br />
==[[f11Stat841presentation| Presentation Sign Up]]==<br />
<br />
==[[f11Stat841EditorSignUp| Editor Sign Up]]==<br />
<br />
= STAT 441/841 / CM 463/763 - Tuesday, 2011/09/20 =<br />
== Wiki Course Notes ==<br />
Students will need to contribute to the wiki for 20% of their grade.<br />
Access via wikicoursenote.com<br />
Go to editor sign-up, and use your UW userid for your account name, and use your UW email.<br />
<br />
primary (10%)<br />
Post a draft of lecture notes within 48 hours. <br />
You will need to do this 1 or 2 times, depending on class size.<br />
<br />
secondary (10%)<br />
Make improvements to the notes for at least 60% of the lectures.<br />
More than half of your contributions should be technical rather than editorial.<br />
There will be a spreadsheet where students can indicate what they've done and when.<br />
The instructor will conduct random spot checks to ensure that students have contributed what they claim.<br />
<br />
<br />
== Classification (Lecture: Sep. 20, 2011) ==<br />
<br />
===Introduction===<br />
''Machine learning'' (ML) methodology, in general, is an artificial intelligence approach for establishing and training a model to recognize the pattern or underlying mapping of a system based on a set of training examples consisting of input and output patterns. Unlike classical statistics, where inference is made from small datasets, machine learning involves drawing inference from an overwhelming amount of data that could not reasonably be parsed by hand.<br />
<br />
In machine learning, pattern recognition is the assignment of some sort of output value (or label) to a given input value (or instance), according to some specific algorithm. The approach of using examples to produce the output labels is known as ''learning methodology''. When the underlying function from inputs to outputs exists, it is referred to as the target function. The estimate of the target function which is learned or output by the learning algorithm is known as the solution of learning problem. In case of classification this function is referred to as the ''decision function''. <br />
<br />
In the broadest sense, any method that incorporates information from training samples in the design of a classifier employs learning. Learning tasks can be classified along different dimensions. One important dimension is the distinction between supervised and unsupervised learning. In supervised learning, a category label for each pattern in the training set is provided. The trained system will then generalize to new data samples. In unsupervised learning, on the other hand, the training data has not been labeled, and the system forms clusters or natural groupings of input patterns based on some measure of similarity; it can then be used to determine the correct output value for new data instances. <br />
<br />
The first category is known as ''pattern classification'' and the second one as ''clustering''. Pattern classification is the main focus in this course. <br />
<br />
<br />
'''Classification problem formulation''': Suppose that we are given ''n'' observations. Each observation consists of a pair: a vector <math>\mathbf{x}_i \in \mathbb{R}^d, \quad i=1,...,n</math>, and the associated label <math>y_i</math>,<br />
where <math>\mathbf{x}_i = (x_{i1}, x_{i2}, ... x_{id}) \in \mathcal{X} \subset \mathbb{R}^d</math> and <math>y_i</math> belongs to some finite set <math>\mathcal{Y}</math>.<br />
<br />
The classification task is now looking for a function <math>f:\mathbf{x}_i\mapsto y</math> which maps the input data points to a target value (i.e. a class label). The function <math>f(\mathbf{x},\theta)</math> is defined by a set of parameters <math>\mathbf{\theta}</math>, and the goal is to train the classifier such that, among all possible mappings with different parameters, the obtained decision boundary gives the minimum classification error.<br />
<br />
=== Definitions ===<br />
<br />
The '''true error rate''' for classifier <math>h</math> is the error with respect to the unknown underlying distribution when predicting a discrete random variable Y from a given input X.<br />
<br />
<math>L(h) = P(h(X) \neq Y )</math><br />
<br />
<br />
The '''empirical error rate''' is the error of our classification function <math>h(x)</math> on a given dataset with known outputs (e.g. training data, test data)<br />
<br />
<math>\hat{L}_n(h) = (1/n) \sum_{i=1}^{n} \mathbf{I}(h(X_i) \neq Y_i)</math><br />
where <math>h</math> is a classifier<br />
and <math>\mathbf{I}()</math> is an indicator function. The indicator function is defined by <br />
<br />
<math>\mathbf{I}(x) = \begin{cases} <br />
1 & \text{if } x \text{ is true} \\<br />
0 & \text{if } x \text{ is false}<br />
\end{cases}</math><br />
<br />
So in this case,<br />
<math>\mathbf{I}(h(X_i)\neq Y_i) = \begin{cases}<br />
1 & \text{if } h(X_i)\neq Y_i \text{ (i.e. misclassification)} \\<br />
0 & \text{if } h(X_i)=Y_i \text{ (i.e. classified properly)}<br />
\end{cases}</math><br />
<br />
<br />
For example, suppose we have 100 new data points with known (true) labels<br />
<br />
<math>X_1 ... X_{100}</math><br />
<math>y_1 ... y_{100}</math><br />
<br />
To calculate the empirical error, we count how many times our function <math>h(X)</math> classifies incorrectly (does not match <math>y</math>) and divide by n=100.<br />
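The empirical error rate is straightforward to compute; a small sketch with a toy one-dimensional classifier (the threshold rule and data below are purely illustrative):

```python
def empirical_error(h, X, y):
    """hat{L}_n(h) = (1/n) * number of points where h(X_i) != Y_i."""
    return sum(h(x) != yi for x, yi in zip(X, y)) / len(y)

h = lambda x: 1 if x >= 0.5 else 0   # a toy threshold classifier
X = [0.1, 0.4, 0.6, 0.9]
y = [0, 1, 1, 1]
# h misclassifies only x = 0.4, so the empirical error is 1/4
```

The indicator function is implicit in the comparison <code>h(x) != yi</code>, which is 1 for a misclassification and 0 otherwise.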
<br />
=== Bayes Classifier ===<br />
The principle of the Bayes Classifier is to calculate the posterior probability of a given object from its prior probability via Bayes' Rule, and then assign the object to the class with the largest posterior probability<ref> http://www.wikicoursenote.com/wiki/Stat841#Bayes_Classifier </ref>.<br />
<br />
First recall Bayes' Rule, in the format<br />
<math>P(Y|X) = \frac{P(X|Y) P(Y)} {P(X)} </math> <br />
<br />
P(Y|X) : ''posterior'' , ''probability of <math>Y</math> given <math>X</math>''<br />
<br />
P(X|Y) : ''likelihood'', ''probability of <math>X</math> being generated by <math>Y</math>''<br />
<br />
P(Y) : ''prior'', ''probability of <math>Y</math> being selected''<br />
<br />
P(X) : ''marginal'', ''probability of obtaining <math>X</math>''<br />
<br />
<br />
We will start with the simplest case: <math>\mathcal{Y} = \{0,1\}</math><br />
<br />
<math> r(x) <br />
= P(Y=1|X=x) <br />
= \frac{P(X=x|Y=1) P(Y=1)} {P(X=x)}<br />
= \frac{P(X=x|Y=1) P(Y=1)} {P(X=x|Y=1) P(Y=1) + P(X=x|Y=0) P(Y=0)}</math><br />
<br />
Bayes' rule can be approached by computing either one of the following:<br />
<br />
1) '''The posterior''': <math>\ P(Y=1|X=x) </math> and <math>\ P(Y=0|X=x) </math> <br />
<br />
2) '''The likelihood''': <math>\ P(X=x|Y=1) </math> and <math>\ P(X=x|Y=0) </math><br />
<br />
<br />
The former reflects a '''Bayesian''' approach. The Bayesian approach uses previous beliefs and observed data (e.g., the random variable <math>\ X </math>) to determine the probability distribution of the parameter of interest (e.g., the random variable <math>\ Y </math>). The probability, according to Bayesians, is a ''degree of belief'' in the parameter of interest taking on a particular value (e.g., <math>\ Y=1 </math>), given a particular observation (e.g., <math>\ X=x </math>). Historically, the difficulty in this approach lies with determining the posterior distribution. However, more recent methods such as '''Markov Chain Monte Carlo (MCMC)''' allow the Bayesian approach to be implemented <ref name="PCAustin">P. C. Austin, C. D. Naylor, and J. V. Tu, "A comparison of a Bayesian vs. a frequentist method for profiling hospital performance," ''Journal of Evaluation in Clinical Practice'', 2001</ref>.<br />
<br />
The latter reflects a '''Frequentist''' approach. The Frequentist approach assumes that the probability distribution (including the mean, variance, etc.) is fixed for the parameter of interest (e.g., the variable <math>\ Y </math>, which is ''not'' random). The observed data (e.g., the random variable <math>\ X </math>) is simply a ''sampling'' of a far larger population of possible observations. Thus, a certain repeatability or ''frequency'' is expected in the observed data. If it were possible to make an infinite number of observations, then the true probability distribution of the parameter of interest can be found. In general, frequentists use a technique called '''hypothesis testing''' to compare a ''null hypothesis'' (e.g. an assumption that the mean of the probability distribution is <math>\ \mu_0 </math>) to an alternative hypothesis (e.g. assuming that the mean of the probability distribution is larger than <math>\ \mu_0 </math>) <ref name="PCAustin"/>. For more information on hypothesis testing see <ref>R. Levy, "Frequency hypothesis testing, and contingency tables" class notes for LING251, Department of Linguistics, University of California, 2007. Available: [http://idiom.ucsd.edu/~rlevy/lign251/fall2007/lecture_8.pdf http://idiom.ucsd.edu/~rlevy/lign251/fall2007/lecture_8.pdf] </ref>. <br />
<br />
There was some class discussion on which approach should be used. Both the ease of computation and the validity of both approaches were discussed. A main point that was brought up in class is that Frequentists consider X to be a random variable, but they do not consider Y to be a random variable because it has to take on one of the values from a fixed set (in the above case it would be either 0 or 1 and there is only one ''correct'' label for a given value X=x). Thus, from a Frequentist's perspective it does not make sense to talk about the probability of Y. This is actually a grey area and sometimes ''Bayesians'' and ''Frequentists'' use each others' approaches. So using ''Bayes' rule'' doesn't necessarily mean you're a ''Bayesian''. Overall, the question remains unresolved.<br />
<br />
<br />
The '''Bayes Classifier''' uses <math>\ P(Y=1|X=x)</math><br />
<br />
<math> P(Y=1|X=x) = \frac{P(X=x|Y=1) P(Y=1)} {P(X=x|Y=1) P(Y=1) + P(X=x|Y=0) P(Y=0)}</math><br />
<br />
P(Y=1) : The Prior, probability of Y taking the value chosen<br />
<br />
denominator : Equivalent to P(X=x), for all values of Y, normalizes the probability <br />
<br />
<math>h(x) = <br />
\begin{cases} <br />
1 \ \ \hat{r}(x) > 1/2 \\<br />
0 \ \ otherwise<br />
\end{cases}<br />
</math><br />
<br />
The set <math>\mathcal{D}(h) = \{ x : P(Y=1|X=x) = P(Y=0|X=x) \} </math><br />
<br />
which defines a ''decision boundary''.<br />
<br />
<math>h^*(x) = <br />
\begin{cases}<br />
1 \ \ if \ \ P(Y=1|X=x) > P(Y=0|X=x) \\<br />
0 \ \ \ \ \ \ otherwise<br />
\end{cases}<br />
</math><br />
<br />
'''Theorem''': The Bayes Classifier is optimal, i.e., if <math>h</math> is any other classification rule, <br />
then <math>L(h^*) \leq L(h)</math><br />
<br />
'''Proof''': Consider any classifier <math>h</math>. We can express the error rate as <br />
<br />
::<math> P( \{h(X) \ne Y \} ) = E_{X,Y} [ \mathbf{1}_{\{h(X) \ne Y \}} ] = E_X \left[ E_Y[ \mathbf{1}_{\{h(X) \ne Y \}}| X] \right] </math><br />
<br />
To minimize this last expression, it suffices to minimize the inner expectation. Expanding this expectation:<br />
<br />
::<math> E_Y[ \mathbf{1}_{\{h(X) \ne Y \}}| X] = \sum_{y \in Supp(Y)} P( Y = y | X) \mathbf{1}_{\{h(X) \ne y \} } </math><br />
which, in the two-class case, simplifies to<br />
<br />
::::<math> = P( Y = 0 | X) \mathbf{1}_{\{h(X) \ne 0 \} } + P( Y = 1 | X) \mathbf{1}_{\{h(X) \ne 1 \} } </math><br />
::::<math> = (1-r(X)) \mathbf{1}_{\{h(X) \ne 0 \} } + r(X)\mathbf{1}_{\{h(X) \ne 1 \} } </math><br />
<br />
where <math>r(x)</math> is defined as above. We should 'choose' h(X) to equal the label that minimizes the sum. Consider if <math>r(X)>1/2 </math>, then <math>r(X)>1-r(X)</math> so we should let <math>h(X) = 1</math> to minimize the sum. Thus the Bayes classifier is the optimal classifier. <br />
<br />
Why then do we need other classification methods? Because X densities are often/typically unknown. I.e., <math>f_k(x)</math> and/or <math>\pi_k</math> unknown.<br />
<br />
<math>P(Y=k|X=x) = \frac{P(X=x|Y=k)P(Y=k)} {P(X=x)} = \frac{f_k(x) \pi_k} {\sum_k f_k(x) \pi_k}</math><br />
<br />
<math>f_k(x)</math> is referred to as the class conditional distribution (~likelihood).<br />
<br />
Therefore, we must rely on some data to estimate these quantities.<br />
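When the class-conditional densities and priors ''are'' known, the Bayes classifier can be computed directly. A one-dimensional sketch with two Gaussian classes (the means, variance, and prior below are illustrative, not from any dataset):

```python
import math

def gauss(x, mu, sigma):
    """Univariate Gaussian density."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def r(x, pi1=0.5, mu0=0.0, mu1=2.0, sigma=1.0):
    """Posterior r(x) = P(Y=1|X=x) via Bayes' rule with known Gaussian densities."""
    num = gauss(x, mu1, sigma) * pi1
    den = num + gauss(x, mu0, sigma) * (1 - pi1)
    return num / den

def h_star(x):
    """Bayes classifier: predict 1 exactly when r(x) > 1/2."""
    return 1 if r(x) > 0.5 else 0
```

With equal priors and equal variances the posterior crosses 1/2 at the midpoint x = 1, so points left of it go to class 0 and points right of it to class 1.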
<br />
=== Three Main Approaches ===<br />
<br />
'''1. Empirical Risk Minimization''':<br />
Choose a set of classifiers H (e.g., linear, neural network) and find <math>h^* \in H</math><br />
that minimizes (some estimate of) the true error, L(h).<br />
<br />
'''2. Regression''':<br />
Find an estimate (<math>\hat{r}</math>) of function <math>r</math> and define<br />
<math>h(x) = <br />
\begin{cases} <br />
1 \ \ \hat{r}(x) > 1/2 \\<br />
0 \ \ otherwise<br />
\end{cases}<br />
</math><br />
<br />
The <math> 1/2 </math> in the expression above is a threshold set for the regression prediction output. <br />
<br />
In general ''regression'' refers to finding a continuous, real valued y. The problem here is more difficult, because of the restricted domain (y is a set of discrete label values).<br />
<br />
'''3. Density Estimation''':<br />
Estimate <math>P(X=x|Y=0)</math> from <math>X_i</math>'s for which <math>Y_i = 0</math><br />
Estimate <math>P(X=x|Y=1)</math> from <math>X_i</math>'s for which <math>Y_i = 1</math><br />
and let <math>P(Y=y) = (1/n) \sum_{i=1}^{n} I(Y_i = y)</math><br />
<br />
Define <math>\hat{r}(x) = \hat{P}(Y=1|X=x)</math> and<br />
<math>h(x) = <br />
\begin{cases} <br />
1 \ \ \hat{r}(x) > 1/2 \\<br />
0 \ \ otherwise<br />
\end{cases}<br />
</math><br />
<br />
It is possible that there may not be enough data to use ''density estimation'', but the main problem lies with high dimensional spaces, as the estimation results may have a high error rate and sometimes estimation may be infeasible. The term ''curse of dimensionality'' was coined by Bellman <ref>R. E. Bellman, ''Dynamic Programming''. Princeton University Press,<br />
1957</ref> to describe this problem.<br />
<br />
As the dimension of the space goes up, the data points required for learning increases exponentially.<br />
<br />
To learn more about methods for handling high-dimensional data see <ref> https://docs.google.com/viewer?url=http%3A%2F%2Fwww.bios.unc.edu%2F~dzeng%2FBIOS740%2Flecture_notes.pdf</ref><br />
<br />
The third approach is the simplest.<br />
<br />
=== Multi-Class Classification ===<br />
Generalize to case Y takes on k>2 values.<br />
<br />
<br />
''Theorem'': <math>Y \in \mathcal{Y} = \{1,2,..., k\} </math> optimal rule<br />
<br />
<math>\ h^{*}(x) = argmax_k P(Y=k|X=x) </math> <br />
<br />
where <math>P(Y=k|X=x) = \frac{f_k(x) \pi_k} {\sum_r f_r(x) \pi_r}</math><br />
<br />
===Examples of Classification===<br />
<br />
* Face detection in images.<br />
* Medical diagnosis.<br />
* Detecting credit card fraud (fraudulent or legitimate).<br />
* Speech recognition.<br />
* Handwriting recognition.<br />
<br />
There are also some interesting reads on Bayes Classification:<br />
* http://esto.nasa.gov/conferences/estc2004/papers/b8p4.pdf (NASA)<br />
* http://www.cmla.ens-cachan.fr/fileadmin/Membres/vachier/Garcia6812.pdf (application to medical images)<br />
* http://www.springerlink.com/content/g221vh5m6744362r/ (Journal of Medical Systems)<br />
<br />
== LDA and QDA ==<br />
<br />
'''Discriminant function analysis''' finds features that best allow discrimination between two or more classes. The approach is similar to '''analysis of Variance (ANOVA)''' in that discriminant function analysis looks at the mean values to determine if two or more classes are very different and should be separated. Once the discriminant functions (that separate two or more classes) have been determined, new data points can be classified (i.e. placed in one of the classes) based on the discriminant functions <ref> StatSoft, Inc. (2011). ''Electronic Statistics Textbook.'' [Online]. Available: [http://www.statsoft.com/textbook/discriminant-function-analysis/ http://www.statsoft.com/textbook/discriminant-function-analysis/.] </ref>. '''Linear discriminant analysis (LDA)''' and '''Quadratic discriminant analysis (QDA)''' are methods of discriminant analysis that are best applied to linearly and quadratically separable classes, respectively. '''Fisher discriminant analysis (FDA)''' is another method of discriminant analysis that is different from linear discriminant analysis, but oftentimes both terms are used interchangeably.<br />
<br />
=== LDA ===<br />
<br />
The simplest method is to use approach 3 (above) and assume a parametric model for densities. Assume class conditional is Gaussian.<br />
<br />
<math>\mathcal{Y} = \{ 0,1 \}</math> assumed (i.e., 2 labels)<br />
<br />
<math>h(x) = <br />
\begin{cases} <br />
1 \ \ P(Y=1|X=x) > P(Y=0|X=x) \\<br />
0 \ \ otherwise<br />
\end{cases}<br />
</math><br />
<br />
<math>P(Y=1|X=x) = \frac{f_1(x) \pi_1} {\sum_k f_k \pi_k} \ \ </math> (denom = P(x))<br />
<br />
1) Assume Gaussian distributions<br />
<br />
<math>f_k(x) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} \text{exp}\big(-\frac{1}{2}(\mathbf{x-\mu_k})^\top \Sigma_k^{-1}(\mathbf{x-\mu_k})\big)</math><br />
<br />
must compare <br />
<math>\frac{f_1(x) \pi_1} {p(x)}</math> with <math>\frac{f_0(x) \pi_0} {p(x)}</math><br />
Note that the p(x) denom can be ignored:<br />
<math>f_1(x) \pi_1</math> with <math>f_0(x) \pi_0 </math><br />
<br />
To find the decision boundary, set <br />
<math>f_1(x) \pi_1 = f_0(x) \pi_0 </math><br />
<br />
<math> \frac{1}{(2\pi)^{d/2} |\Sigma_1|^{1/2}} exp(-\frac{1}{2}(\mathbf{x - \mu_1})^\top \Sigma_1^{-1}(\mathbf{x-\mu_1}) )\pi_1 = \frac{1}{(2\pi)^{d/2} |\Sigma_0|^{1/2}} exp(-\frac{1}{2}(\mathbf{x -\mu_0})^\top \Sigma_0^{-1}(\mathbf{x-\mu_0}) )\pi_0</math><br />
<br />
2) Assume <math>\Sigma_1 = \Sigma_0</math>, we can use <math>\Sigma = \Sigma_0 = \Sigma_1</math>.<br />
<br />
<math> \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} exp(-\frac{1}{2}(\mathbf{x -\mu_1})^\top \Sigma^{-1}(\mathbf{x-\mu_1}) )\pi_1 = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} exp(-\frac{1}{2}(\mathbf{x- \mu_0})^\top \Sigma^{-1}(\mathbf{x-\mu_0}) )\pi_0</math><br />
<br />
3) Cancel <math>(2\pi)^{-d/2} |\Sigma|^{-1/2}</math> from both sides.<br />
<br />
<br />
<math> exp(-\frac{1}{2}(\mathbf{x - \mu_1})^\top \Sigma^{-1}(\mathbf{x-\mu_1}) )\pi_1 = exp(-\frac{1}{2}(\mathbf{x - \mu_0})^\top \Sigma^{-1}(\mathbf{x-\mu_0}) )\pi_0</math><br />
<br />
4) Take log of both sides.<br />
<br />
<math> -\frac{1}{2}(\mathbf{x - \mu_1})^\top \Sigma^{-1}(\mathbf{x-\mu_1}) + \text{log}(\pi_1) = -\frac{1}{2}(\mathbf{x - \mu_0})^\top \Sigma^{-1}(\mathbf{x-\mu_0}) + \text{log}(\pi_0)</math><br />
<br />
5) Subtract one side from both sides, leaving zero on one side.<br />
<br />
<br />
<math>-\frac{1}{2}(\mathbf{x - \mu_1})^T \Sigma^{-1} (\mathbf{x-\mu_1}) + \text{log}(\pi_1) - [-\frac{1}{2}(\mathbf{x - \mu_0})^T \Sigma^{-1} (\mathbf{x-\mu_0}) + \text{log}(\pi_0)] = 0 </math><br />
<br />
<br />
<math>\frac{1}{2}[-\mathbf{x}^\top \Sigma^{-1}\mathbf{x} - \mathbf{\mu_1}^\top \Sigma^{-1} \mathbf{\mu_1} + 2\mathbf{\mu_1}^\top \Sigma^{-1} \mathbf{x}<br />
+ \mathbf{x}^\top \Sigma^{-1}\mathbf{x} + \mathbf{\mu_0}^\top \Sigma^{-1} \mathbf{\mu_0} - 2\mathbf{\mu_0}^\top \Sigma^{-1} \mathbf{x} ]<br />
+ \text{log}(\frac{\pi_1}{\pi_0}) = 0 </math><br />
<br />
<br />
Cancelling out the terms quadratic in <math>\mathbf{x}</math> and rearranging results in <br />
<br />
<math>\frac{1}{2}[-\mathbf{\mu_1}^T \Sigma^{-1} \mathbf{\mu_1} + \mathbf{\mu_0}^T \Sigma^{-1} \mathbf{\mu_0}<br />
+ (2\mathbf{\mu_1}^T \Sigma^{-1} - 2\mathbf{\mu_0}^T \Sigma^{-1}) \mathbf{x}]<br />
+ \text{log}(\frac{\pi_1}{\pi_0}) = 0 </math><br />
<br />
<br />
We can see that the first pair of terms is constant, and the second pair is linear in x.<br />
Therefore, we end up with something of the form <br />
<math>\mathbf{a}^\top\mathbf{x} + b = 0</math>.<br />
For more about LDA <ref>http://sites.stat.psu.edu/~jiali/course/stat597e/notes2/lda.pdf</ref><br />
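As a concrete check that the boundary is linear, in one dimension the boundary can be solved in closed form: setting the two discriminants equal for equal variances <math>\sigma^2</math> gives <math>x^* = \frac{\mu_0+\mu_1}{2} + \frac{\sigma^2 \log(\pi_0/\pi_1)}{\mu_1-\mu_0}</math>. A sketch (the parameter values in the usage note are illustrative):

```python
import math

def lda_boundary_1d(mu0, mu1, sigma2, pi0, pi1):
    """Solve the 1-D, equal-variance LDA boundary: the x where both classes
    have equal posterior probability (midpoint, shifted by the log prior ratio)."""
    return (mu0 + mu1) / 2 + sigma2 * math.log(pi0 / pi1) / (mu1 - mu0)
```

With equal priors the boundary sits exactly at the midpoint of the two means; making class 0 more probable a priori pushes the boundary toward class 1.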
<br />
== LDA and QDA Continued (Lecture: Sep. 22, 2011) == <br />
<br />
If we relax assumption 2 (i.e. <math>\Sigma_1 \neq \Sigma_0</math>) then we get a quadratic equation that can be written as<br />
<math>\mathbf{x}^\top \mathbf{A}\mathbf{x} + \mathbf{b}^\top\mathbf{x} + c = 0</math><br />
<br />
===Generalizing LDA and QDA===<br />
<br />
'''Theorem''': <br />
<br />
<br />
Suppose that <math>\,Y \in \{1,\dots,K\}</math> and that each class-conditional density <math>\,f_k(\mathbf{x}) = Pr(X=\mathbf{x}|Y=k)</math> is Gaussian. Then the Bayes Classifier is<br />
:<math>\,h^*(\mathbf{x}) = \arg\max_{k} \delta_k(\mathbf{x})</math><br />
<br />
Where<br />
<br />
<math> \,\delta_k(\mathbf{x}) = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_k)^\top\Sigma_k^{-1}(\mathbf{x}-\boldsymbol{\mu}_k) + log (\pi_k) </math><br />
<br />
When the Gaussian covariances are equal, i.e. <math>\Sigma_1 = \Sigma_0</math> (the LDA case), this simplifies to<br />
<br />
<math> \,\delta_k(\mathbf{x}) = \mathbf{x}^\top\Sigma^{-1}\boldsymbol{\mu}_k - \frac{1}{2}\boldsymbol{\mu}_k^\top\Sigma^{-1}\boldsymbol{\mu}_k + log (\pi_k) </math><br />
<br />
(To compute this, we need to calculate the value of <math>\,\delta </math> for each class, and then take the one with the max. value).<br />
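As a rough sketch of this rule, the LDA discriminant <math>\,\delta_k</math> and the argmax over classes can be computed as follows (a minimal NumPy illustration; the means, shared covariance, and priors below are arbitrary toy values, not estimates from real data):<br />

```python
import numpy as np

def lda_delta(x, mu, Sigma_inv, pi):
    # delta_k(x) = x^T Sigma^{-1} mu_k - (1/2) mu_k^T Sigma^{-1} mu_k + log(pi_k)
    return x @ Sigma_inv @ mu - 0.5 * mu @ Sigma_inv @ mu + np.log(pi)

def classify(x, mus, Sigma, pis):
    # evaluate delta_k for each class and assign the class with the max value
    Sigma_inv = np.linalg.inv(Sigma)
    deltas = [lda_delta(x, mu, Sigma_inv, pi) for mu, pi in zip(mus, pis)]
    return int(np.argmax(deltas))

# toy two-class setting: shared identity covariance, equal priors
mus = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
Sigma = np.eye(2)
pis = [0.5, 0.5]
print(classify(np.array([0.5, 0.2]), mus, Sigma, pis))  # 0 (near mu_0)
print(classify(np.array([3.5, 4.1]), mus, Sigma, pis))  # 1 (near mu_1)
```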
<br />
===In practice===<br />
We estimate the prior as the proportion of items in the collection that belong to class k, i.e.<br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
the mean as the average of the items in class k, i.e.<br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
and the covariance of each class as<br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
If we wish to use LDA we must calculate a common covariance, so we take a weighted average of the class covariances, i.e.<br />
<br />
<math>\,\Sigma=\frac{\sum_{r=1}^{k}(n_r\Sigma_r)}{\sum_{r=1}^{k}n_r} </math><br />
<br />
Where: <math>\,n_r</math> is the number of data points in class <math>\,r</math>, <math>\,\Sigma_r</math> is the covariance of class <math>\,r</math>, <math>\,n</math> is the total number of data points, and <math>\,k</math> is the number of classes.<br />
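These estimates can be sketched in code as follows (a minimal NumPy illustration; the synthetic two-class data set is a toy assumption used only to exercise the formulas):<br />

```python
import numpy as np

def estimate_parameters(X, y, K):
    """Estimate priors, class means, class covariances, and the pooled
    covariance from labelled data. X is n x d, y holds labels 0..K-1."""
    n, d = X.shape
    pis, mus, Sigmas, counts = [], [], [], []
    for k in range(K):
        Xk = X[y == k]
        nk = Xk.shape[0]
        mu_k = Xk.mean(axis=0)
        diff = Xk - mu_k
        pis.append(nk / n)                 # prior: n_k / n
        mus.append(mu_k)                   # class mean
        Sigmas.append(diff.T @ diff / nk)  # class covariance (MLE, divide by n_k)
        counts.append(nk)
    # pooled covariance for LDA: weighted average of the class covariances
    Sigma = sum(nk * Sk for nk, Sk in zip(counts, Sigmas)) / n
    return pis, mus, Sigmas, Sigma

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(30, 2)),
               rng.normal(5.0, 1.0, size=(70, 2))])
y = np.array([0] * 30 + [1] * 70)
pis, mus, Sigmas, Sigma = estimate_parameters(X, y, 2)
print(pis)  # [0.3, 0.7]
```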
<br />
===Computation===<br />
<br />
For QDA we need to calculate: <math> \,\delta_k(\mathbf{x}) = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_k)^\top\Sigma_k^{-1}(\mathbf{x}-\boldsymbol{\mu}_k) + log (\pi_k) </math><br />
<br />
Let's first consider the case when <math>\, \Sigma_k = I, \forall k </math>. This is the case where each class distribution is spherical around its mean point.<br />
<br />
====Case 1====<br />
When <math>\, \Sigma_k = I </math><br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_k)^\top I(\mathbf{x}-\boldsymbol{\mu}_k) + log (\pi_k) </math><br />
<br />
but <math>\ \log(|I|)=\log(1)=0 </math><br />
<br />
and <math>\, (\mathbf{x}-\boldsymbol{\mu}_k)^\top I(\mathbf{x}-\boldsymbol{\mu}_k) = (\mathbf{x}-\boldsymbol{\mu}_k)^\top(\mathbf{x}-\boldsymbol{\mu}_k) </math> is the [http://en.wikipedia.org/wiki/Euclidean_distance#Squared_Euclidean_Distance squared Euclidean distance] between two points <math>\,\mathbf{x}</math> and <math>\,\boldsymbol{\mu}_k</math><br />
<br />
Thus in this condition, a new point can be classified by its distance away from the center of a class, adjusted by some prior.<br />
<br />
Further, for a two-class problem with equal priors, the decision boundary is the perpendicular bisector of the line segment joining the two class means.<br />
<br />
====Case 2==== <br />
When <math>\, \Sigma_k \neq I </math><br />
<br />
<br />
Using the [[Singular Value Decomposition(SVD) | Singular Value Decomposition (SVD)]] of <math>\, \Sigma_k</math><br />
we get <math> \, \Sigma_k = U_kS_kV_k^\top</math>. In particular, <math>\, U_k</math> is a collection of eigenvectors of <math>\, \Sigma_k\Sigma_k^*</math>, and <math>\, V_k</math> is a collection of eigenvectors of <math>\,\Sigma_k^*\Sigma_k</math>.<br />
Since <math>\, \Sigma_k</math> is a symmetric matrix<ref> http://en.wikipedia.org/wiki/Covariance_matrix#Properties </ref>, <math>\, \Sigma_k = \Sigma_k^*</math>, so we have <math> \, \Sigma_k = U_kS_kU_k^\top </math>.<br />
<br />
For <math>\,\delta_k</math>, the second term becomes what is also known as the Mahalanobis distance <ref>P. C. Mahalanobis, "On The Generalised Distance in Statistics," ''Proceedings of the National Institute of Sciences of India'', 1936</ref> :<br />
<br />
:<math>\begin{align}<br />
(\mathbf{x}-\boldsymbol{\mu}_k)^\top\Sigma_k^{-1}(\mathbf{x}-\boldsymbol{\mu}_k)&= (\mathbf{x}-\boldsymbol{\mu}_k)^\top U_kS_k^{-1}U_k^T(\mathbf{x}-\boldsymbol{\mu}_k)\\<br />
& = (U_k^\top \mathbf{x}-U_k^\top\boldsymbol{\mu}_k)^\top S_k^{-1}(U_k^\top \mathbf{x}-U_k^\top \boldsymbol{\mu}_k)\\<br />
& = (U_k^\top \mathbf{x}-U_k^\top\boldsymbol{\mu}_k)^\top S_k^{-\frac{1}{2}}S_k^{-\frac{1}{2}}(U_k^\top \mathbf{x}-U_k^\top\boldsymbol{\mu}_k) \\<br />
& = (S_k^{-\frac{1}{2}}U_k^\top \mathbf{x}-S_k^{-\frac{1}{2}}U_k^\top\boldsymbol{\mu}_k)^\top I(S_k^{-\frac{1}{2}}U_k^\top \mathbf{x}-S_k^{-\frac{1}{2}}U_k^\top \boldsymbol{\mu}_k) \\<br />
& = (S_k^{-\frac{1}{2}}U_k^\top \mathbf{x}-S_k^{-\frac{1}{2}}U_k^\top\boldsymbol{\mu}_k)^\top(S_k^{-\frac{1}{2}}U_k^\top \mathbf{x}-S_k^{-\frac{1}{2}}U_k^\top \boldsymbol{\mu}_k) \\<br />
\end{align}<br />
</math><br />
<br />
We can think of <math> \, S_k^{-\frac{1}{2}}U_k^\top </math> as a linear transformation that takes points in class <math>\,k</math> and distributes them spherically around a point, as in Case 1. Thus when we are given a new point, we can apply the modified <math>\,\delta_k</math> values to calculate <math>\ h^*(\,x)</math>. After applying this transformation, the covariance is effectively an identity matrix, such that<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}[(S_k^{-\frac{1}{2}}U_k^\top \mathbf{x}-S_k^{-\frac{1}{2}}U_k^\top\boldsymbol{\mu}_k)^\top(S_k^{-\frac{1}{2}}U_k^\top \mathbf{x}-S_k^{-\frac{1}{2}}U_k^\top \boldsymbol{\mu}_k)] + log (\pi_k) </math><br />
<br />
and,<br />
<br />
<math>\ \log(|I|)=\log(1)=0 </math><br />
<br />
For applying the above method with classes that have different covariance matrices (for example the covariance matrices <math>\ \Sigma_0 </math> and <math>\ \Sigma_1 </math> for the two class case), each of the covariance matrices has to be decomposed using SVD to find the according transformation. Then, each new data point has to be transformed using each transformation to compare its distance to the mean of each class (for example for the two class case, the new data point would have to be transformed by the class 1 transformation and then compared to <math>\ \mu_0 </math> and the new data point would also have to be transformed by the class 2 transformation and then compared to <math>\ \mu_1 </math>).<br />
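The equivalence above — that the Mahalanobis distance with respect to <math>\,\Sigma</math> equals the squared Euclidean distance after the whitening transformation <math> \, S^{-\frac{1}{2}}U^\top </math> — can be checked numerically (a minimal NumPy sketch; the covariance matrix and points are arbitrary toy values):<br />

```python
import numpy as np

# an arbitrary symmetric positive-definite covariance matrix (toy values)
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
x = np.array([1.0, 2.0])
mu = np.array([0.0, 0.5])

# Mahalanobis distance computed directly with Sigma^{-1}
d2_direct = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)

# decomposition Sigma = U S U^T (SVD of a symmetric matrix)
U, S, _ = np.linalg.svd(Sigma)
W = np.diag(S ** -0.5) @ U.T      # the whitening transformation S^{-1/2} U^T

# squared Euclidean distance between the whitened points
d2_whitened = np.sum((W @ x - W @ mu) ** 2)

print(np.isclose(d2_direct, d2_whitened))  # True
```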
<br />
<br />
The difference between [[#Case 1 | Case 1]] and [[#Case 2 | Case 2]] (i.e. the difference between using the Euclidean and Mahalanobis distance) can be seen in the illustration below. <br />
<br />
[[File:EuclideanVsMahalonobisDistance2.PNG|frame|center|Illustration of Euclidean distance (a) and Mahalanobis distance (b) where the contours represent equidistant points from the center using each distance metric. Source: <ref>R. De Maesschalck, D. Jouan-Rimbaud and D. L. Massart, "Tutorial - The Mahalanobis distance," ''Chemometrics and Intelligent Laboratory Systems'', 2000 </ref>]]<br />
<br />
As can be seen from the illustration above, the Mahalanobis distance takes into account the distribution of the data points, whereas the Euclidean distance would treat the data as though it has a spherical distribution. Thus, the Mahalanobis distance applies for the more general classification in [[#Case 2 | Case 2]], whereas the Euclidean distance applies to the special case in [[#Case 1 | Case 1]] where the data distribution is assumed to be spherical.<br />
<br />
Generally, QDA provides a more flexible classifier than LDA because LDA assumes that the covariance matrix is identical for each class, whereas QDA does not. However, QDA still uses a Gaussian distribution as the class-conditional distribution. In real life this assumption does not always hold, so other distributions may have to be used instead.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate some parameters. Here is a comparison between the number of parameters needed to be estimated for LDA and QDA:<br />
<br />
LDA: Since we just need to compare one given class against the remaining <math>\,K-1</math> classes, there are <math>\,K-1</math> differences in total. For each of them, <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters. Thus QDA suffers far more from the curse of dimensionality.<br />
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
<br />
== Trick: Using LDA to do QDA ==<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
In this approach the feature vector is augmented with quadratic terms (i.e. new dimensions are introduced, onto which the original data are projected). We then apply LDA to the new higher-dimensional data.<br />
<br />
The motivation behind this approach is to take advantage of the fact that fewer parameters have to be estimated in LDA, as explained in previous sections, and therefore to have a more robust system in situations where we have fewer data points.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; the extra <math>\,\frac{d(d+1)}{2}</math> estimations make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we have a quadratic function to estimate: <math>g(\mathbf{x}) = y = \mathbf{x}^T\mathbf{v}\mathbf{x} + \mathbf{w}^T\mathbf{x}</math>.<br />
<br />
Using this trick, we introduce two new vectors, <math>\,\hat{\mathbf{w}}</math> and <math>\,\hat{\mathbf{x}}</math> such that:<br />
<br />
<math>\hat{\mathbf{w}} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]^T</math><br />
<br />
and<br />
<br />
<math>\hat{\mathbf{x}} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]^T</math><br />
<br />
We can then apply LDA to estimate the new function: <math>\hat{g}(\mathbf{x},\mathbf{x}^2) = \hat{y} =\hat{\mathbf{w}}^T\hat{\mathbf{x}}</math>.<br />
<br />
Note that we can do this for any <math>\, x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix with the original matrix squared, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a <math>\,sin(x)</math> dimension. Note that we are not applying QDA, but instead extending LDA to calculate a non-linear boundary, which will in general differ from the QDA boundary. This algorithm is called nonlinear LDA.<br />
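The feature augmentation step of this trick can be sketched as follows (a minimal NumPy illustration; the tiny matrix is a toy example):<br />

```python
import numpy as np

def augment(X):
    """Append squared features: row [x_1, ..., x_d] becomes
    [x_1, ..., x_d, x_1^2, ..., x_d^2]."""
    return np.hstack([X, X ** 2])

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
X_hat = augment(X)
print(X_hat)
# [[ 1.  2.  1.  4.]
#  [ 3.  4.  9. 16.]]
```

LDA is then run on the augmented matrix exactly as before; since the resulting boundary is linear in the augmented coordinates, it is quadratic in the original ones.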
<br />
== Principal Component Analysis (PCA) (Lecture: Sep. 27, 2011) ==<br />
<br />
'''Principal Component Analysis (PCA)''' is a method of dimensionality reduction/feature extraction that transforms the data from a D dimensional space into a new coordinate system of dimension d, where d <= D (the worst case would be to have d=D). The goal is to preserve as much of the variance in the original data as possible when switching coordinate systems. Given data on D variables, the hope is that the data points will lie mainly in a linear subspace of dimension lower than D. In practice, the data will usually not lie precisely in some lower dimensional subspace.<br />
<br />
<br />
The new variables that form a new coordinate system are called '''principal components''' (PCs). PCs are denoted by <math>\ \mathbf{u}_1, \mathbf{u}_2, ... , \mathbf{u}_D </math>. The principal components form a basis for the data. Since PCs are orthogonal linear transformations of the original variables, there are at most D PCs. Normally, not all of the D PCs are used but rather a subset of d PCs, <math>\ \mathbf{u}_1, \mathbf{u}_2, ... , \mathbf{u}_d </math>, to approximate the space spanned by the original data points <math>\ \mathbf{x}=[x_1, x_2, ... , x_D]^T </math>. We can choose d based on what percentage of the variance of the original data we would like to maintain. <br />
<br />
The first PC, <math>\ \mathbf{u}_1 </math>, has the maximum variance and thus accounts for the most significant variation in the data. The second PC, <math>\ \mathbf{u}_2 </math>, has the second highest variance, and so on, down to the last PC, <math>\ \mathbf{u}_D </math>, which has the minimum variance.<br />
<br />
Let <math>u_i = \mathbf{w}^T\mathbf{x_i}</math> be the projection of the data point <math>\mathbf{x_i}</math> on the direction of '''w''' if '''w''' is of length one.<br />
<br />
<br />
<math>\mathbf{u = (u_1,....,u_D)^T}\qquad</math> , <math>\quad\mathbf{w^Tw = 1 }</math><br />
<br />
<br />
<math>var(u) =\mathbf{w}^T X (\mathbf{w}^T X)^T = \mathbf{w}^T X X^T\mathbf{w} = \mathbf{w}^TS\mathbf{w} \quad </math> <br />
Where <math>\quad X X^T = S </math> is the sample covariance matrix (assuming the data have been centered; the scaling factor <math>\frac{1}{n-1}</math> is omitted since it does not affect the maximizing direction).<br />
<br />
<br />
<br />
We would like to find the <math>\ \mathbf{w} </math> which gives us maximum variation:<br />
<br />
<math>\ \max (Var(\mathbf{w}^T \mathbf{x})) = \max (\mathbf{w}^T S \mathbf{w}) </math> <br />
<br />
<br />
Note: we require the constraint <math>\ \mathbf{w}^T \mathbf{w} = 1 </math> because without a constraint on the length of <math>\ \mathbf{w} </math> the variance has no upper bound. With the constraint, we find the direction, rather than the length, that maximizes the variance. <br />
<br />
<br />
====Lagrange Multiplier====<br />
<br />
Before we proceed, we should review Lagrange multipliers.<br />
<br />
[[Image:LagrangeMultipliers2D.svg.png|right|thumb|200px|"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]]]<br />
<br />
<br />
Lagrange multipliers are used to find the maximum or minimum of a function <math>\displaystyle f(x,y)</math> subject to the constraint <math>\displaystyle g(x,y)=0</math>.<br />
<br />
We define a new constant <math> \lambda</math> called a [http://en.wikipedia.org/wiki/Lagrange_multipliers Lagrange Multiplier] and we form the Lagrangian,<br /><br /><br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br /><br /><br />
If <math>\displaystyle f(x^*,y^*)</math> is the max of <math>\displaystyle f(x,y)</math>, there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of <math>\displaystyle L</math> (partial derivatives are 0).<br />
<br>In addition <math>\displaystyle (x^*,y^*)</math> is a point in which functions <math>\displaystyle f</math> and <math>\displaystyle g</math> touch but do not cross. At this point, the tangents of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel or gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel, such that:<br />
<br /><br /><br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where,<br /> <math>\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow</math> the gradient of <math>\, f</math> <br />
<br><br />
<math>\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow</math> the gradient of <math>\, g </math> <br />
<br><br /><br />
<br />
====Example :====<br />
Suppose we want to maximize the function <math>\displaystyle f(x,y)=x-y</math> subject to the constraint <math>\displaystyle x^{2}+y^{2}=1</math>. We can apply the Lagrange multiplier method to find the maximum value for the function <math>\displaystyle f </math>; the Lagrangian is:<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math><br />
<br />
We want the partial derivatives equal to zero:<br />
<br />
<br /><br />
<math>\displaystyle \frac{\partial L}{\partial x}=1-2 \lambda x=0 </math> <br /><br />
<br /> <math>\displaystyle \frac{\partial L}{\partial y}=-1-2\lambda y=0</math><br />
<br> <br /><br />
<math>\displaystyle \frac{\partial L}{\partial \lambda}=-(x^2+y^2-1)=0</math><br />
<br><br /><br />
<br />
Solving the system we obtain two stationary points: <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math> and <math>\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)</math>. In order to determine which one is the maximum, we just need to substitute each into <math>\displaystyle f(x,y)</math> and see which one has the bigger value. In this case the maximum is at <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>.<br />
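This result can be checked numerically by evaluating <math>\displaystyle f</math> on a fine grid of points on the unit circle (a minimal NumPy sketch):<br />

```python
import numpy as np

# evaluate f(x, y) = x - y on a fine grid of points satisfying x^2 + y^2 = 1
theta = np.linspace(0.0, 2.0 * np.pi, 100000)
x, y = np.cos(theta), np.sin(theta)
f = x - y

i = np.argmax(f)
print(x[i], y[i])   # approximately (sqrt(2)/2, -sqrt(2)/2)
print(f[i])         # approximately sqrt(2), the maximum value
```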
<br />
===Determining w :===<br />
<br />
Use the Lagrange multiplier conversion to obtain:<br />
<math>\displaystyle L(\mathbf{w}, \lambda) = \mathbf{w}^T S\mathbf{w} - \lambda (\mathbf{w}^T \mathbf{w} - 1)</math> where <math>\displaystyle \lambda </math> is a constant <br />
<br />
Take the derivative and set it to zero:<br />
<math>\displaystyle{\partial L \over{\partial \mathbf{w}}} = 0 </math><br />
<br />
<br />
To obtain: <br />
<math>\displaystyle 2S\mathbf{w} - 2 \lambda \mathbf{w} = 0</math><br />
<br />
<br />
Rearrange to obtain:<br />
<math>\displaystyle S\mathbf{w} = \lambda \mathbf{w}</math><br />
<br />
<br />
where <math>\displaystyle \mathbf{w}</math> is an eigenvector of <math>\displaystyle S </math> and <math>\ \lambda </math> is the corresponding eigenvalue. Since <math>\displaystyle S\mathbf{w}= \lambda \mathbf{w} </math> and <math>\displaystyle \mathbf{w}^T \mathbf{w}=1</math>, we can write<br />
<br />
<math>\displaystyle \mathbf{w}^T S\mathbf{w}= \mathbf{w}^T\lambda \mathbf{w}= \lambda \mathbf{w}^T \mathbf{w} =\lambda </math> <br />
<br />
Note that the PCs decompose the total variance in the data in the following way :<br />
<br />
<math> \sum_{i=1}^{D} Var(u_i) </math><br />
<br />
<math>= \sum_{i=1}^{D} (\lambda_i) </math> <br />
<br />
<math>\ = Tr(S) </math> (the sum of the eigenvalues of a matrix equals its trace; <math>\ S </math> is a covariance matrix and therefore symmetric)<br />
<br />
<math>= \sum_{i=1}^{D} Var(x_i)</math><br />
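This decomposition — the sum of the eigenvalues of the sample covariance matrix equals its trace, i.e. the total variance — can be verified numerically (a minimal NumPy sketch with synthetic data):<br />

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 200))               # 5 variables, 200 samples (toy data)
Xc = X - X.mean(axis=1, keepdims=True)      # center each variable
S = Xc @ Xc.T / (Xc.shape[1] - 1)           # sample covariance matrix (5 x 5)

eigvals = np.linalg.eigvalsh(S)             # eigenvalues of the symmetric matrix S
print(np.isclose(eigvals.sum(), np.trace(S)))  # True
```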
<br />
== Principal Component Analysis (PCA) Continued (Lecture: Sep. 29, 2011) == <br />
As can be seen from the above expressions, <math>\ Var(\mathbf{w}^\top \mathbf{x}) = \mathbf{w}^\top S \mathbf{w}= \lambda </math> where <math>\ \lambda </math> is an eigenvalue of the sample covariance matrix <math>\ S </math> and <math>\ \mathbf{w}</math> is its corresponding eigenvector. So <math>\ Var(u_i) </math> is maximized if <math>\ \lambda_i </math> is the maximum eigenvalue of <math>\ S </math> and the first principal component (PC) is the corresponding eigenvector. Each successive PC can be generated in the above manner by taking the eigenvectors of <math>\ S</math><ref>www.wikipedia.org/wiki/Eigenvalues_and_eigenvectors</ref> that correspond to the eigenvalues:<br />
<br />
<math>\ \lambda_1 \geq ... \geq \lambda_D </math> <br />
<br />
such that <br />
<br />
<math>\ Var(u_1) \geq ... \geq Var(u_D) </math><br />
<br />
=== Alternative Derivation ===<br />
Another way of looking at PCA is to consider it as a projection from a higher D-dimensional space to a lower d-dimensional subspace that minimizes the squared ''reconstruction error''. The squared reconstruction error is the difference between the original data set <math>\ X </math> and the new data set <math> \hat{X} </math> obtained by first projecting the original data set into a lower d-dimensional subspace and then projecting it back into the original higher D-dimensional space. Since information is (normally) lost by compressing the original data into a lower d-dimensional subspace, the new data set will (normally) differ from the original data even though both are part of the higher D-dimensional space. The reconstruction error is computed as shown below.<br />
<br />
====Reconstruction Error====<br />
<br />
<math> e = \sum_{i=1}^{n} || x_i - \hat{x}_i ||^2 </math><br />
<br />
====Minimize Reconstruction Error====<br />
<br />
Assume the data are centered, i.e. <math> \bar{x} = 0 </math>; otherwise replace each <math>\ x_i </math> by <math>\ x_i - \bar{x} </math>.<br />
<br />
Let <math>\ f(y) = U_d y </math> where <math>\ U_d </math> is a D by d matrix with d orthogonal unit vectors as columns.<br />
<br />
Fit the model to the data and minimize the reconstruction error:<br />
<br />
<math>\ min_{U_d, y_i} \sum_{i=1}^n || x_i - U_d y_i ||^2 </math><br />
<br />
Differentiate with respect to <math>\ y_i </math>:<br />
<br />
<math> \frac{\partial e}{\partial y_i} = 0 </math><br />
<br />
we can rewrite the reconstruction error as : <math>\ e = \sum_{i=1}^n(x_i - U_d y_i)^T(x_i - U_d y_i) </math><br />
<br />
<math>\ \frac{\partial e}{\partial y_i} = -2U_d^T(x_i - U_d y_i) = 0 </math><br />
<br />
Since the columns of <math>\ U_d </math> are orthonormal, we have <math>\ U_d^T U_d = I </math>,<br />
<br />
so the above equation gives:<br />
<br />
<math>\ U_d^T x_i - y_i = 0 </math> or equivalently,<br />
<br />
<math>\ y_i = U_d^T x_i </math><br />
<br />
Find the orthogonal matrix <math>\ U_d </math>:<br />
<br />
<math>\ min_{U_d} \sum_{i=1}^n || x_i - U_d U_d^T x_i||^2 </math><br />
<br />
====PCA Implementation Using Singular Value Decomposition====<br />
<br />
A unique solution can be obtained by finding the [[Singular Value Decomposition(SVD) | Singular Value Decomposition (SVD)]] of <math>\ X </math>:<br />
<br />
<math>\ X = U S V^T </math><br />
<br />
For each rank d, <math>\ U_d </math> consists of the first d columns of <math>\ U </math>. Also, the covariance matrix can be expressed as follows <math>\ S = \frac{1}{n-1}\sum_{i=1}^n (x_i - \mu)(x_i - \mu)^T </math>.<br />
<br />
Simply put, by subtracting the mean of each of the data point features and then applying SVD, one can find the principal components:<br />
<br />
<math> \tilde{X} = X - \mu </math><br />
<br />
<math>\ \tilde{X} = U S V^T </math><br />
<br />
Where <math>\ X </math> is a D by n matrix of data points and the features of each data point form a column in <math>\ X </math>. Also, <math>\ \mu </math> is a D by n matrix with identical columns, each equal to the mean of the <math>\ x_i</math>'s, i.e. <math>\mu_{:,j}=\frac{1}{n}\sum_{i=1}^n x_i </math>. Note that this arrangement of data points is a convention; indeed, in Matlab and in conventional statistics, the transpose of the matrices in the above formulae is used.<br />
<br />
As the <math>\ S </math> matrix from the SVD has the eigenvalues arranged from largest to smallest, the corresponding eigenvectors in the <math>\ U </math> matrix from the SVD will be such that the first column of <math>\ U </math> is the first principal component and the second column is the second principal component and so on.<br />
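The procedure above (subtract the mean, apply SVD, project onto the first d columns of <math>\ U </math>) can be sketched outside of Matlab as well (a minimal NumPy illustration; the 3-dimensional synthetic data set is a toy assumption):<br />

```python
import numpy as np

def pca(X, d):
    """PCA of a D x n data matrix X whose columns are data points.
    Returns the d x n projected data and the D x d principal directions."""
    mu = X.mean(axis=1, keepdims=True)      # mean of the data points
    X_tilde = X - mu                        # subtract the mean from every column
    U, S, Vt = np.linalg.svd(X_tilde, full_matrices=False)
    U_d = U[:, :d]                          # first d principal components
    Y = U_d.T @ X_tilde                     # project onto the top-d subspace
    return Y, U_d

rng = np.random.default_rng(0)
# 3-dimensional toy data that varies almost entirely along one direction
t = rng.normal(size=200)
X = np.vstack([t,
               2 * t + 0.01 * rng.normal(size=200),
               0.01 * rng.normal(size=200)])
Y, U_d = pca(X, 1)
print(Y.shape)  # (1, 200)
```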
<br />
=== Examples ===<br />
<br />
Note that in the Matlab code in the examples below, the mean was not subtracted from the datapoints before performing SVD. This is what was shown in class. However, to properly perform PCA, the mean should be subtracted from the datapoints.<br />
<br />
==== Example 1 ====<br />
Consider a matrix of data points <math>\ X </math> with the dimensions 560 by 1965. 560 is the number of elements in each column. Each column is a vector representation of a 20x28 grayscale pixel image of a face (see image below) and there is a total of 1965 different images of faces. Each of the images is corrupted by noise, but the noise can be removed by projecting the data onto the first few principal components and then back into the original space, keeping as many dimensions as one likes (e.g., 2, 3, 4 or 5). The corresponding Matlab commands are shown below:<br />
[[File:FreyFaceExample.PNG|thumb|185px|An example of the face images used in [[#Example 1 | Example 1]] with noise removed. Source: <ref>S. Roweis (2011). ''Data for MATLAB.'' [Online]. Available: [http://cs.nyu.edu/~roweis/data.html http://cs.nyu.edu/~roweis/data.html.] |</ref>]]<br />
<pre style="align:left; width: 75%; padding: 2% 2%"><br />
>> % start with a 560 by 1965 matrix X that contains the data points<br />
>> load('noisy.mat');<br />
>> <br />
>> % set the colors to grayscale <br />
>> colormap gray<br />
>> <br />
>> % show image in column 10 by reshaping column 10 into a 20 by 28 matrix<br />
>> imagesc(reshape(X(:,10),20,28)')<br />
>> <br />
>> % perform SVD; if the X matrix is full rank, we will obtain 560 PCs<br />
>> [U S V] = svd(X);<br />
>> <br />
>> % project X onto the first ten principal components<br />
>> Y_pca = U(:, 1:10)'*X;<br />
>> <br />
>> % reconstruct X (project back into the original space) using only the first ten PCs<br />
>> X_hat = U(:, 1:10)*Y_pca;<br />
>> <br />
>> % show image in column 10 of X_hat which is now a 560 by 1965 matrix<br />
>> imagesc(reshape(X_hat(:,10),20,28)')<br />
</pre><br />
The reason why the noise is removed in the reconstructed image is that the noise does not create a major variation in a single direction in the original data. Hence, the first ten PCs taken from the <math>\ U </math> matrix are not in the direction of the noise. Thus, reconstructing the image using the first ten PCs removes the noise.<br />
<br />
==== Example 2 ====<br />
Consider a matrix of data points <math>\ X </math> with the dimensions 64 by 400. 64 is the number of elements in each column. Each column is a vector representation of a 8x8 grayscale pixel image of either a handwritten number ''2'' or a handwritten number ''3'' (see image below) and there are a total of 400 different images, where the first 200 images show a handwritten number ''2'' and the last 200 images show a handwritten number ''3''. <br />
[[File:Handwritten23.PNG|frame|center|An example of the handwritten number images used in [[#Example 2 | Example 2]]. Source: <ref>A. Ghodsi, "PCA" class notes for STAT841, Department of Statistics and Actuarial Science, University of Waterloo, 2011. </ref>]]<br />
<br />
The corresponding Matlab commands for performing PCA on the data points are shown below:<br />
<pre><br />
>> % start with a 64 by 400 matrix X that contains the data points<br />
>> load 2_3.mat;<br />
>> <br />
>> % set the colors to grayscale <br />
>> colormap gray<br />
>> <br />
>> % show image in column 2 by reshaping column 2 into a 8 by 8 matrix<br />
>> imagesc(reshape(X(:,2),8,8))<br />
>> <br />
>> % perform SVD; if the X matrix is full rank, we will obtain 64 PCs<br />
>> [U S V] = svd(X);<br />
>> <br />
>> % project data down onto the first two PCs<br />
>> Y = U(:,1:2)'*X;<br />
>> <br />
>> % show Y as an image (can see the change in the first PC at column 200,<br />
>> % when the handwritten number changes from 2 to 3)<br />
>> imagesc(Y)<br />
>> <br />
>> % perform PCA using the Matlab built-in function (do not use for assignment)<br />
>> % also note that due to the Matlab convention, the transpose of X is used<br />
>> [COEFF, Y] = princomp(X');<br />
>> <br />
>> % again, use the first two PCs<br />
>> Y = Y(:,1:2);<br />
>> <br />
>> % use plot digits to show the distribution of images on the first two PCs<br />
>> images = reshape(X, 8, 8, 400);<br />
>> plotdigits(images, Y, .1, 1);<br />
</pre><br />
The ''plotdigits'' function in Matlab clearly illustrates that the first PC captured the differences between the numbers ''2'' and ''3'', as they are projected onto different regions of the axis for the first PC. Also, the second PC captured the ''tilt'' of the handwritten numbers, as numbers tilted to the left or right were projected onto different regions of the axis for the second PC.<br />
<br />
==== Example 3 ====<br />
(Not discussed in class) In the news recently was a story that captures some of the ideas behind PCA. Over the past two years, Scott Golder and Michael Macy, researchers from Cornell University, collected 509 million Twitter messages from 2.4 million users in 84 different countries. The data they used were words collected at various times of day and they classified the data into two different categories: positive emotion words and negative emotion words. Then, they were able to study this new data to evaluate subjects' moods at different times of day, while the subjects were in different parts of the world. They found that the subjects generally exhibited positive emotions in the mornings and late evenings, and negative emotions mid-day. They were able to "project their data onto a smaller dimensional space" using PCA. Their paper, "Diurnal and Seasonal Mood Vary with Work, Sleep, and Daylength Across Diverse Cultures," is available in the journal Science.<ref>http://www.pcworld.com/article/240831/twitter_analysis_reveals_global_human_moodiness.html</ref><br />
<br />
Assumptions Underlying Principal Component Analysis can be found here<ref>http://support.sas.com/publishing/pubcat/chaps/55129.pdf</ref><br />
<br />
==== Example 4 ====<br />
(Not discussed in class) A somewhat well known learning rule in the field of neural networks called Oja's rule can be used to train networks of neurons to compute the principal component directions of data sets. <ref>A Simplified Neuron Model as a Principal Component Analyzer. Erkki Oja. 1982. Journal of Mathematical Biology. 15: 267-273</ref> This rule is formulated as follows<br />
<br />
<math>\,\Delta w = \eta yx -\eta y^2w </math><br />
<br />
where <math>\,\Delta w </math> is the neuron weight change, <math>\,\eta</math> is the learning rate, <math>\,y</math> is the neuron output given the current input, <math>\,x</math> is the current input and <math>\,w</math> is the current neuron weight. This learning rule shares some similarities with another method for calculating principal components: power iteration. The basic algorithm for power iteration (taken from wikipedia: <ref>Wikipedia. http://en.wikipedia.org/wiki/Principal_component_analysis#Computing_principal_components_iteratively</ref>) is shown below <br />
<br />
<br />
<math>\mathbf{p} =</math> a random vector<br />
do ''c'' times:<br />
:<math>\mathbf{t} = 0</math> (a vector of length ''m'')<br />
:for each row <math>\mathbf{x} \in \mathbf{X^T}</math><br />
::<math>\mathbf{t} = \mathbf{t} + (\mathbf{x} \cdot \mathbf{p})\mathbf{x}</math><br />
:<math>\mathbf{p} = \frac{\mathbf{t}}{|\mathbf{t}|}</math><br />
return <math>\mathbf{p}</math><br />
<br />
Comparing this with the neuron learning rule we can see that the term <math>\, \eta y x </math> is very similar to the <math>\,\mathbf{t}</math> update equation in the power iteration method, and identical if the neuron model is assumed to be linear (<math>\,y(x)=x\mathbf{p}</math>) and the learning rate is set to 1. Additionally, the <math>\, -\eta y^2w </math> term performs the normalization, the same function as the <math>\,\mathbf{p}</math> update equation in the power iteration method.<br />
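The power iteration pseudocode above can be sketched directly (a minimal NumPy illustration; rows of <math>\ X </math> are taken as data points, and the synthetic data set is a toy assumption):<br />

```python
import numpy as np

def power_iteration(X, c=100):
    """Approximate the first principal direction of X (rows are data points),
    following the power iteration pseudocode above."""
    rng = np.random.default_rng(0)
    p = rng.normal(size=X.shape[1])          # p = a random vector
    for _ in range(c):                       # do c times
        t = np.zeros(X.shape[1])             # t = 0 (a vector of length m)
        for x in X:                          # for each row x
            t += (x @ p) * x                 # t = t + (x . p) x
        p = t / np.linalg.norm(t)            # p = t / |t|
    return p

rng = np.random.default_rng(1)
s = rng.normal(size=500)
# toy data concentrated along the direction [1, 2]
X = np.column_stack([s, 2 * s]) + 0.01 * rng.normal(size=(500, 2))
p = power_iteration(X)
print(p)  # close to +/- [1, 2] / sqrt(5)
```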
<br />
=== Observations ===<br />
Some observations about the PCA were brought up in class:<br />
<br />
* '''PCA''' assumes that data is on a ''linear subspace'' or close to a linear subspace. For non-linear dimensionality reduction, other techniques are used. Amongst the first proposed techniques for non-linear dimensionality reduction are '''Locally Linear Embedding (LLE)''' and '''Isomap'''. More recent techniques include '''Maximum Variance Unfolding (MVU)''' and '''t-Distributed Stochastic Neighbor Embedding (t-SNE)'''. '''Kernel PCAs''' may also be used, but they depend on the type of kernel used and generally do not work well in practice. (Kernels will be covered in more detail later in the course.)<br />
<br />
* Finding the number of PCs to use is not straightforward. It requires knowledge about the ''intrinsic dimensionality of the data''. In practice, oftentimes a heuristic approach is adopted by looking at the eigenvalues ordered from largest to smallest. If there is a "dip" in the magnitude of the eigenvalues, the "dip" is used as a cut off point and only the large eigenvalues before the "dip" are used. Otherwise, it is possible to add up the eigenvalues from largest to smallest until a certain percentage value is reached. This percentage value represents the percentage of variance that is preserved when projecting onto the PCs corresponding to the eigenvalues that have been added together to achieve the percentage. <br />
<br />
* It is a good idea to normalize the variance of the data before applying PCA. This will avoid PCA finding PCs in certain directions due to the scaling of the data, rather than the real variance of the data.<br />
<br />
* PCA can be considered as an unsupervised approach, since the main direction of variation is not known beforehand, i.e. it is not completely certain which dimension the first PC will capture. The PCs found may not correspond to the desired labels for the data set. There are, however, alternate methods for performing supervised dimensionality reduction.<br />
<br />
* (Not in class) Even though the traditional PCA method does not work well on data sets that lie on a non-linear manifold, a revised PCA method, called c-PCA, has been introduced to improve the stability and convergence of intrinsic dimension estimation. The approach first finds a minimal cover (a cover of a set X is a collection of sets whose union contains X as a subset<ref>http://en.wikipedia.org/wiki/Cover_(topology)</ref>) of the data set. Since set covering is an NP-hard problem, the approach only finds an approximation of the minimal cover to reduce the run-time complexity. In each subset of the minimal cover, it applies PCA and filters out the noise in the data. Finally, the global intrinsic dimension can be determined from the variance results from all the subsets. The algorithm produces robust results.<ref>Mingyu Fan, Nannan Gu, Hong Qiao, Bo Zhang, Intrinsic dimension estimation of data by principal component analysis, 2010. Available: http://arxiv.org/abs/1002.2050</ref><br />
<br />
*(Not in class) While PCA finds the mathematically optimal method (as in minimizing the squared error), it is sensitive to outliers: since PCA minimizes the squared error, outliers produce large errors that can dominate the result. It is therefore common practice to remove outliers before computing PCA. However, in some contexts, outliers can be difficult to identify. For example, in data mining algorithms like correlation clustering, the assignment of points to clusters and outliers is not known beforehand. A recently proposed generalization of PCA based on a '''Weighted PCA''' increases robustness by assigning different weights to data objects based on their estimated relevancy.<ref>http://en.wikipedia.org/wiki/Principal_component_analysis</ref><br />
<br />
* (Not in class) Comparison between PCA and LDA: Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) are two commonly used techniques for data classification and dimensionality reduction. "Linear Discriminant Analysis easily handles the case where the within-class frequencies are unequal and their performance has been examined on randomly generated test data. This method maximizes the ratio of between-class variance to the within-class variance in any particular data set thereby guaranteeing maximal separability. ... The prime difference between LDA and PCA is that PCA does more of feature classification and LDA does data classification. In PCA, the shape and location of the original data sets changes when transformed to a different space whereas LDA doesn't change the location but only tries to provide more class separability and draw a decision region between the given classes. This method also helps to better understand the distribution of the feature data." <ref> Balakrishnama, S., Ganapathiraju, A. LINEAR DISCRIMINANT ANALYSIS - A BRIEF TUTORIAL. http://www.isip.piconepress.com/publications/reports/isip_internal/1998/linear_discrim_analysis/lda_theory.pdf </ref><br />
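The eigenvalue heuristics above, normalizing the data and then keeping enough PCs to preserve a chosen fraction of the variance, can be sketched in NumPy (a minimal sketch; the toy data and the 90% variance threshold are illustrative assumptions):<br />

```python
import numpy as np

# Toy data: 200 points in 5 dimensions that lie close to a 2-D subspace.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 5)) \
    + 0.01 * rng.normal(size=(200, 5))

# Normalize (center and scale each feature) before PCA, as recommended above.
Xc = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigenvalues of the covariance matrix, ordered from largest to smallest.
eigvals = np.linalg.eigvalsh(np.cov(Xc.T))[::-1]

# Cumulative fraction of variance preserved by the top PCs; keep enough
# components to preserve 90% of the total variance.
explained = np.cumsum(eigvals) / eigvals.sum()
n_pcs = int(np.searchsorted(explained, 0.90) + 1)
```

Plotting <code>eigvals</code> would also reveal the "dip" described above, since all but the top few eigenvalues are close to zero for data near a low-dimensional subspace.<br />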
<br />
=== Summary ===<br />
The PCA algorithm can be summarized into the following steps:<br />
<br />
# '''Recover basis'''<br />
#: <math>\ \text{ Calculate } XX^T=\Sigma_{i=1}^{t}x_ix_{i}^{T} \text{ and let } U=\text{ eigenvectors of } XX^T \text{ corresponding to the largest } d \text{ eigenvalues.} </math><br />
# '''Encode training data'''<br />
#: <math>\ \text{Let } Y=U^TX \text{, where } Y \text{ is a } d \times t \text{ matrix of encodings of the original data.} </math><br />
# '''Reconstruct training data'''<br />
#: <math> \hat{X}=UY=UU^TX </math>.<br />
# '''Encode test example'''<br />
#: <math>\ y = U^Tx \text{ where } y \text{ is a } d\text{-dimensional encoding of } x </math>.<br />
# '''Reconstruct test example'''<br />
#: <math> \hat{x}=Uy=UU^Tx </math>.<br />
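The five steps above can be sketched in NumPy (a minimal sketch; the toy data are an illustrative assumption, and ''k'' is used here for the reduced dimension to avoid clashing with the original dimension ''d''):<br />

```python
import numpy as np

rng = np.random.default_rng(1)
d, t, k = 4, 50, 2                      # original dim, samples, reduced dim
X = rng.normal(size=(d, t))
X = X - X.mean(axis=1, keepdims=True)   # center the data

# 1) Recover basis: eigenvectors of X X^T for the k largest eigenvalues.
eigvals, eigvecs = np.linalg.eigh(X @ X.T)
U = eigvecs[:, ::-1][:, :k]             # columns sorted by decreasing eigenvalue

# 2) Encode training data: Y = U^T X  (a k x t matrix of encodings).
Y = U.T @ X

# 3) Reconstruct training data: X_hat = U Y = U U^T X.
X_hat = U @ Y

# 4) Encode a test example: y = U^T x.
x = rng.normal(size=(d,))
y = U.T @ x

# 5) Reconstruct the test example: x_hat = U y = U U^T x.
x_hat = U @ y
```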
<br />
=== Dual PCA ===<br />
<br />
Singular value decomposition allows us to formulate the principal components algorithm entirely in terms of dot products between data points, limiting the direct dependence on the original dimensionality ''d''. Now assume that the dimensionality ''d'' of the ''d × n'' matrix of data X is large (i.e., ''d >> n''). In this case, the algorithm described in the previous sections becomes impractical. We would prefer a run time that depends only on the number of training examples ''n'', or that at least has a reduced dependence on ''d''.<br />
Note that in the SVD factorization <math>\ X = U \Sigma V^T </math>, the eigenvectors in <math>\ U </math> corresponding to non-zero singular values in <math>\ \Sigma </math> (square roots of eigenvalues) are in a one-to-one correspondence with the eigenvectors in <math>\ V </math> .<br />
After performing dimensionality reduction on <math>\ U </math> and keeping only the first ''l'' eigenvectors, corresponding to the top ''l'' non-zero singular values in <math>\ \Sigma </math>, these eigenvectors will still be in a one-to-one correspondence with the first ''l'' eigenvectors in <math>\ V </math> : <br />
<br />
<math>\ X V = U \Sigma </math><br />
<br />
<math>\ \Sigma </math> is square and invertible, because its diagonal has non-zero entries. Thus, the following conversion between the top ''l'' eigenvectors can be derived:<br />
<br />
<math>\ U = X V \Sigma^{-1} </math><br />
<br />
Now, replacing <math>\ U </math> with <math>\ X V \Sigma^{-1} </math> gives us the dual form of PCA.<br />
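A minimal NumPy sketch of this dual computation (the data sizes are illustrative assumptions): rather than eigendecomposing the large ''d'' × ''d'' matrix <math>XX^T</math>, we eigendecompose the small ''n'' × ''n'' matrix <math>X^TX</math> to obtain <math>V</math> and <math>\Sigma</math>, then recover <math>U = XV\Sigma^{-1}</math>:<br />

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, l = 100, 10, 3          # high dimension, few samples, keep l components
X = rng.normal(size=(d, n))
X = X - X.mean(axis=1, keepdims=True)

# Eigendecompose the small n x n Gram matrix instead of the d x d covariance.
gram_vals, V = np.linalg.eigh(X.T @ X)
order = np.argsort(gram_vals)[::-1][:l]
V = V[:, order]
sigma = np.sqrt(gram_vals[order])        # singular values of X

# Recover the top l principal directions: U = X V Sigma^{-1}.
U = X @ V / sigma
```

The recovered columns of <code>U</code> are orthonormal, exactly as if they had been computed from the ''d'' × ''d'' matrix <math>XX^T</math> directly.<br />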
<br />
== Fisher Discriminant Analysis (FDA) (Lecture: Sep. 29, 2011 - Oct. 04, 2011) ==<br />
<br />
'''Fisher Discriminant Analysis (FDA)''' is sometimes called ''Fisher Linear Discriminant Analysis (FLDA)'' or just ''Linear Discriminant Analysis (LDA)''. This causes confusion with the [[#LDA | ''Linear Discriminant Analysis (LDA)'']] technique covered earlier in the course. The LDA technique covered earlier in the course has a normality assumption and is a boundary finding technique. The FDA technique outlined here is a supervised feature extraction technique. FDA differs from PCA as well because PCA does not use the class labels, <math>\ y_i</math>, of the data <math>\ (x_i,y_i)</math> while FDA organizes data into their ''classes'' by finding the direction of maximum separation between classes.<br />
<br />
<br />
=== PCA ===<br />
<br />
- Find a rank-''d'' subspace which minimizes the squared reconstruction error:<br />
<br />
<math> \sum_{i=1}^{n} \| x_i - \hat{x}_i \|^2</math><br />
<br />
where <math>\hat{x}_i </math> is the projection of the original data point <math>x_i</math> onto the subspace.<br />
<br />
<br />
One main drawback of the PCA technique is that the direction of greatest variation may not produce the classification we desire. For example, imagine if the [[#Example 2 | data set]] above had a lighting filter applied to a random subset of the images. Then the greatest variation would be the brightness and not the more important variations we wish to classify. As another example, imagine 2 cigar-like clusters in 2 dimensions, one cigar with <math>y = 1</math> and the other with <math>y = -1</math>. The cigars are positioned in parallel and very closely together, such that the variance in the total data set, ignoring the labels, is in the direction of the cigars. For classification, this would be a terrible projection, because all labels get evenly mixed and we destroy the useful information. A much more useful projection is orthogonal to the cigars, i.e. in the direction of least overall variance, which would perfectly separate the data cases (obviously, we would still need to perform classification in this 1-D space). See figure below <ref>www.ics.uci.edu/~welling/classnotes/papers_class/Fisher-LDA.pdf</ref>. FDA circumvents this problem by using the labels, <math>\ y_i</math>, of the data <math>\ (x_i,y_i)</math>, i.e. FDA uses ''supervised learning''.<br />
The main difference between FDA and PCA is that in PCA we are interested in transforming the data to a new coordinate system such that the greatest variance of the data lies on the first coordinate, whereas in FDA we project the data of each class onto a point in such a way that the resulting points are as far apart from each other as possible. The FDA goal is achieved by projecting the data onto a suitably chosen line that minimizes the within-class variance and maximizes the distance between the two classes, i.e. it groups similar data together and spreads different data apart. This way, newly acquired data can be compared, after a transformation, to these projections using some well-chosen metric.<br />
<br />
[[File:Classification.jpg | Two cigar distributions where the direction of greatest variance is not the most useful for classification]]<br />
<br />
We first consider the case of two classes. Denote the mean and covariance matrix of class <math>i=0,1</math> by <math>\mathbf{\mu}_i</math> and <math>\mathbf{\Sigma}_i</math> respectively. We transform the data so that it is projected into 1 dimension, i.e. a scalar value. To do this, we compute the inner product of our <math>d \times 1</math>-dimensional data, <math>\mathbf{x}</math>, with a to-be-determined <math>d \times 1</math>-dimensional vector <math>\mathbf{w}</math>. The new means and covariances of the transformed data are:<br />
<br />
::<math> \mu'_i:\rightarrow \mathbf{w}^{T}\mathbf{\mu}_i </math> <br/><br />
::<math> \Sigma'_i :\rightarrow \mathbf{w}^{T}\mathbf{\Sigma}_i \mathbf{w}</math><br />
<br />
The new means and variances are actually scalar values now, but we will use vector and matrix notation and arguments throughout the following derivation as the multi-class case is then just a simpler extension.<br />
<br />
===Goals of FDA===<br />
<br />
As will be shown in the objective function, the goal of FDA is to maximize the separation of the classes (between-class variance) and minimize the scatter within each class (within-class variance). That is, our ideal situation is that the individual classes are as far away from each other as possible and, at the same time, the data within each class are as close to each other as possible (collapsed to a single point in the most extreme case). An interesting note is that R. A. Fisher, after whom FDA is named, used the FDA technique for purposes of taxonomy, in particular for categorizing different species of iris flowers. <ref name="RAFisher">R. A. Fisher, "The Use of Multiple Measurements in Taxonomic Problems," ''Annals of Eugenics'', 1936</ref>. It is very easy to visualize what is meant by within-class variance (i.e. differences between the iris flowers of the same species) and between-class variance (i.e. the differences between iris flowers of different species) in that case.<br />
<br />
First, we need to reduce the dimensionality of the covariates to one dimension (two-class case) by projecting the data onto a line. That is, take the ''d''-dimensional input values x and project them to one dimension using <math>z=\mathbf{w}^T \mathbf{x}</math> where <math>\mathbf{w}^T </math> is 1 by ''d'' and <math>\mathbf{x}</math> is ''d'' by 1.<br />
<br />
Goal: choose the vector <math>\mathbf{w}=[w_1,w_2,w_3,...,w_d]^T </math> that best separates the data; then we perform classification with the projected data <math>z</math> instead of the original data <math>\mathbf{x}</math>.<br />
<br />
<br />
<math>\hat{{\mu}_0}=\frac{1}{n_0}\sum_{i:y_i=0} x_i</math><br />
<br />
<math>\hat{{\mu}_1}=\frac{1}{n_1}\sum_{i:y_i=1} x_i</math><br />
<br />
<math>\mathbf{x}\rightarrow\mathbf{w}^{T}\mathbf{x}</math>. <br /><br />
<math>\mathbf{\mu}\rightarrow\mathbf{w}^{T}\mathbf{\mu}</math>.<br /><br />
<math>\mathbf{\Sigma}\rightarrow\mathbf{w}^{T}\mathbf{\Sigma}\mathbf{w}</math> <br /><br />
<br />
<br />
<br />
<br />
'''1)''' Our '''first''' goal is to minimize the individual classes' covariance. This will help to collapse the data together. <br />
We have two minimization problems<br />
<br />
::<math>\min_{\mathbf{w}} \mathbf{w}^{T} \mathbf{\Sigma}_0 \mathbf{w}</math> <br />
and <br />
::<math>\min_{\mathbf{w}} \mathbf{w}^{T} \mathbf{\Sigma}_1 \mathbf{w}</math>.<br />
<br />
But these can be combined:<br />
::<math> \min_{\mathbf{w}} \mathbf{w} ^{T}\mathbf{\Sigma}_0 \mathbf{w} + \mathbf{w}^{T} \mathbf{\Sigma}_1 \mathbf{w}</math> <br />
:: <math> = \min_{\mathbf{w}} \mathbf{w} ^{T}( \mathbf{\Sigma_0} + \mathbf{\Sigma_1} ) \mathbf{w}</math><br />
<br />
Define <math> \mathbf{S}_W =\mathbf{\Sigma_0} + \mathbf{\Sigma_1} </math>, called the ''within class variance matrix''. <br />
<br />
'''2)''' Our '''second''' goal is to move the minimized classes as far away from each other as possible. One way to accomplish this is to maximize the distances between the means of the transformed data i.e.<br />
<br />
<math> \max_{\mathbf{w}} |\mathbf{w}^{T}\mathbf{\mu}_0 - \mathbf{w}^{T}\mathbf{\mu}_1|^2 </math><br />
<br />
Simplifying:<br />
::<math> \max_{\mathbf{w}} \,(\mathbf{w}^{T}\mathbf{\mu}_0 - \mathbf{w}^{T}\mathbf{\mu}_1)^T (\mathbf{w}^{T}\mathbf{\mu}_0 - \mathbf{w}^{T}\mathbf{\mu}_1) </math> <br/><br />
::<math> = \max_{\mathbf{w}}\, (\mathbf{\mu}_0-\mathbf{\mu}_1)^{T}\mathbf{w} \mathbf{w}^{T} (\mathbf{\mu}_0-\mathbf{\mu}_1)</math> <br/><br />
::<math> = \max_{\mathbf{w}} \,\mathbf{w}^{T}(\mathbf{\mu}_0-\mathbf{\mu}_1)(\mathbf{\mu}_0-\mathbf{\mu}_1)^{T}\mathbf{w}</math><br />
<br />
Recall that <math> \mathbf{\mu}_i </math> are known. Denote<br />
<br />
::<math> \mathbf{S}_B = (\mathbf{\mu}_0-\mathbf{\mu}_1)(\mathbf{\mu}_0-\mathbf{\mu}_1)^{T}</math> <br />
<br />
This matrix, called the ''between class variance matrix'', is a rank 1 matrix, so an inverse does not exist. Altogether, we have two optimization problems we must solve simultaneously:<br />
<br />
::1) <math> \min_{\mathbf{w}} \mathbf{w}^{T} \mathbf{S_W} \mathbf{w} </math><br/><br />
::2) <math> \max_{\mathbf{w}} \mathbf{w}^{T} \mathbf{S_B} \mathbf{w} </math><br />
<br />
There are other metrics one can use to both minimize the data's variance and maximize the distance between classes, and other goals we can try to accomplish (see metric learning, below...one day), but Fisher used this elegant method, hence his recognition in the name, and we will follow his method.<br />
<br />
We can combine the two optimization problems into one after noting that the negative of max is min:<br />
<br />
::<math> \max_{\mathbf{w}} \; \alpha \mathbf{w}^{T} \mathbf{S_B} \mathbf{w} - \mathbf{w}^{T} \mathbf{S_W} \mathbf{w} </math><br/><br />
<br />
The <math>\alpha</math> coefficient is a necessary scaling factor: if the scale of one of the terms is much larger than the other, the optimization problem will be dominated by the larger term. This means we have another unknown, <math>\alpha</math>, to solve for. Instead, we can circumvent the scaling problem by looking at the ratio of the quantities, the original solution Fisher proposed:<br />
<br />
::<math> \max_{\mathbf{w}} \frac{\mathbf{w}^{T} \mathbf{S_B} \mathbf{w}}{\mathbf{w}^{T} \mathbf{S_W} \mathbf{w}} </math><br />
<br />
This optimization problem can be shown<ref><br />
http://www.socher.org/uploads/Main/optimizationTutorial01.pdf<br />
</ref> to be equivalent to the following optimization problem:<br />
<br />
:: <math> \max_{\mathbf{w}} \mathbf{w}^{T} \mathbf{S_B} \mathbf{w}</math> <br /><br />
(optimized function)<br />
<br />
subject to:<br />
<br />
:: <math> {\mathbf{w}^{T} \mathbf{S_W} \mathbf{w}} = 1 </math><br /><br />
(constraint)<br />
<br />
A heuristic understanding of this equivalence is that <math>\mathbf{w}</math> has two degrees of freedom: direction and magnitude. Since only the direction matters for the projection, we can fix the magnitude by setting one of the quantities to a constant. We can use Lagrange multipliers to solve this optimization problem:<br />
<br />
::<math>L( \mathbf{w}, \lambda) = \mathbf{w}^{T} \mathbf{S_B} \mathbf{w} - \lambda(\mathbf{w}^{T} \mathbf{S_W} \mathbf{w}-1)</math><br />
:: <math> \Rightarrow \frac{\partial L}{\partial \mathbf{w}} = 2 \mathbf{S}_B \mathbf{w} - 2\lambda \mathbf{S}_W\mathbf{w} </math><br />
<br />
Setting the partial derivative to 0 gives us a ''generalized eigenvalue problem'':<br />
<br />
::<math> \mathbf{S}_B \mathbf{w} = \lambda \mathbf{S}_W \mathbf{w} </math><br />
:: <math> \Rightarrow \mathbf{S}_W^{-1} \mathbf{S}_B \mathbf{w} = \lambda \mathbf{w} </math><br />
<br />
This is a generalized eigenvalue problem and <math>\ \mathbf{w} </math> can be computed as the eigenvector corresponding to the largest eigenvalue of <br />
:: <math> \mathbf{S}_W^{-1} \mathbf{S}_B </math><br />
<br />
It is very likely that <math> \mathbf{S}_W </math> has an inverse. If not, the pseudo-inverse<ref><br />
http://en.wikipedia.org/wiki/Generalized_inverse<br />
</ref><ref><br />
http://www.mathworks.com/help/techdoc/ref/pinv.html<br />
</ref> can be used. In Matlab the pseudo-inverse function is named ''pinv''. Thus, we should choose <math>\mathbf{w}</math> to equal the eigenvector of the largest eigenvalue as our projection vector. <br />
<br />
In fact we can simplify the above expression further in the case of two classes. Recall the definition of <math>\mathbf{S}_B = (\mathbf{\mu}_0-\mathbf{\mu}_1)(\mathbf{\mu}_0-\mathbf{\mu}_1)^{T}</math>. Substituting this into our expression:<br />
<br />
::<math> \mathbf{S}_W^{-1}(\mathbf{\mu}_0-\mathbf{\mu}_1)(\mathbf{\mu}_0-\mathbf{\mu}_1)^{T} \mathbf{w} = \lambda \mathbf{w} </math><br />
::<math> (\mathbf{S}_W^{-1}(\mathbf{\mu}_0-\mathbf{\mu}_1) ) ((\mathbf{\mu}_0-\mathbf{\mu}_1)^{T} \mathbf{w}) = \lambda \mathbf{w} </math><br />
<br />
This second term is a scalar value, let's denote it <math>\beta</math>. Then<br />
::<math> \mathbf{S}_W^{-1}(\mathbf{\mu}_0-\mathbf{\mu}_1) = \frac{\lambda}{\beta} \mathbf{w} </math><br />
::<math> \Rightarrow \, \mathbf{S}_W^{-1}(\mathbf{\mu}_0-\mathbf{\mu}_1) \propto \mathbf{w} </math><br />
<br /><br />
(this equation indicates the direction of the separation).<br />
All we are interested in is the direction of <math>\mathbf{w}</math>, so computing this is sufficient to find our projection vector. This will not work in higher dimensions, however, as <math>\mathbf{w}</math> would then be a matrix rather than a vector.<br />
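The two-class recipe above can be sketched in NumPy (a minimal sketch on synthetic "cigar" data of the kind described earlier; <code>pinv</code> stands in for the inverse in case <math>\mathbf{S}_W</math> is singular, as suggested above):<br />

```python
import numpy as np

rng = np.random.default_rng(3)

# Two elongated ("cigar") clusters, offset orthogonally to their long axis,
# so the direction of greatest total variance is useless for classification.
cov = np.array([[5.0, 0.0], [0.0, 0.1]])
X0 = rng.multivariate_normal([0.0, 0.0], cov, size=100)
X1 = rng.multivariate_normal([0.0, 2.0], cov, size=100)

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
S_W = np.cov(X0.T) + np.cov(X1.T)        # within-class variance matrix

# Fisher direction: w proportional to S_W^{-1} (mu0 - mu1).
w = np.linalg.pinv(S_W) @ (mu0 - mu1)
w = w / np.linalg.norm(w)

# Project onto w and classify by the midpoint of the projected means.
threshold = w @ (mu0 + mu1) / 2.0
pred0 = X0 @ w > threshold   # True -> classified as class 0 (correct)
pred1 = X1 @ w > threshold   # True -> classified as class 0 (incorrect)
```

Here <code>w</code> points (almost) orthogonally to the cigars, so the projection separates the classes even though it is the direction of least overall variance.<br />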
<br />
=== Extensions to Multiclass Case ===<br />
If we have <math>\ k</math> classes, we need <math>\ k-1</math> directions, i.e. we need to project <math>\ k</math> 'points' onto a <math>\ k-1</math> dimensional hyperplane. What does this change in our above derivation? The most significant difference is that our projection vector, <math>\mathbf{w}</math>, is no longer a vector but instead a matrix <math>\mathbf{W}</math>, where <math>\mathbf{W}</math> is a ''d'' × (''k''-1) matrix if <math>X</math> is ''d''-dimensional. We transform the data as:<br />
<br />
::<math> \mathbf{x}' :\rightarrow \mathbf{W}^{T} \mathbf{x}</math><br />
so our new mean and covariances for class k are:<br />
::<math> \mathbf{\mu_k}' :\rightarrow \mathbf{W}^{T} \mathbf{\mu_k}</math><br />
::<math> \mathbf{\Sigma_k}' :\rightarrow \mathbf{W}^{T} \mathbf{\Sigma_k} \mathbf{W}</math><br />
<br />
What are our new optimization sub-problems? As before, we wish to minimize the within class variance. This can be formulated as:<br />
::<math>\min_{\mathbf{W}} \mathbf{W}^{T} \mathbf{\Sigma_1} \mathbf{W} + \dots + \mathbf{W}^{T} \mathbf{\Sigma_k} \mathbf{W} </math><br />
<br />
Again, denoting <math>\mathbf{S}_W = \mathbf{\Sigma_1} + \dots + \mathbf{\Sigma_k}</math>, we can simplify the above expression:<br />
<br />
::<math>\min_{\mathbf{W}} \mathbf{W}^{T} \mathbf{S}_W \mathbf{W} </math><br />
<br />
Similarly, the second optimization problem is:<br />
<br />
::<math>\max_{\mathbf{W}} \mathbf{W}^{T} \mathbf{S}_B \mathbf{W} </math><br />
<br />
What is <math>\mathbf{S}_B</math> in this case? It can be shown that <math>\mathbf{S}_T = \mathbf{S}_B + \mathbf{S}_W </math> where <math> \mathbf{S}_T </math> is the covariance matrix of all the data. From this we can compute <math> \mathbf{S}_B </math>. <br />
<br />
Next, if we express <math> \mathbf{W} = ( \mathbf{w}_1 , \mathbf{w}_2 , \dots ,\mathbf{w}_{k-1} ) </math> observe that, for <math> \mathbf{A} = \mathbf{S}_B , \mathbf{S}_W </math>: <br />
<br />
::<math> Tr(\mathbf{W}^{T} \mathbf{A} \mathbf{W}) = \mathbf{w}_1^{T} \mathbf{A} \mathbf{w}_1 + \dots + \mathbf{w}_{k-1}^{T} \mathbf{A} \mathbf{w}_{k-1} </math><br />
<br />
where <math>\ Tr()</math> is the trace of a matrix. Thus, following the same steps as in the two-class case, we have the new optimization problem:<br />
<br />
::<math> \max_{\mathbf{W}} \frac{ Tr(\mathbf{W}^{T} \mathbf{S}_B \mathbf{W}) }{Tr(\mathbf{W}^{T} \mathbf{S}_W \mathbf{W})} </math> <br />
<br />
As in the two-class case, this is equivalent to maximizing the numerator subject to:<br />
<br />
:: <math> Tr( \mathbf{W}^{T} \mathbf{S_W} \mathbf{W}) = 1 </math><br />
<br />
The first ''k''-1 eigenvectors of <math> \mathbf{S}_W^{-1} \mathbf{S}_B </math> give the required ''k''-1 directions; this is why, for the ''k''-class problem, we project the data onto ''k''-1 directions.<br />
<br />
Again, in order to solve the above optimization problem, we can use the Lagrange multiplier <ref><br />
http://en.wikipedia.org/wiki/Lagrange_multiplier </ref>:<br />
<br />
:: <math>\begin{align}L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - \Lambda\left\{ Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}] - 1 \right\}\end{align}</math>.<br />
<br />
where <math>\ \Lambda</math> is a d by d diagonal matrix.<br />
<br />
Then, differentiating with respect to <math>\mathbf{W}</math>:<br />
<br />
:: <math>\begin{align}\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - \Lambda (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}\end{align} = 0</math>.<br />
<br />
Thus:<br />
<br />
:: <math>\begin{align}\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{S}_{W}\mathbf{W}\end{align}</math><br />
<br />
:: <math>\begin{align}\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{W}\end{align}</math><br />
<br />
where, <math> \mathbf{\Lambda} =\begin{pmatrix}\lambda_{1} & & 0\\&\ddots&\\0 & &\lambda_{d}\end{pmatrix}</math><br />
<br />
The above equation is of the form of an eigenvalue problem. Thus, for the solution, the k-1 eigenvectors corresponding to the k-1 largest eigenvalues should be chosen as the projection matrix, <math>\mathbf{W}</math>. In fact, there should only be k-1 eigenvectors corresponding to k-1 non-zero eigenvalues using the above equation.<br />
<br />
=== Summary ===<br />
FDA has two optimization problems:<br />
::1) <math> \min_{\mathbf{w}} \mathbf{w}^{T} \mathbf{S_W} \mathbf{w} </math><br/><br />
::2) <math> \max_{\mathbf{w}} \mathbf{w}^{T} \mathbf{S_B} \mathbf{w} </math> <br />
<br />
where <math>\mathbf{S}_W = \mathbf{\Sigma_1} + \dots + \mathbf{\Sigma_k}</math> is called the within class variance and <math>\ \mathbf{S}_B = \mathbf{S}_T - \mathbf{S}_W </math> is called the between class variance where <math>\mathbf{S}_T </math> is the variance of all the data together.<br />
<br />
Every column of <math> \mathbf{w} </math> is parallel to a single eigenvector.<br />
<br />
The two optimization problems are combined as follows:<br />
::<math> \max_{\mathbf{w}} \frac{\mathbf{w}^{T} \mathbf{S_B} \mathbf{w}}{\mathbf{w}^{T} \mathbf{S_W} \mathbf{w}} </math><br />
<br />
By adding a constraint as shown:<br />
::<math> \max_{\mathbf{w}} \mathbf{w}^{T} \mathbf{S_B} \mathbf{w}</math><br />
<br />
subject to:<br />
:: <math> \mathbf{w}^{T} \mathbf{S_W} \mathbf{w} = 1 </math><br />
<br />
Lagrange multipliers can be used and essentially the problem becomes an eigenvalue problem:<br />
<br />
::<math>\begin{align}\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w} = \lambda\mathbf{w}\end{align}</math><br />
<br />
And the projection matrix <math>\ \mathbf{W} </math> can be formed from the k-1 eigenvectors corresponding to the largest k-1 eigenvalues of <math> \mathbf{S}_W^{-1} \mathbf{S}_B </math>.<br />
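A minimal NumPy sketch of the multiclass procedure summarized above (the three-class synthetic data and dimensions are illustrative assumptions; <math>\mathbf{S}_B</math> is obtained as <math>\mathbf{S}_T - \mathbf{S}_W</math>, as in the summary):<br />

```python
import numpy as np

rng = np.random.default_rng(4)
k, d = 3, 4                                   # classes, original dimension
means = [np.zeros(d), np.r_[4.0, 0, 0, 0], np.r_[0, 4.0, 0, 0]]
classes = [rng.normal(size=(60, d)) + m for m in means]

X = np.vstack(classes)
S_W = sum(np.cov(c.T) for c in classes)       # within-class variance
S_T = np.cov(X.T)                             # variance of all the data
S_B = S_T - S_W                               # between-class variance

# The k-1 eigenvectors of S_W^{-1} S_B with the largest eigenvalues.
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
order = np.argsort(eigvals.real)[::-1][:k - 1]
W = eigvecs[:, order].real                    # d x (k-1) projection matrix

Z = X @ W                                     # data projected onto k-1 dims
```

The three projected class means remain well separated in the (k-1)-dimensional space, which is exactly the property FDA optimizes for.<br />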
<br />
=== Variations ===<br />
<br />
Some adaptations and extensions exist for the FDA technique (Source: <ref>R. Gutierrez-Osuna, "Linear Discriminant Analysis" class notes for Intro to Pattern Analysis, Texas A&M University. Available: [http://research.cs.tamu.edu/prism/lectures/pr/pr_l10.pdf]</ref>):<br />
<br />
1) ''Non-Parametric LDA (NPLDA)'' by Fukunaga<br />
<br />
This method does not assume that the Gaussian distribution is unimodal and it is actually possible to extract more than k-1 features (where k is the number of classes).<br />
<br />
2) ''Orthonormal LDA (OLDA)'' by Okada and Tomita<br />
<br />
This method finds projections that are orthonormal in addition to maximizing the FDA objective function. This method can also extract more than k-1 features (where k is the number of classes).<br />
<br />
3) ''Generalized LDA (GLDA)'' by Lowe<br />
<br />
This method incorporates additional cost functions into the FDA objective function. This causes classes with a higher cost to be placed further apart in the lower dimensional representation.<br />
<br />
=== Optical Character Recognition (OCR) using FDA ===<br />
Optical Character Recognition (OCR) is a method to translate scanned, human-readable text into machine-encoded text. In class, we have employed FDA to recognize digits. A paper <ref>Manjunath Aradhya, V.N., Kumar, G.H., Noushath, S., Shivakumara, P., "Fisher Linear Discriminant Analysis based Technique Useful for Efficient Character Recognition", Intelligent Sensing and Information Processing, 2006.</ref> describes the use of FDA to recognize printed documents written in English and Kannada, the fifth most popular language in India. The researchers conducted two types of experiments: one on printed Kannada and English documents and another on handwritten English characters. The first type comprised four experiments: i) clear and degraded characters in specific fonts; ii) characters in various sizes; iii) characters in various fonts; iv) characters with noise. In experiment i, FDA achieved a 98.2% recognition rate with 12 projection vectors on 21,560 samples. In experiment ii, it achieved a 96.9% recognition rate with 10 projection vectors on 11,200 samples. In experiment iii, it achieved a 93% recognition rate with 17 projection vectors on 19,850 samples. In experiment iv, it achieved a 96.3% recognition rate with 14 projection vectors on 20,000 samples. Overall, the recognition performance of FDA was very satisfactory. In the second type of experiment, a total of 12,400 handwriting samples from 200 different writers were collected. With 175 samples used for training, the recognition rate of FDA was 92% with 35 projection vectors.<br />
<br />
=== Facial Recognition using FDA ===<br />
<br />
The Fisherfaces method of facial recognition uses PCA and FDA together in a similar way to using PCA alone. However, it is more advantageous than PCA alone because it minimizes variation within each class and maximizes class separation. The PCA-only method is, therefore, more sensitive to lighting and pose variations. In studies done by Belhumeur, Hespanha, and Kriegman (1997) and Turk and Pentland (1991), this method had a 96% recognition rate. <ref>Bagherian, Elham. Rahmat, Rahmita. Facial Feature Extraction for Face Recognition: a Review. International Symposium on Information Technology, 2008. ITSim2 article number 4631649.</ref><br />
<br />
== Linear and Logistic Regression (Lecture: Oct. 06, 2011) ==<br />
<br />
=== Linear Regression ===<br />
<br />
Both regression and classification aim to find a function ''h'' which maps the data ''X'' to a target ''Y''. In regression, <math>\ y </math> is a continuous variable, while in classification <math>\ y </math> is a discrete variable. In linear regression, data are modeled using a linear function, and unknown parameters are estimated from the data. Regression problems are easier to formulate into functions (since <math>\ y </math> is continuous), and it is possible to solve classification problems by treating them like regression problems. To do so, the requirement that <math>\ y </math> be discrete must first be relaxed. Once <math>\ y </math> has been estimated using regression techniques, the discrete class corresponding to <math>\ y </math> can be determined to solve the original classification problem: a threshold is defined, where <math>\ y </math> values below the threshold belong to one class and <math>\ y </math> values above the threshold belong to the other.<br />
<br />
When running a regression we are making two assumptions:<br />
<br />
# A linear relationship exists between two variables (i.e. X and Y) <br />
# This relationship is additive (i.e. <math>Y= f_1(x_1) + f_2(x_2) + \dots + f_n(x_n)</math>). Technically, linear regression estimates how much Y changes when X changes by one unit. <br />
<br />
<br />
More formally: a more direct approach to classification is to estimate the regression function <math>\ r(\mathbf{x}) = E[Y | X]</math> without bothering to estimate <math>\ f_k(\mathbf{x}) </math>. For the linear model, we assume that either the regression function <math>r(\mathbf{x})</math> is linear, or the linear model has a reasonable approximation.<br />
<br />
Here is a simple example. If <math>\ Y = \{0,1\}</math> (a two-class problem), then <math>\, h^*(\mathbf{x})= \left\{\begin{matrix}<br />
1 &\text{, if } \hat r(\mathbf{x})>\frac{1}{2} \\<br />
0 &\mathrm{, otherwise} \end{matrix}\right.</math><br />
<br />
Basically, we can use a linear function<br />
<math>\ f(\mathbf{x_i}, \mathbf{\beta}) = \mathbf{\beta\,}^T \mathbf{x_{i}} + \mathbf{\beta\,_0} </math> , <math>\mathbf{x_{i}} \in \mathbb{R}^{d}</math><br />
and use the least squares approach to fit the function to the given data. This is done by minimizing the following expression:<br />
<br />
<math>\min_{\mathbf{\beta}} \sum_{i=1}^n (y_i - \mathbf{\beta}^T<br />
\mathbf{x_{i}} - \mathbf{\beta_0})^2</math><br />
<br />
For convenience, <math>\mathbf{\beta}</math> and <math>\mathbf{\beta}_0</math> can be combined into a d+1 dimensional vector, <math>\tilde{\mathbf{\beta}}</math>. The term ''1'' is appended to <math>\ x </math>. Thus, the function to be minimized can now be re-expressed as:<br />
<br />
<math>\ LS = \min_{\tilde{\beta}} \sum_{i=1}^{n} (y_i - \tilde{\beta}^T \tilde{x_i} )^2 </math><br />
<br />
<math>\ LS = \min_{\tilde{\beta}} || y - X \tilde{\beta} ||^2 </math><br />
<br />
where<br />
<br />
<math>\tilde{\mathbf{\beta}} = \left( \begin{array}{c}\mathbf{\beta_{1}}<br />
<br />
\\ \\<br />
\vdots \\ \\<br />
\mathbf{\beta}_{d} \\ \\<br />
\mathbf{\beta}_{0} \end{array} \right) \in \mathbb{R}^{d+1}</math> and <br />
<br />
<math>\tilde{x} = \left( \begin{array}{c}{x_{1}}<br />
<br />
\\ \\<br />
\vdots \\ \\<br />
{x}_{d} \\ \\<br />
1 \end{array} \right) \in \mathbb{R}^{d+1}</math>.<br />
<br />
where <math>\tilde{\mathbf{\beta}}</math> is a (''d''+1)-by-1 matrix (i.e. a ''d''+1 dimensional vector)<br />
<br />
Here <math>\ y </math> and <math>\tilde{\beta}</math> are vectors and <math>\ X </math> is an ''n'' by ''d''+1 matrix where each row represents a data point with a 1 as the last entry. <math>\ X </math> can also be seen as a matrix<br />
in which each column represents a feature and the <math>\ (d+1)^{th} </math> column is an all-ones vector corresponding to <math>\ \beta_0 </math>.<br />
<br />
<math>\ {\tilde{\beta}}</math> that minimizes the error is:<br />
<br />
<math>\ \frac{\partial LS}{\partial \tilde{\beta}} = -2X^T(y-X\tilde{\beta})=0 </math>, which gives us <math>\ {\tilde{\beta}} = (X^TX)^{-1}X^Ty </math>. When <math>\ X^TX</math> is singular, we have to use the pseudo-inverse to obtain the optimal <math>\ \tilde{\beta}</math>.<br />
<br />
Using regression to solve classification problems is not strictly correct mathematically, but it works well in practice when the problem is not too complicated. When we have only two classes (for which the target values are encoded as <math>\ \frac{-n}{n_1} </math> and <math>\ \frac{n}{n_2} </math>, where <math>\ n_i</math> is the number of data points in class ''i'' and ''n'' is the total number of points in the data set), this method is identical to LDA.<br />
<br />
==== Matlab Example ====<br />
<br />
The following is the code and the explanation for each step.<br />
<br />
Again, we use the data in 2_3.m. <br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
We carry out Principal Component Analysis (PCA) to reduce the dimensionality from 64 to 2.<br />
<br />
>>y = zeros(400,1);<br />
>>y(201:400) = 1;<br />
We let y represent the set of labels coded as 0 and 1.<br />
<br />
>>x=[sample';ones(1,400)];<br />
Construct x by appending a row of ones to the transposed data, so that each column is a data point with a 1 as its last entry.<br />
<br />
>>b=inv(x*x')*x*y;<br />
Calculate b, which represents <math>\tilde{\beta}</math> in the linear regression model.<br />
<br />
>>x1=x';<br />
>>for i=1:400<br />
if x1(i,:)*b>0.5<br />
plot(x1(i,1),x1(i,2),'.')<br />
hold on<br />
elseif x1(i,:)*b < 0.5<br />
plot(x1(i,1),x1(i,2),'r.')<br />
end <br />
end<br />
Plot the data points, coloured by their predicted class (fitted value above or below 0.5).<br />
<br />
[[File: linearregression.png|center|frame|Classification of the data points in 2_3.m by the linear regression model]]<br />
<br />
==== Practical Usefulness ====<br />
Linear regression is generally not very useful for classification. One of the main problems is that adding new data does not always improve the fitted boundary, because the squared-error loss penalizes points that are far from the boundary even when they are correctly classified, while the classes themselves are binary rather than linear. Consider the following simple example:<br />
<br />
[[File: linreg1.jpg|center|frame]]<br />
<br />
The decision boundary at <math>r(x)=0.5</math> was added for visualization purposes. Clearly, linear regression categorizes this data properly. However, consider adding one more datum:<br />
<br />
[[File: linreg2.jpg|center|frame]]<br />
<br />
This datum skews the fitted line to the point that it misclassifies some of the data points that should be labelled '1'. This shows how linear regression cannot adapt well to binary classification problems.<br />
<br />
==== General Guidelines for Building a Regression Model ====<br />
<br />
# Make sure all relevant predictors are included. These are based on your research question, theory and knowledge on the topic.<br />
# Combine those predictors that tend to measure the same thing (i.e. as an index).<br />
# Consider the possibility of adding interactions (mainly for those variables with large effects)<br />
# Strategy to keep or drop variables:<br />
## Predictor not significant and has the expected sign -> Keep it<br />
## Predictor not significant and does not have the expected sign -> Drop it<br />
## Predictor is significant and has the expected sign -> Keep it<br />
## Predictor is significant but does not have the expected sign -> Review, you may need more variables, it may be interacting with another variable in the model or there may be an error in the data.<ref>http://dss.princeton.edu/training/Regression101.pdf</ref><br />
<br />
===Logistic Regression===<br />
<br />
Logistic regression is a more advanced method for classification, and is<br />
more commonly used. <br />
In statistics, logistic regression (sometimes called the logistic model or logit model) is used to predict the probability of occurrence of an event by fitting data to a logistic curve. It is a generalized linear model used for binomial regression. Like many forms of regression analysis, it makes use of several predictor variables that may be either numerical or categorical. For example, the probability that a person has a heart attack within a specified time period might be predicted from knowledge of the person's age, sex and body mass index. Logistic regression is used extensively in the medical and social sciences fields, as well as marketing applications such as prediction of a customer's propensity to purchase a product or cease a subscription.<ref>http://en.wikipedia.org/wiki/Logistic_regression</ref><br />
<br />
We can define a function <br /><br />
<math>f_1(x)= P(Y=1| X=x) = (\frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})</math><br />
[[File:Picture1.png|150px|thumb|right|<math>P(Y=1 | X=x)</math>]]<br />
<br />
<br />
This is a valid conditional density function since the two components (<math>f_1</math> and <math>f_2</math>, shown just below) sum to 1 and remain in [0, 1].<br />
<br />
It looks similar to a step function, but<br />
we have relaxed it so that we have a smooth curve, and can therefore take the<br />
derivative.<br />
<br />
The range of this function is (0,1) since<br /><br/><br />
<math>\lim_{x \to -\infty}f_1(\mathbf{x}) = 0</math> and<br />
<math>\lim_{x \to \infty}f_1(\mathbf{x}) = 1</math>.<br />
<br />
As shown on [http://www.wolframalpha.com/input/?i=Plot%5BE^x/%281+%2B+E^x%29,+{x,+-10,+10}%5D%29 this graph] of <math>\ P(Y=1 | X=x) </math>.<br />
<br />
Then we compute the complement of f1(x), and get<br /><br />
<br />
<math>f_2(x)= P(Y=0| X=x) = 1-f_1(x) = (\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})</math>. <br />
[[File:Picture2.png |150px|thumb|right|<math>P(Y=0 | X=x)</math>]]<br />
<br />
<br />
The function <math>f_2</math> is commonly called the logistic function, and it behaves like <br /><br />
<math>\lim_{x \to -\infty}f_2(\mathbf{x}) = 1</math> and<br /><br />
<math>\lim_{x \to \infty}f_2(\mathbf{x}) = 0</math>.<br />
<br />
As shown on [http://www.wolframalpha.com/input/?i=Plot%5B1/%281+%2B+E^x%29,+{x,+-10,+10}%5D%29 this graph] of <math>\ P(Y=0 | X=x) </math>.<br />
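The limits above are easy to verify numerically. The following sketch (an illustration, not part of the original notes) evaluates both components at a scalar value <math>t = \beta^T x</math>:

```python
import math

def f1(t):
    """P(Y=1 | x) as a function of t = beta^T x."""
    return math.exp(t) / (1.0 + math.exp(t))

def f2(t):
    """P(Y=0 | x) = 1 - f1(t)."""
    return 1.0 / (1.0 + math.exp(t))

# f1 tends to 0 as t -> -inf and to 1 as t -> +inf; f2 does the opposite
print(f1(-20), f1(20))    # close to 0 and 1
print(f2(-20), f2(20))    # close to 1 and 0
# the two components always sum to 1, so they define a valid distribution
print(f1(0.7) + f2(0.7))  # close to 1
```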
<br />
Since <math>f_1</math> and <math>f_2</math> specify the conditional distribution, the Bernoulli distribution is appropriate for specifying the likelihood of the class. Coding the two classes conveniently as 0 and 1, the likelihood of <math>y_i</math> for a given input <math>x_i</math> is<br />
<br />
<math>f(y_i|\mathbf{x_i}) = (f_1(\mathbf{x_i}))^{y_i} (1-f_1(\mathbf{x_i}))^{1-y_i} = (\frac{e^{\mathbf{\beta\,}^T \mathbf{x_i}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})^{y_i} (\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})^{1-y_i}</math><br />
<br />
Thus y takes value 1 with success probability <math>f_1</math> and value 0 with failure probability <math>1 - f_1</math>. We can use this to derive the likelihood for N training observations, and search for the maximizing parameter <math>\beta</math>. <br />
<br />
In general, we can think of the problem as having a box with some knobs. Inside the box is our objective function which gives the form to classify our input (<math>x_i</math>) to<br />
our output (<math>y_i</math>). The knobs in the box are functioning like the parameters of the objective function. Our job is to find the proper parameters that can minimize the error between our output and the true value. So we have turned our machine learning problem into an optimization problem. <br />
<br />
Since we need to find the parameters that maximize the chance of having our observed data coming from the distribution of <math>f (x|\theta)</math>, we need to introduce Maximum Likelihood Estimation.<br />
<br />
====Maximum Likelihood Estimation====<br />
<br />
Suppose we are given iid data points <math>({\mathbf{x}_i})_{i=1}^n</math> from a density <math>f(\mathbf{x}|\mathbf{\theta})</math>, where the form of f is known but the parameters <math>\theta</math> are unknown. The maximum likelihood estimate <math>\theta\,_{ML}</math> is the set of parameters that maximizes the probability of observing <math>({\mathbf{x}_i})_{i=1}^n</math>. For example, we may know that the data come from a Gaussian distribution but not know its mean and variance. <br />
<br />
<math>\theta_\mathrm{ML} = \underset{\theta}{\operatorname{arg\,max}}\ f(\mathbf{x}|\theta)</math>.<br />
<br />
There was some discussion in class regarding the notation. In the literature, Bayesians write <math>f(\mathbf{x}|\mu)</math>, the probability of x given <math>\mu</math> (treating <math>\mu</math> as a random variable), while Frequentists write <math>f(\mathbf{x};\mu)</math>, the density of x parameterized by the fixed quantity <math>\mu</math>. In practice, the two lead to the same computation.<br />
<br />
Our goal is to find the <math>\theta</math> that maximizes <br />
<math>\mathcal{L}(\theta\,) = f(\underline{\mathbf{x}}|\;\theta) = \prod_{i=1}^n f(\mathbf{x_i}|\theta)</math>, where <math>\underline{\mathbf{x}}=\{x_i\}_{i=1}^{n}</math>. (The second equality holds because the data points are iid.)<br />
<br />
In many cases, it is more convenient to work with the natural logarithm of the likelihood. (Recall that the logarithm is monotonic, so it preserves the locations of maxima and minima.)<br />
<math>\ell(\theta)=\ln\mathcal{L}(\theta\,)</math> <br />
<br />
<math>\ell(\theta\,)=\sum_{i=1}^n \ln f(\mathbf{x_i}|\theta)</math><br />
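As a concrete illustration of the log-likelihood (not from the lecture; the data are made up), consider estimating the success probability of a Bernoulli distribution from 0/1 observations, where the analytic MLE is simply the sample mean:

```python
import numpy as np

# made-up Bernoulli observations (e.g., 0/1 class labels); 7 of 10 are 1
x = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])

def log_likelihood(theta):
    """ell(theta) = sum_i ln f(x_i | theta) for a Bernoulli density."""
    return np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

# grid search over candidate parameters (step 0.01)
thetas = np.linspace(0.01, 0.99, 99)
theta_ml = thetas[np.argmax([log_likelihood(t) for t in thetas])]

# the grid maximizer agrees with the analytic MLE, the sample mean 7/10
print(theta_ml, x.mean())
```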
<br />
Applying Maximum Likelihood Estimation to <math>f(y|\mathbf{x})= (\frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})^{y} (\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})^{1-y}</math>, gives<br />
<br />
<math>\mathcal{L}(\mathbf{\beta\,})=\prod_{i=1}^n (\frac{e^{\mathbf{\beta\,}^T \mathbf{x_i}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})^{y_i} (\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})^{1-y_i}</math><br />
<br />
<math>\ell(\mathbf{\beta\,}) = \sum_{i=1}^n \left[ y_i \ln(P(Y=1|X=x_i)) + (1-y_i) \ln(1-P(Y=1|X=x_i))\right]</math><br />
<br />
This is the likelihood function we want to maximize. Note that <math>-\ell(\mathbf{\beta\,})</math> can be interpreted as the cost function we want to minimize. Simplifying, we get:<br />
<br />
<math>\begin{align} {\ell(\mathbf{\beta\,})} & {} = \sum_{i=1}^n \left(y_i ({\mathbf{\beta\,}^T \mathbf{x_i}} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})) + (1-y_i) (\ln{1} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}}))\right) \\[10pt]&{} = \sum_{i=1}^n \left(y_i ({\mathbf{\beta\,}^T \mathbf{x_i}} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})) - (1-y_i) \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})\right) \\[10pt] &{} = \sum_{i=1}^n \left(y_i ({\mathbf{\beta\,}^T \mathbf{x_i}} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})) - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}}) + y_i \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})\right) \\[10pt] &{} = \sum_{i=1}^n \left(y_i {\mathbf{\beta\,}^T \mathbf{x_i}} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})\right) \end{align}</math><br />
<br />
<math>\begin{align} {\frac{\partial \ell}{\partial \mathbf{\beta\,}}}&{} = \sum_{i=1}^n \left(y_i \mathbf{x_i} - \frac{e^{\mathbf{\beta\,}^T \mathbf{x_i}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}} \mathbf{x_i} \right) \\[8pt] & {}= \sum_{i=1}^n \left(y_i \mathbf{x_i} - P(\mathbf{x_i} | \mathbf{\beta\,}) \mathbf{x_i}\right) \end{align}</math><br />
<br />
Now set <math>\frac{\partial \ell}{\partial \mathbf{\beta\,}}</math> equal to 0, and <math> \mathbf{\beta\,} </math> can be numerically solved by Newton's method.<br />
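The simplified log-likelihood and its gradient translate directly into code. The sketch below is an illustration only (the variable names and toy data are my own, not from the notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(beta, X, y):
    """ell(beta) = sum_i [ y_i beta^T x_i - ln(1 + e^{beta^T x_i}) ]"""
    z = X @ beta
    return np.sum(y * z - np.log1p(np.exp(z)))

def score(beta, X, y):
    """Gradient: sum_i (y_i - P(x_i | beta)) x_i = X^T (y - p)."""
    p = sigmoid(X @ beta)
    return X.T @ (y - p)

# tiny made-up data set: 4 points, 2 features plus an intercept column
X = np.array([[0.0, 0.0, 1.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 1.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

beta = np.zeros(3)
g = score(beta, X, y)   # ascent direction at beta = 0
```

At <math>\beta = 0</math> every fitted probability is 0.5, so the gradient points along the feature that actually separates the two classes.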
<br />
====Newton's Method====<br />
<br />
Newton's Method (or the Newton-Raphson method) is a numerical method for finding successively better approximations to the roots of a real-valued function, used when the roots cannot be found analytically. <br />
<br />
The goal is to find <math>\mathbf{x}</math> such that <math>f(\mathbf{x}) = 0 </math>; such points are called the roots of the function f. Starting from an initial guess, x can be improved iteratively using the equation<br />
<math>\mathbf{x_n} = \mathbf{x_{n-1}} - \frac{f(\mathbf{x_{n-1}})}{f'(\mathbf{x_{n-1}})}.\,\!</math><br />
<br />
It takes an initial guess <math>\mathbf{x_0}</math> and steps by <math>\ \frac{f(x_{n-1})}{f'(x_{n-1})}</math> toward a better approximation <math>\mathbf{x_n}</math>. Iterating from the original guess converges to a solution that is sufficiently close to the actual root. Note that this may find only one (possibly local) root, and a function may require multiple initial guesses to find all of its roots.<br />
<br />
=====Matlab Example=====<br />
<br />
Below is the Matlab code to find a root of the function <math>\,y=x^2-2500</math> from the initial guess of <math>\,x=90</math>. The roots of this equation are trivially solved analytically to be <math>\,x=\pm 50</math>. <br />
<br />
x=1:100;<br />
y=x.^2 - 2500; %function to find root of<br />
plot(x,y);<br />
<br />
x_opt=90; %starting guess<br />
x_traversed=[];<br />
y_traversed=[];<br />
error=[];<br />
<br />
for i=1:6,<br />
y_opt=x_opt^2-2500;<br />
y_prime_opt=2*x_opt;<br />
<br />
%save results of each iteration<br />
x_traversed=[x_traversed x_opt];<br />
y_traversed=[y_traversed y_opt];<br />
error=[error abs(y_opt)];<br />
<br />
%update minimum<br />
x_opt=x_opt-(y_opt/y_prime_opt);<br />
end<br />
<br />
hold on;<br />
plot(x_traversed,y_traversed,'r','LineWidth',2);<br />
title('Progressions Towards Root of y=x^2 - 2500');<br />
legend('y=x^2 - 2500','Progression');<br />
xlabel('x');<br />
ylabel('y');<br />
<br />
hold off;<br />
figure();<br />
semilogy(1:6,error);<br />
title('Error vs Iteration');<br />
xlabel('Iteration');<br />
ylabel('Absolute Y Error');<br />
<br />
In this example Newton's method converges to the root <math>\,x=50</math> to within machine precision in only 6 iterations, as can be seen from the plot of the absolute error below.<br />
<br />
[[File:newton_error.png]]<br />
[[File:newton_progression.png]]<br />
<br />
===Advantages/Limitation of Linear Regression ===<br />
<br />
*Linear regression implements a statistical model that, when relationships between the independent variables and the dependent variable are almost linear, shows optimal results.<br />
*Linear regression is often inappropriately used to model non-linear relationships.<br />
*Linear regression is limited to predicting numeric output.<br />
*A lack of explanation about what has been learned can be a problem.<br />
<br />
<br />
<br />
<br />
<br />
===Advantages of Logistic Regression===<br />
<br />
Logistic regression has several advantages over discriminant analysis: <br />
<br />
* It is more robust: the independent variables don't have to be normally distributed, or have equal variance in each group.<br />
* It does not assume a linear relationship between the independent variables (IV) and the dependent variable (DV).<br />
* It may handle nonlinear effects.<br />
* You can add explicit interaction and power terms.<br />
* The DV need not be normally distributed. <br />
* There is no homogeneity of variance assumption. <br />
* Normally distributed error terms are not assumed. <br />
* It does not require that the independent variables be interval. <br />
* It does not require that the independent variables be unbounded.<br />
<br />
===Comparison Between Logistic Regression And Linear Regression===<br />
<br />
Linear regression is a regression where the explanatory variable X and response variable Y are linearly related. Both X and Y can be continuous variables, and for every one unit increase in the explanatory variable, there is a set increase or decrease in the response variable Y. A closed form solution exists for the least squares estimate of <math>\beta</math>.<br />
<br />
Logistic regression is a regression where the explanatory variable X and response variable Y are not linearly related. The response variable provides the probability of occurrence of an event. X can be continuous but Y must be a categorical variable (e.g., can only assume two values, i.e. 0 or 1). For every one unit increase in the explanatory variable, there is a set increase or decrease in the probability of occurrence of the event. No closed form solution exists for the least squares estimate of <math>\beta</math>.<br />
<br />
<br />
In terms of assumptions on the data set: in LDA, we assumed that the probability density function (PDF) of each class and the priors were Gaussian and Bernoulli, respectively. In Logistic Regression, however, we only assume a parametric form for the posterior and ignore the priors. Therefore, we may conclude that Logistic Regression makes fewer assumptions than LDA.<br />
<br />
==Newton-Raphson Method (Lecture: Oct 11, 2011)==<br />
Previously we derived the likelihood function for logistic regression. <br />
<br />
<math>\begin{align} L(\beta\,) = \prod_{i=1}^n \left( (\frac{e^{\mathbf{\beta\,}^T \mathbf{x_i}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})^{y_i}(\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})^{1-y_i} \right) \end{align}</math><br />
<br />
After taking log, we can have:<br />
<br />
<math>\begin{align} \ell(\beta\,) = \sum_{i=1}^n \left( y_i \ln{\frac{e^{\mathbf{\beta\,}^T \mathbf{x_i}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}}} + (1 - y_i) \ln{\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}}} \right) \end{align}</math><br />
<br />
This implies that:<br />
<br />
<math>\begin{align} {\ell(\mathbf{\beta\,})} & {} = \sum_{i=1}^n \left(y_i \left( {\mathbf{\beta\,}^T \mathbf{x_i}} - \ln(1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}) \right) - (1 - y_i)\ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})\right) \end{align}</math><br />
<br />
<math>\begin{align} {\ell(\mathbf{\beta\,})} & {} = \sum_{i=1}^n \left(y_i {\mathbf{\beta\,}^T \mathbf{x_i}} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})\right) \end{align}</math><br />
<br />
Our goal is to find the <math>\beta\,</math> that maximizes <math>{\ell(\mathbf{\beta\,})}</math>, i.e., to solve <math>{\frac{\partial \ell}{\partial \mathbf{\beta\,}}}=0</math>. To do this we use the well-known Newton-Raphson numerical method, an iterative method in which we calculate the first and second derivatives at each iteration.<br /><br />
<br /><br />
<br />
====Newton's Method====<br />
Here is how we usually implement Newton's Method: <math>\mathbf{x_{n+1}} = \mathbf{x_n} - \frac{f(\mathbf{x_n})}{f'(\mathbf{x_n})}.\,\!</math> In our particular case we look for <math>x</math> such that <math>f'(x) = 0</math>, so we implement it as <math>\mathbf{x_{n+1}} = \mathbf{x_n} - \frac{f'(\mathbf{x_n})}{f''(\mathbf{x_n})}.\,\!</math><br /><br />
In practice, the convergence speed depends on <math>|F'(x^*)|</math>, where <math>F(x) = \mathbf{x} - \frac{f(\mathbf{x})}{f'(\mathbf{x})}\,\!</math> and <math>x^*</math> is the root: the smaller <math>|F'(x^*)|</math> is, the faster the convergence.<br />
<br /><br />
<br /><br />
The first derivative is typically called the score vector.<br />
<br />
<math>\begin{align} S(\beta\,) {}= {\frac{\partial \ell}{ \partial \mathbf{\beta\,}}}&{} = \sum_{i=1}^n \left(y_i \mathbf{x_i} - \frac{e^{\mathbf{\beta\,}^T \mathbf{x_i}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}} \mathbf{x_i} \right) \\[8pt] \end{align}</math><br />
<br />
<math>\begin{align} S(\beta\,) {}= {\frac{\partial \ell}{ \partial \mathbf{\beta\,}}}&{} = \sum_{i=1}^n \left(y_i \mathbf{x_i} - P(x_i|\beta) \mathbf{x_i} \right) \\[8pt] \end{align}</math><br />
<br />
where <math>\ P(x_i|\beta) = \frac{e^{\beta^T x_i}}{1+e^{\beta^T x_i}} </math><br />
<br />
The negative of the second derivative is typically called the information matrix.<br />
<br />
<math>\begin{align} I(\beta\,) {}= -{\frac{\partial^2 \ell}{\partial \mathbf {\beta\,} \partial \mathbf{\beta\,}^T}}&{} = \sum_{i=1}^n \left(\mathbf{x_i}\mathbf{x_i}^T (\frac{e^{\mathbf{\beta\,}^T \mathbf{x_i}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})(1 - \frac{e^{\mathbf{\beta\,}^T \mathbf{x_i}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}}) \right) \\[8pt] \end{align}</math><br />
<br />
<math>\begin{align} I(\beta\,) {}= -{\frac{\partial^2 \ell}{\partial \mathbf {\beta\,} \partial \mathbf{\beta\,}^T}}&{} = \sum_{i=1}^n \left(\mathbf{x_i}\mathbf{x_i}^T (\frac{e^{\mathbf{\beta\,}^T \mathbf{x_i}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})(\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}}) \right) \\[8pt] \end{align}</math><br />
<br />
<math>\begin{align} I(\beta\,) {}= -{\frac{\partial^2 \ell}{\partial \mathbf {\beta\,} \partial \mathbf{\beta\,}^T}}&{} = \sum_{i=1}^n \left(\mathbf{x_i}\mathbf{x_i}^T (P(x_i|\beta))(1 - P(x_i|\beta)) \right) \\[8pt] \end{align}</math><br />
<br />
again where <math>\ P(x_i|\beta) = \frac{e^{\beta^T x_i}}{1+e^{\beta^T x_i}} </math><br />
<br />
<math>\, \beta\,^{new} \leftarrow \beta\,^{old}-\frac {f(\beta\,^{old})}{f'(\beta\,^{old})} </math><br />
<br />
We then use the following update formula to calculate successively better estimates of the optimal <math>\beta\,</math>. It typically does not matter much what you use as your initial estimate <math>\beta\,^{(1)}</math>. (However, some improper choices of <math>\beta</math> will cause <math>I</math> to be a singular matrix.)<br />
<br />
<math> \beta\,^{(r+1)} {}= \beta\,^{(r)} + (I(\beta\,^{(r)}))^{-1} S(\beta\,^{(r)} )</math><br />
<br />
====Matrix Notation====<br />
<br />
Let <math>\mathbf{y}</math> be a (n x 1) vector of all class labels. This is called the response in other contexts.<br />
<br />
Let <math>\mathbb{X}</math> be a (n x (d+1)) matrix of all your features. Each row represents a data point. Each column represents a feature/covariate.<br />
<br />
Let <math>\mathbf{p}^{(r)}</math> be a (n x 1) vector with values <math> P(\mathbf{x_i} |\beta\,^{(r)} ) </math><br />
<br />
Let <math>\mathbb{W}^{(r)}</math> be a (n x n) diagonal matrix with <math>\mathbb{W}_{ii}^{(r)} {}= P(\mathbf{x_i} |\beta\,^{(r)} )(1 - P(\mathbf{x_i} |\beta\,^{(r)} ))</math><br />
<br />
The score vector, information matrix and update equation can be rewritten in terms of this new matrix notation, so the first derivative is<br />
<br />
<math>\begin{align} S(\beta\,^{(r)}) {}= {\frac{\partial \ell}{ \partial \mathbf{\beta\,}}}&{} = \mathbb{X}^T(\mathbf{y} - \mathbf{p}^{(r)})\end{align}</math><br />
<br />
And the second derivative is<br />
<br />
<math>\begin{align} I(\beta\,^{(r)}) {}= -{\frac{\partial^{2} \ell}{\partial \mathbf {\beta\,} \partial \mathbf{\beta\,}^T}}&{} = \mathbb{X}^T\mathbb{W}^{(r)}\mathbb{X} \end{align}</math><br />
<br />
Therefore, we can fit the regression problem as follows:<br />
<br />
<math> \beta\,^{(r+1)} {}= \beta\,^{(r)} + (I(\beta\,^{(r)}))^{-1}S(\beta\,^{(r)} ) {}</math><br />
<br />
<math> \beta\,^{(r+1)} {}= \beta\,^{(r)} + (\mathbb{X}^T\mathbb{W}^{(r)}\mathbb{X})^{-1}\mathbb{X}^T(\mathbf{y} - \mathbf{p}^{(r)})</math><br />
<br />
====Iteratively Re-weighted Least Squares====<br />
If we reorganize this updating formula we can see it is really iteratively solving a least squares problem each time with a new weighting.<br />
<br />
<math>\beta\,^{(r+1)} {}= (\mathbb{X}^T\mathbb{W}^{(r)}\mathbb{X})^{-1}(\mathbb{X}^T\mathbb{W}^{(r)}\mathbb{X}\beta\,^{(r)} + \mathbb{X}^T(\mathbf{y} - \mathbf{p}^{(r)}))</math><br />
<br />
<math>\beta\,^{(r+1)} {}= (\mathbb{X}^T\mathbb{W}^{(r)}\mathbb{X})^{-1}\mathbb{X}^T\mathbb{W}^{(r)}\mathbf{z}^{(r)}</math><br />
<br />
where <math> \mathbf{z}^{(r)} = \mathbb{X}\beta\,^{(r)} + (\mathbb{W}^{(r)})^{-1}(\mathbf{y}-\mathbf{p}^{(r)}) </math><br />
<br />
<br />
Recall that linear regression by least squares finds the following minimum: <math>\ \min_{\beta}(y-X \beta)^T(y-X \beta)</math><br />
<br />
Similarly, we can say that <math>\ \beta^{(r+1)}</math> is the solution of a weighted least squares problem in the new space of <math>\ \mathbf{z} </math> (compare the equation for <math>\ \beta^{(r+1)}</math> with the ordinary least squares solution <math>\ {\tilde{\beta}} = (X^TX)^{-1}X^Ty </math>):<br />
<br />
<math>\beta^{(r+1)} \leftarrow \underset{\beta}{\operatorname{arg\,min}}\;(\mathbf{z}-X \beta)^T W (\mathbf{z}-X \beta)</math><br />
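Putting the pieces together, the Newton-Raphson/IRLS fit can be sketched in a few lines of NumPy. This is an illustration only; the function name and the toy data are made up, and a production fit would add a convergence check:

```python
import numpy as np

def fit_logistic_irls(X, y, n_iter=20):
    """Fit logistic regression by Newton-Raphson / IRLS.

    X: (n, d+1) design matrix with a trailing column of ones.
    y: (n,) vector of 0/1 labels.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # p^{(r)}
        W = np.diag(p * (1 - p))              # W^{(r)}
        score = X.T @ (y - p)                 # S(beta^{(r)}) = X^T (y - p)
        info = X.T @ W @ X                    # I(beta^{(r)}) = X^T W X
        beta = beta + np.linalg.solve(info, score)
    return beta

# made-up two-class data with overlapping Gaussian clusters
rng = np.random.default_rng(1)
n = 200
X0 = rng.normal(-1, 1, (n // 2, 2))
X1 = rng.normal(+1, 1, (n // 2, 2))
X = np.hstack([np.vstack([X0, X1]), np.ones((n, 1))])  # intercept column
y = np.concatenate([np.zeros(n // 2), np.ones(n // 2)])

beta = fit_logistic_irls(X, y)
acc = ((X @ beta > 0) == y).mean()
```

Note that if the classes are perfectly separable the likelihood has no finite maximizer and the iterations diverge, which is the singular-information caveat mentioned above.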
<br />
====Fisher Scoring Method==== <br />
<br />
Fisher Scoring is a method very similar to Newton-Raphson. It uses the expected information matrix instead of the observed information matrix. This distinction simplifies the problem, and in particular the computational complexity. To learn more about this method and logistic regression in general, you can take Stat431/831 at the University of Waterloo.<br />
<br />
===Multi-class Logistic Regression===<br />
<br />
In multi-class logistic regression we have ''K'' classes. Taking class ''K'' as the reference class, for any class ''l''<br />
<br />
<math>\frac{P(Y=l|X=x)}{P(Y=K|X=x)} = e^{\beta_l^T x}</math><br /><br />
(this follows from <br />
<math>f_1(x)= (\frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})</math> and <math>f_2(x)= (\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})</math> )<br />
<br />
We call <math>log(\frac{P(Y=l|X=x)}{P(Y=k|X=x)}) = (\beta_l-\beta_k)^T x</math>, the log-ratio of the posterior probabilities, the logit transformation. The decision boundary between classes ''l'' and ''k'' is the set of points where this logit transformation is 0.<br />
<br />
For each class from 1 to K-1 we then have:<br />
<br />
<math>log(\frac{P(Y=1|X=x)}{P(Y=K|X=x)}) = \beta_1^T x</math><br />
<br />
<math>log(\frac{P(Y=2|X=x)}{P(Y=K|X=x)}) = \beta_2^T x</math><br />
<br />
<math>\vdots</math><br />
<br />
<math>log(\frac{P(Y=K-1|X=x)}{P(Y=K|X=x)}) = \beta_{K-1}^T x</math><br />
<br />
Note that choosing ''Y=K'' is arbitrary and any other choice is equally valid.<br />
<br />
Based on the above the posterior probabilities are given by: <math>P(Y=k|X=x) = \frac{e^{\beta_k^T x}}{1 + \sum_{i=1}^{K-1}{e^{\beta_i^T x}}}\;\;for \; k=1,\ldots, K-1</math> <br />
<br />
<math> P(Y=K|X=x)=\frac{1}{1+\sum_{i=1}^{K-1}{e^{\beta_i^T x}}} </math><br />
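The posterior formulas above can be checked numerically. The sketch below is an illustration only (the function name and the numbers are made up):

```python
import numpy as np

def multiclass_posteriors(betas, x):
    """Posteriors for K classes given the K-1 coefficient vectors.

    betas: (K-1, d) array of beta_k; class K is the reference class.
    x: (d,) input vector.
    Returns a length-K array of P(Y=k | X=x).
    """
    scores = np.exp(betas @ x)          # e^{beta_k^T x}, k = 1..K-1
    denom = 1.0 + scores.sum()          # shared denominator
    return np.append(scores / denom, 1.0 / denom)  # last entry is class K

# made-up example with K = 3 classes and d = 2 features
betas = np.array([[1.0, -0.5],
                  [0.2, 0.3]])
x = np.array([0.4, 1.0])
p = multiclass_posteriors(betas, x)
print(p.sum())   # the K posteriors sum to 1
```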
<br />
===Logistic Regression Vs. Linear Discriminant Analysis (LDA)===<br />
<br />
Logistic Regression Model and Linear Discriminant Analysis (LDA) are widely used for classification. Both models build linear boundaries to classify different groups. Also, the categorical outcome variables (i.e. the dependent variables) must be mutually exclusive. <br />
<br />
LDA uses more parameters than logistic regression, as quantified below.<br />
<br />
However, these two models differ in their basic approach. While Logistic Regression is more relaxed and flexible in its assumptions, LDA assumes that its explanatory variables are normally distributed, linearly related and have equal covariance matrices for each class. Therefore, it can be expected that LDA is more appropriate if the normality assumptions and equal covariance assumption are fulfilled in its explanatory variables. But in all other situations Logistic Regression should be appropriate. <br />
<br />
<br />
Also, the total number of parameters to compute is different for Logistic Regression and LDA. If the explanatory variables have d dimensions and there are two classes to categorize, we need to estimate <math>\ d+1</math> parameters in Logistic Regression (all elements of the d by 1 <math>\ \beta </math> vector plus the scalar <math>\ \beta_0 </math>), so the number of parameters grows linearly with the dimension, while we need to estimate <math>2d+\frac{d(d+1)}{2}+2</math> parameters in LDA (two mean vectors for the Gaussians, the shared d by d symmetric covariance matrix, and two priors for the two classes), so the number of parameters grows quadratically with the dimension. <br />
<br />
<br />
Note that the number of parameters also corresponds to the minimum number of observations needed to compute the coefficients of each function. Techniques do exist, though, for handling high dimensional problems where the number of parameters exceeds the number of observations. Logistic Regression can be modified using shrinkage methods to deal with the problem of having fewer observations than parameters. When maximizing the log likelihood, we can add a <math>-\frac{\lambda}{2}\sum^{K}_{k=1}\|\beta_k\|_{2}^{2}</math> penalization term, where K is the number of classes. The resulting optimization problem is convex and can be solved using the Newton-Raphson method, as given in Zhu and Hastie (2004). LDA involves the inversion of a d x d covariance matrix. When d is bigger than n (the number of observations) this matrix has rank n < d and is thus singular. When this is the case, we can either use the pseudo-inverse or perform regularized discriminant analysis (RDA), which solves this problem. In RDA, we define a new covariance matrix <math>\, \Sigma(\gamma) = \gamma\Sigma + (1 - \gamma)diag(\Sigma)</math> with <math>\gamma \in [0,1]</math>. Cross validation can be used to choose the best <math>\, \gamma</math>. More details on RDA can be found in Guo et al. (2006).<br />
<br />
<br />
Because the Logistic Regression model has the form <math>log\frac{f_1(x)}{f_0(x)} = \beta^T x</math>, we can clearly see the role of each input variable in explaining the outcome. This is one advantage that Logistic Regression has over other classification methods, and is why it is so popular in data analysis. <br />
<br />
<br />
In terms of the performance speed, since LDA is non-iterative, unlike Logistic Regression which uses the iterative Newton-Raphson method, LDA can be expected to be faster than Logistic Regression.<br />
<br />
===Example===<br />
<br />
(Not discussed in class.) One application of logistic regression that has recently been used is predicting the winner of NFL games. Previous predictors, like Yards Per Carry (YPC), were used to build probability models for games. Now, the Success Rate (SR), defined as the percentage of runs in which a team’s point expectancy has improved, is shown to be a better predictor of a team's performance. SR is based on down, distance and yard line and is less susceptible to rare breakaway plays that can be considered outliers. More information can be found at [http://fifthdown.blogs.nytimes.com/2011/09/29/n-f-l-game-probabilities-are-back-with-one-adjustment/].<br />
<br />
== Perceptron ==<br />
<br />
[[Image:Perceptron1.png|right|thumb|300px|Simple perceptron]]<br />
[[Image:Perceptron2.png|right|thumb|300px|Simple perceptron where <math>\beta_0</math> is defined as 1]]<br />
<br />
The perceptron is a simple, yet effective, linear separator classifier, and is the building block for neural networks. It was invented by Rosenblatt in 1957 at Cornell Labs, and first mentioned in the paper "The Perceptron - a perceiving and recognizing automaton". The perceptron is used on linearly separable data sets.<br />
It computes a linear combination of the input features and returns the sign of the result. <br />
<br />
For a 2-class problem and a set of inputs with ''d'' features, a perceptron computes a weighted sum and classifies the input using the sign of the result (i.e. it uses a step function as its [http://en.wikipedia.org/wiki/Activation_function activation function]). The figures on the right give an example of a perceptron. In these examples, <math>\ x^i</math> is the ''i''-th feature of a sample and <math>\ \beta_i</math> is the ''i''-th weight. <math>\beta_0</math> is defined as the bias. The bias alters the position of the decision boundary between the 2 classes. From a geometrical point of view, the perceptron assigns label "1" to elements on one side of the vector <math>\ \beta</math> and label "-1" to elements on the other side, where <math>\ \beta</math> is the vector of the <math>\ \beta_i</math>s.<br />
<br />
Perceptrons are generally trained using [http://en.wikipedia.org/wiki/Gradient_descent gradient descent]. This type of learning can have two side effects:<br />
* If the data sets are well separated, training can converge to any one of multiple valid solutions.<br />
* If the data sets are not linearly separable, the learning algorithm will never converge.<br />
<br />
Perceptrons are the simplest kind of a feedforward neural network. A perceptron is the building block for other neural networks such as '''Multi-Layer Perceptron (MLP)''' which uses multiple layers of perceptrons with nonlinear activation functions so that it can classify data that is not linearly separable.<br />
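A minimal sketch of the perceptron learning rule (an illustration, not from the notes; the function name and toy data are made up):

```python
import numpy as np

def train_perceptron(X, y, lr=1.0, max_epochs=100):
    """Perceptron learning rule for labels y in {-1, +1}.

    X: (n, d) inputs; a bias input of 1 is appended internally.
    Stops when an epoch makes no mistakes (data must be separable).
    """
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append bias input
    beta = np.zeros(Xb.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(Xb, y):
            if yi * (xi @ beta) <= 0:               # misclassified point
                beta += lr * yi * xi                # nudge the boundary
                mistakes += 1
        if mistakes == 0:
            break
    return beta

# linearly separable toy data in 2D
X = np.array([[2.0, 2.0], [3.0, 1.5], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
beta = train_perceptron(X, y)
pred = np.sign(np.hstack([X, np.ones((4, 1))]) @ beta)
```

On separable data such as this, the update rule is guaranteed to converge in finitely many steps; on non-separable data the loop would only stop because of `max_epochs`.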
<br />
=== History of Perceptrons and Other Neural Models ===<br />
One of the first perceptron-like models is the '''"McCulloch-Pitts Neuron"''' model developed by McCulloch and Pitts in the 1940's <ref> W. Pitts and W. S. McCulloch, "How we know universals: the perception of auditory and visual forms," ''Bulletin of Mathematical Biophysics'', 1947.</ref>. It uses a weighted sum of the inputs that is fed through an activation function, much like the perceptron. However, unlike the perceptron, the weights in the "McCulloch-Pitts Neuron" model are not adjustable, so the "McCulloch-Pitts Neuron" is unable to perform any learning based on the input data.<br />
<br />
As stated in the introduction of the [[#Perceptron | perceptron]] section, the '''Perceptron''' was developed by Rosenblatt around 1960. Around the same time, the '''Adaptive Linear Neuron (ADALINE)''' was developed by Widrow <ref name="Widrow"> B. Widrow, "Generalization and information storage in networks of adaline 'neurons'," ''Self Organizing Systems'', 1959.</ref>. The ADALINE differs from the standard perceptron in that it uses the weighted sum (the net) to adjust the weights in the learning phase, whereas the standard perceptron uses the output (i.e. the net after it has passed through the activation function) to adjust its weights. <br />
<br />
Since both the perceptron and ADALINE can only handle data that is linearly separable, '''Multiple ADALINE (MADALINE)''' was introduced <ref name="Widrow"/>. MADALINE is a two-layer network that processes multiple inputs, with each layer containing a number of ADALINE units. The lack of an appropriate learning algorithm prevented more layers of units from being cascaded at the time, and interest in "neural networks" receded until the 1980s, when the backpropagation algorithm was applied to neural networks and it became possible to implement the '''Multi-Layer Perceptron (MLP)'''.<br />
<br />
Many important advances have been boosted by the use of inexpensive computer emulations. Following an initial period of enthusiasm, the field survived a period of frustration and disrepute. During this period, when funding and professional support were minimal, important advances were made by relatively few researchers. These pioneers were able to develop convincing technology which surpassed the limitations identified by Minsky and Papert, who had published a book in 1969 that summed up a general feeling of frustration with neural networks among researchers, and which was accepted by most without further analysis. Currently, the neural network field enjoys a resurgence of interest and a corresponding increase in funding.<ref><br />
http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html#Historical background<br />
</ref><br />
<br />
== Perceptron Learning Algorithm (Lecture: Oct. 13, 2011) ==<br />
Like all of the learning methods we have seen, learning in a perceptron model is accomplished by minimizing a cost (or error) function, <math>\phi(\boldsymbol{\beta}, \beta_0)</math>. In the perceptron case, the output is <math>sign(\sum_{i=0}^d \beta_i x^{(i)})</math>, and we define the cost function <math>\phi(\boldsymbol{\beta}, \beta_0)</math> as the sum of the distances between all misclassified points and the hyper-plane (the decision boundary). To minimize this cost function, we need to estimate <math>\boldsymbol{\beta}, \beta_0</math>. <br />
<br />
<math>\min_{\beta,\beta_0} \phi(\boldsymbol{\beta}, \beta_0)</math> = {sum of the distances of all misclassified points to the decision boundary}<br />
<br />
The logic is as follows: <br />
<br />
[[File:hyperplane.png|thumb|250px|right| Distance between the point <math>\ x </math> and the decision boundary hyperplane <math>\ L </math> (black line). Note that the vector <math>\ \beta </math> is orthogonal to the decision boundary hyperplane and that points <math>\ x_0, x_1, x_2 </math> are arbitrary points on the decision boundary hyperplane. ]]<br />
<br />
'''1)''' A hyper-plane <math>\,L</math> can be defined as<br />
<br />
<math>\, L=\{x: f(x)=\beta^Tx+\beta_0=0\},</math><br />
<br />
<br />
For any two arbitrary points <math>\,x_1 </math> and <math>\,x_2 </math> on <math>\, L</math>, we have<br />
<br />
<math>\,\beta^Tx_1+\beta_0=0</math>,<br />
<br />
<math>\,\beta^Tx_2+\beta_0=0</math>,<br />
<br />
such that <br />
<br />
<math>\,\beta^T(x_1-x_2)=0</math>.<br />
<br />
Therefore, <math>\,\beta</math> is orthogonal to the hyper-plane and it is the normal vector.<br />
<br />
<br />
'''2)''' For any point <math>\,x_0</math> in <math>\ L,</math> <math>\,\;\;\beta^Tx_0+\beta_0=0</math>, which means <math>\, \beta^Tx_0=-\beta_0</math>.<br />
<br />
<br />
'''3)''' We set <math>\,\beta^*=\frac{\beta}{||\beta||}</math> as the unit normal vector of the hyper-plane <math>\, L</math>; for simplicity we call <math>\,\beta^*</math> the normal vector. The distance of point <math>\,x</math> to <math>\ L</math> is given by<br />
<br />
<math>\,\beta^{*T}(x-x_0)=\beta^{*T}x-\beta^{*T}x_0<br />
=\frac{\beta^Tx}{||\beta||}+\frac{\beta_0}{||\beta||} <br />
=\frac{(\beta^Tx+\beta_0)}{||\beta||}</math><br />
<br />
where <math>\,x_0</math> is any point on <math>\ L</math>. Hence, <math>\,\beta^Tx+\beta_0</math> is proportional to the distance of the point <math>\,x</math> to the hyper-plane <math>\, L</math>.<br />
<br />
<br />
'''4)''' The distance from a misclassified data point <math>\,x_i</math> to the hyper-plane <math>\, L </math> is<br />
<br />
<math>\,d_i = -y_i(\boldsymbol{\beta}^Tx_i+\beta_0)</math> <br />
<br />
where <math>\,y_i</math> is the target label; a point is misclassified when <math>\,y_i=1</math> but <math>\boldsymbol{\beta}^Tx_i+\beta_0<0</math>, or when <math>\,y_i=-1</math> but <math>\boldsymbol{\beta}^Tx_i+\beta_0>0</math>.<br />
<br />
When a data point is misclassified, <math>\boldsymbol{\beta}^Tx_i+\beta_0</math> has the opposite sign of <math>\,y_i</math>; since a distance must be positive, we add the negative sign in front so that <math>\,d_i</math> is positive.<br />
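As a sanity check, the distance formula can be evaluated directly in a few lines of Python (the hyperplane parameters and sample points below are illustrative assumptions, not from the lecture):<br />

```python
import numpy as np

# Hypothetical decision boundary: beta^T x + beta_0 = 0
beta = np.array([3.0, 4.0])   # normal vector of the hyperplane
beta_0 = -5.0                 # bias / offset term

X = np.array([[3.0, 1.0],     # sample points, one per row
              [0.0, 0.0],
              [1.0, 2.0]])
y = np.array([1, -1, -1])     # true labels

# Signed distance of each point to L: (beta^T x + beta_0) / ||beta||
signed = (X @ beta + beta_0) / np.linalg.norm(beta)

# A point is misclassified when sign(beta^T x + beta_0) disagrees with
# its label, i.e. when y_i * (beta^T x_i + beta_0) < 0; its (positive)
# distance to the boundary is then d_i = -y_i * (signed distance).
misclassified = y * (X @ beta + beta_0) < 0
distances = -y[misclassified] * signed[misclassified]
```

Here only the third point is misclassified, at distance 1.2 from the boundary.<br />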
<br />
=== Perceptron Learning using Gradient Descent ===<br />
<br />
Gradient descent is an optimization method that finds the minimum of an objective function by incrementally updating its parameters in the direction of the negative gradient of the function. That is, it finds the steepest slope in the ''d''-dimensional space at a given point and descends in the direction of the negative slope. Note that unless the error function is convex, it is possible to get stuck in a local minimum.<br />
In our case, the objective function to be minimized is the classification error and the parameters of this function are the weights associated with the inputs, <math>\beta</math>. The gradient descent algorithm updates the weights as follows:<br />
<br />
<math>\beta^{\mathrm{new}} \leftarrow \beta^{\mathrm{old}} - \rho \frac{\partial Err}{\partial \beta}</math><br />
<br />
<math>\rho </math> is called the ''learning rate''.<br /><br />
The learning rate <math> \rho </math> determines the step size taken toward <math>\min_{\beta,\beta_0} \phi(\boldsymbol{\beta}, \beta_0) </math>: the larger <math> \rho </math> is, the larger the step size. Typically, <math>\rho \in [0.1, 0.3]</math>.<br />
<br />
The classification error is defined as the sum of the distances of the misclassified observations to the decision boundary.<br />
<br />
<br />
To minimize the cost function <math>\phi(\boldsymbol{\beta}, \beta_0) = -\sum\limits_{i\in M} y_i(\boldsymbol{\beta}^Tx_i+\beta_0)</math> where <math>\ M=\{\text {all points that are misclassified}\}</math> <br><br />
<math>\cfrac{\partial \phi}{\partial \boldsymbol{\beta}} = - \sum\limits_{i\in M} y_i x_i </math> and <math> \cfrac{\partial \phi}{\partial \beta_0} = -\sum\limits_{i \in M} y_i</math><br />
<br />
Therefore, the gradient is<br />
<math>\nabla \phi(\boldsymbol{\beta},\beta_0)<br />
= \left( \begin{array}{c} -\displaystyle\sum_{i \in M}y_{i}x_i \\ <br />
-\displaystyle\sum_{i \in M}y_{i} \end{array} \right)</math><br />
<br />
<br />
<br />
Applying the gradient descent algorithm, for each misclassified point <math>\,x_i</math> we update<br />
<math>\begin{pmatrix}<br />
\boldsymbol{\beta}^{\mathrm{new}}\\<br />
\beta_0^{\mathrm{new}}<br />
\end{pmatrix}<br />
=<br />
\begin{pmatrix}<br />
\boldsymbol{\beta}^{\mathrm{old}}\\<br />
\beta_0^{\mathrm{old}}<br />
\end{pmatrix}<br />
+ \rho<br />
\begin{pmatrix}<br />
y_i x_i\\<br />
y_i<br />
\end{pmatrix}</math><br />
<br />
<br />
If the data is linearly-separable, the solution is theoretically guaranteed to converge to a separating hyperplane in a finite number of iterations. In this situation the number of iterations depends on the learning rate and the margin. However, if the data is not linearly separable there is no guarantee that the algorithm converges. <br />
<br />
The iteration starts from some arbitrary initial values<br />
<br />
<math>\begin{pmatrix}<br />
\beta^0\\<br />
\beta_0^0<br />
\end{pmatrix}</math><br />
<br />
Note that we consider the offset term <math>\,\beta_0</math> separately from <math>\ \beta</math> to distinguish this formulation from those in which the direction of the hyperplane (<math>\ \beta</math>) has been considered.<br />
<br />
A major concern about gradient descent is that it may get trapped in local optimal solutions. Many works such as [http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=00298667 this paper] by ''Cetin et al.'' and [http://indian.cp.eng.chula.ac.th/cpdb/pdf/research/fullpaper/847.pdf this paper] by ''Atakulreka et al.'' have been done to tackle this issue.<br />
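The update rule above can be turned into a short training loop; the following Python sketch (with assumed toy data and a simple stopping rule) applies the stochastic update to each misclassified point in turn:<br />

```python
import numpy as np

def train_perceptron(X, y, rho=0.1, max_epochs=100):
    """Minimize phi(beta, beta_0) = -sum_{i in M} y_i (beta^T x_i + beta_0)
    by updating (beta, beta_0) for each misclassified point."""
    beta = np.zeros(X.shape[1])
    beta_0 = 0.0
    for _ in range(max_epochs):
        errors = 0
        for x_i, y_i in zip(X, y):
            # Misclassified (or on the boundary) when y_i (beta^T x_i + beta_0) <= 0
            if y_i * (x_i @ beta + beta_0) <= 0:
                beta = beta + rho * y_i * x_i   # beta^new = beta^old + rho y_i x_i
                beta_0 = beta_0 + rho * y_i     # beta_0^new = beta_0^old + rho y_i
                errors += 1
        if errors == 0:   # converged: every point classified correctly
            break
    return beta, beta_0

# Tiny linearly separable example (assumed data)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1, 1, -1, -1])
beta, beta_0 = train_perceptron(X, y)
```

Because the toy data is linearly separable, the loop terminates with a hyperplane that classifies every point correctly; on non-separable data it would simply stop after <code>max_epochs</code> passes.<br />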
<br />
<br />
'''Features'''<br />
* A Perceptron can only discriminate between two classes at a time.<br />
* When data is (linearly) separable, there are an infinite number of solutions depending on the starting point.<br />
* Even though convergence to a solution is guaranteed if the solution exists, the finite number of steps until convergence can be very large.<br />
* The smaller the gap between the two classes, the longer the time of convergence.<br />
* When the data is not separable, the algorithm will not converge (it should be stopped after N steps).<br />
* A learning rate that is too high will make the perceptron periodically oscillate around the solution unless additional steps are taken.<br />
* The perceptron computes a linear combination of the input features and returns the sign of the result; such models were called "perceptrons" in the engineering literature of the late 1950s.<br />
* Learning rate affects the accuracy of the solution and the number of iterations directly.<br />
<br />
<br />
'''Separability and convergence'''<br />
<br />
The training set <math>\,D</math> is said to be linearly separable if there exist a positive constant <math>\,\gamma</math> and a weight vector <math>\,\beta</math> such that <math>\,(\beta^Tx_i+\beta_0)y_i>\gamma </math> for all <math>\,1 \leq i \leq n</math>. That is, if <math>\,\beta</math> is the weight vector of the perceptron and <math>\,y_i</math> is the true label of <math>\,x_i</math>, then the signed distance of <math>\,x_i</math> from the decision boundary is greater than a positive constant <math>\,\gamma</math> for every <math>\,(x_i, y_i)\in D</math>.<br />
<br />
<br />
Novikoff (1962) proved that the perceptron algorithm converges after a finite number of iterations if the data set is linearly separable. The idea of the proof is that the weight vector is always adjusted by a bounded amount in a direction that it has a negative dot product with, and thus can be bounded above by <math>O(\sqrt{t})</math>, where t is the number of changes to the weight vector. But it can also be bounded below by <math>\, O(t)</math>, because if there exists an (unknown) satisfactory weight vector, then every change makes progress in this (unknown) direction by a positive amount that depends only on the input vector. This can be used to show that the number t of updates to the weight vector is bounded by <math> (\frac{2R}{\gamma} )^2</math>, where R is the maximum norm of an input vector.<ref>http://en.wikipedia.org/wiki/Perceptron</ref><br />
<br />
=== Choosing a Proper Learning Rate ===<br />
[[File:Learning_rate.jpg|500px|thumb|centre|Choosing different learning rates affects the performance of the gradient descent optimization algorithm.]]<br />
<br />
The choice of learning rate affects the final result of the gradient descent algorithm. If the learning rate is too small, the algorithm takes very long to converge, which is a problem in situations where time is an important factor. If the learning rate is chosen to be too large, the optimal point can be overshot and the algorithm may never converge. In fact, if the step size is larger than twice the reciprocal of the largest eigenvalue of the second derivative matrix (Hessian) of the cost function, gradient steps will go upward instead of downward. <br />
However, the step size is not the only factor that can cause these kinds of situations: even with the same learning rate, different initial values can lead the algorithm to different results. In general, some prior knowledge can help in the choice of initial values and learning rate.<br />
<br />
There are different methods of choosing the step size in a gradient descent optimization problem. The most common method is choosing a fixed learning rate and finding a proper value for it by trial and error; this is certainly not the most sophisticated method, but it is the easiest. <br />
The learning rate can also be adaptive, i.e. its value can differ at each step of the algorithm. This is an especially helpful approach when dealing with on-line training and non-stationary environments (i.e. when data characteristics vary over time), in which case the learning rate has to be adapted at each step of the learning algorithm. Different approaches and algorithms for learning rate adaptation can be found in <ref><br />
V P Plagianakos, G D Magoulas, and M N Vrahatis, Advances in convex analysis and global optimization Pythagorion 2000 (2001), Volume: 54, Publisher: Kluwer Acad. Publ., Pages: 433-444.<br />
</ref>.<br />
<br />
The learning rate leading to a local error minimum in the error function in one learning step is optimal. <ref>[Duda, Richard O., Hart, Peter E., Stork, David G. "Pattern Classification". Second Edition. John Wiley & Sons, 2001.]</ref><br />
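The effect of the step size is easy to see on a one-dimensional quadratic, where the Hessian is a single constant. The following sketch (an illustrative toy example, not from the lecture) shows convergence for small steps and divergence once the step size exceeds 2 divided by the Hessian eigenvalue:<br />

```python
def gradient_descent(rho, x0=1.0, steps=25):
    """Minimize f(x) = x^2 (gradient 2x, Hessian 2) starting from x0."""
    x = x0
    for _ in range(steps):
        x = x - rho * 2 * x   # x_new = x_old - rho * f'(x_old) = x_old * (1 - 2*rho)
    return x

# The Hessian eigenvalue is 2, so the iteration diverges once rho > 2/2 = 1,
# since then |1 - 2*rho| > 1 and |x| grows at every step.
small = gradient_descent(rho=0.1)   # converges slowly toward 0
good = gradient_descent(rho=0.4)    # converges quickly toward 0
large = gradient_descent(rho=1.1)   # diverges
```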
<br />
=== Application of Perceptron: Branch Predictor ===<br />
<br />
The perceptron can be used for both online and batch learning. Online learning tasks take place in a sequence of trials. In each trial, the learner is given an instance and is asked to use its current knowledge to predict a label for it. In online learning, the true label of the instance is revealed to the learner after each prediction, and the learner can use this feedback to improve its hypothesis for future trials.<br />
<br />
Instruction pipelining is a technique to increase throughput in modern microprocessor architectures. A microprocessor instruction can be broken into several independent steps, so in a single CPU clock cycle, several instructions at different stages can be executed at the same time. However, a problem arises with a branch, e.g. an if-else statement: it is not known whether the instructions inside the if- or else-branch will be executed until the condition is evaluated, which stalls the pipeline.<br />
<br />
A branch predictor is used to address this problem. Using a predictor, the pipelined processor predicts the execution path and speculatively executes instructions in the branch. Neural networks are a good technique for prediction; however, they are expensive for microprocessor architectures. One study investigated the use of the perceptron, which is less expensive and simpler to implement, as the branch predictor. The inputs are the history of binary outcomes of the executed branches, and the output of the predictor is whether a particular branch will be taken. Every time a branch is executed and its true outcome is known, the outcome can be used to train the predictor. The experiments showed that with a 4 KB hardware budget, a global perceptron predictor had a misprediction rate of 1.94%, a superior accuracy. <ref>Daniel A. Jimenez, Calvin Lin, "Neural Methods for Dynamic Branch Prediction", ACM Transactions on Computer Systems, 2002</ref><br />
<br />
== Feed-Forward Neural Networks ==<br />
<br />
* The term 'neural networks' is used because historically, it was used to describe the processes of the brain (e.g. synapses).<br />
<br />
* A neural network is a multistage regression model which is typically represented by a network diagram (see right).<br />
[[Image:Feed-Forward_neural_network.png|right|thumb|300px|Feed Forward Neural Network]]<br />
<br />
* The feedforward neural network was the first and arguably simplest type of artificial neural network devised. In this network, the information moves in only one direction, forward, from the input nodes, through the hidden nodes (if any) and to the output nodes. There are no cycles or loops in the network.<ref>http://en.wikipedia.org/wiki/Feedforward_neural_network</ref><br />
<br />
* For regression, typically k = 1 (the number of nodes in the last layer), there is only one output unit <math>y_1</math> at the end.<br />
<br />
* For c-class classification, there are typically c units at the end, with the c-th unit modelling the probability of class c; each <math>y_c</math> is coded as a 0-1 variable for the c-th class.<br />
<br />
* Neural networks are known as ''universal approximators'', where a two-layer feed-forward neural network can approximate any continuous function to an arbitrary accuracy (assuming sufficient hidden nodes exist and that the necessary parameters for the neural network can be found) <ref name="CMBishop">C. M. Bishop, ''Pattern Recognition and Machine Learning''. Springer, 2006</ref>. It should be noted that fitting training data to a very high accuracy may lead to ''overfitting'', which is discussed later in this course.<br />
<br />
* Perceptrons are often used as building blocks in feed-forward neural networks: a feed-forward neural network can be regarded as a system of interconnected perceptrons arranged in one or more hidden layers. This composition is what allows the network to handle multiple classes and data that is not linearly separable.<br />
<br />
=== Backpropagation (Finding Optimal Weights) === <br />
There are many algorithms for calculating the weights in a feed-forward neural network. One of the most used approaches is the backpropagation algorithm. The application of the backpropagation algorithm for neural networks was popularized in the 1980's by researchers like Rumelhart, Hinton and McClelland (even though the backpropagation algorithm had existed before then). <ref>S. Seung, "Multilayer perceptrons and backpropagation learning" class notes for 9.641J, Department of Brain & Cognitive Sciences, MIT, 2002. Available: [http://hebb.mit.edu/courses/9.641/2002/lectures/lecture04.pdf] </ref><br />
<br />
As the learning part of the network (the first part being feed-forward), backpropagation consists of "presenting an input pattern and changing the network parameters to bring the actual outputs closer to the desired teaching or target values." It is one of the "simplest, most general methods for the supervised training of multilayer neural networks." (pp. 288-289) <ref>[Duda, Richard O., Hart, Peter E., Stork, David G. "Pattern Classification". Second Edition. John Wiley & Sons, 2001.]</ref><br />
<br />
For the backpropagation algorithm, we consider three hidden layers of nodes<br />
<br />
Refer to figure from October 18th lecture where <math>\ l</math> represents the column of nodes in the first column, <br><br />
<math>\ i</math> represents the column of nodes in the second column, and <br><br />
<math>\ k</math> represents the column of nodes in the third column. <br><br />
<br />
We want the output of the feed forward neural network <math>\hat{y}</math> to be as close to the known target value <math>\ y </math> as possible (i.e. we want to minimize the distance between <math>\ y </math> and <math>\hat{y}</math>). Mathematically, we would write it as: <br />
Minimize <math>(\left| y- \hat{y}\right|)^2</math><br />
<br />
Instead of the sign function, which has no derivative, we use the so-called logistic function (a smoothed form of the sign function):<br />
<br />
<math> \sigma(a)=\frac{1}{1+e^{-a}} </math><br />
<br />
<br />
<blockquote> "Notice that if σ is the identity function, then the entire model collapses to a linear model in the inputs. Hence a neural network can be thought of as a nonlinear generalization of the linear model, both for regression and classification." <ref>Friedman, J., Hastie, T. and Tibshirani, R. (2008) “The Elements of Statistical Learning”, 2nd ed, Springer.</ref> </blockquote> <br />
<br />
<br />
The ''logistic function'' <math> \sigma</math> is a common [http://en.wikipedia.org/wiki/Logistic_function sigmoid curve]. It can model the S-curve of growth of some population: the initial stage of growth is approximately exponential; then, as saturation begins, the growth slows; and at maturity, growth stops. <br />
<br />
<br />
To solve the optimization problem, we take the derivative with respect to weight <math>u_{il}</math>: <br><br />
<math>\cfrac{\partial \left|y- \hat{y}\right|^2}{\partial u_{il}} = \cfrac{\partial \left|y- \hat{y}\right|^2}{\partial a_i} \cdot \cfrac{\partial a_i}{\partial u_{il}}</math> by the chain rule, so <br><br />
<math>\cfrac{\partial \left|y- \hat{y}\right|^2}{\partial u_{il}} = \delta_i \cdot z_l </math> <br />
<br />
where <math> \delta_i = \cfrac{\partial \left|y- \hat{y}\right|^2}{\partial a_i} </math> will be computed recursively, and the network quantities are<br />
<br />
<math>\ a_i=\sum_{l}z_lu_{il}</math> <br />
<br />
<math>\ z_i=\sigma(a_i)</math><br />
<br />
<math>\ a_j=\sum_{i}z_iu_{ji}</math><br><br />
<br />
== Backpropagation Continued (Lecture: Oct. 18, 2011) ==<br />
[[File:Backprop.png|300px|thumb|right|Nodes from three hidden layers within the neural network are considered for the backpropagation algorithm. Each node has been divided into the weighted sum of the inputs <math>\ a </math> and the output of the activation function <math>\ z </math>. The weights between the nodes are denoted by <math>\ u </math>.]]<br />
<br />
From the figure on the right, it can be seen that each input (<math>\ a </math>) can be expressed as a weighted sum of the outputs of the previous layer's nodes, and each output (<math>\ z </math>) is the activation function applied to the input, as follows:<br />
<br />
<math>\ a_i = \sum_l z_l u_{il} </math><br />
<br />
<math>\ z_i = \sigma(a_i) </math><br />
<br />
<br />
The goal is to optimize the weights to reduce the L2-norm between the target output values <math>\ y </math> (i.e. the correct labels) and the actual output of the neural network <math>\ \hat{y} </math>:<br />
<br />
<math>\left(y - \hat{y}\right)^2</math><br />
<br />
Since the L2-norm is differentiable, the optimization problem can be tackled by differentiating <math>\left(y - \hat{y}\right)^2</math> with respect to each weight in the hidden layers. By using the chain rule we get:<br />
<br />
<math><br />
\cfrac{\partial \left(y - \hat{y}\right)^2}{\partial u_{il}}<br />
= \cfrac{\partial \left(y - \hat{y}\right)^2}{\partial a_i}\cdot<br />
\cfrac{\partial a_i}{\partial u_{il}} = \delta_{i}z_l<br />
</math><br />
<br />
where <math>\ \delta_i = \cfrac{\partial \left(y - \hat{y}\right)^2}{\partial a_i} </math><br />
<br />
The above equation essentially shows the effect of changes in the input <math>\ a_i </math> on the overall output <math>\ \hat{y} </math> as well as the effect of changes in the weights <math>\ u_{il} </math> on the input <math>\ a_i </math>. In the above equation, <math>\ z_l </math> is a known value (i.e. it can be calculated directly), whereas <math>\ \delta_i </math> is unknown but can be expressed as a recursive definition in terms of <math>\ \delta_j</math>:<br />
<br />
<math>\delta_i = \cfrac{\partial (y - \hat{y})^2}{\partial a_i} = \sum_{j} \cfrac{\partial \left(y - \hat{y}\right)^2}{\partial a_j}\cdot \cfrac{\partial a_j}{\partial a_i} </math><br />
<br />
<math>\delta_i = \sum_{j}\delta_j\cdot\cfrac{\partial a_j}{\partial z_i}\cdot\cfrac{\partial z_i}{\partial a_i}</math><br />
<br />
<math>\delta_i = \sum_{j} \delta_j\cdot u_{ji} \cdot \sigma'(a_i)</math><br />
<br />
where <math> \delta_j = \cfrac{\partial \left(y - \hat{y}\right)^2}{\partial a_j}</math><br />
<br />
The above equation essentially shows the effect of changes in the input <math>\ a_j </math> on the overall output <math>\ \hat{y} </math>, as well as the effect of changes in the input <math>\ a_i </math> on the input <math>\ a_j </math>. Note that if <math>\sigma(x)</math> is the sigmoid function, then <math>\sigma'(x) = \sigma(x)(1-\sigma(x))</math>.<br />
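The derivative identity for the sigmoid can be verified numerically with a centered finite difference (a quick sketch):<br />

```python
import math

def sigma(a):
    """Logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-a))

# Verify sigma'(a) = sigma(a) * (1 - sigma(a)) at a few sample points.
h = 1e-6
for a in (-2.0, 0.0, 1.5):
    numeric = (sigma(a + h) - sigma(a - h)) / (2 * h)   # finite-difference derivative
    analytic = sigma(a) * (1 - sigma(a))
    assert abs(numeric - analytic) < 1e-8
```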
<br />
The recursive definition of <math>\ \delta_i </math> can be considered as a cost function at layer <math>i</math> for achieving the original goal of optimizing the weights to minimize <math>\left(y - \hat{y}\right)^2</math>:<br />
<br />
<math>\delta_i= \sigma'(a_i)\sum_{j}\delta_j \cdot u_{ji}</math>.<br />
<br />
Now considering <math>\ \delta_k</math> for the output layer:<br />
<br />
<math>\delta_k= \cfrac{\partial \left(y - \hat{y}\right)^2}{\partial a_k}</math>.<br />
<br />
where <math>\,a_k = \hat{y}</math> because an activation function is not applied in the output layer. So, our calculation becomes:<br />
<br />
<math>\delta_k = \cfrac{\partial \left(y - \hat{y}\right)^2}{\partial \hat{y}} </math><br />
<br />
<math>\delta_k = -2(y - \hat{y})</math><br /><br />
The weights are then updated by gradient descent:<br />
<br />
<math>u_{il} \leftarrow u_{il} - \rho \cfrac{\partial (y - \hat{y})^2}{\partial u_{il}}</math><br />
<br />
Since <math>\ y </math> is known and <math>\ \hat{y} </math> can be computed for each data point (assuming small, random, initial values for the weights of the neural network), <math>\ \delta_k </math> can be calculated and "backpropagated" (i.e. the <math>\ \delta </math> values for the layer before the output layer can be computed using <math>\ \delta_k </math>, then the <math>\ \delta </math> values for the layer before that, and so on). Once all <math>\ \delta </math> values are known, the errors due to each of the weights <math>\ u </math> will be known and techniques like gradient descent can be used to optimize the weights. However, as the cost function for <math>\ \delta_i </math> shown above is not guaranteed to be convex, convergence to a global minimum is not guaranteed. This also means that changing the order in which the training points are fed into the network or changing the initial random values for the weights may lead to different results for the optimized weights (i.e. different local minima may be reached). <br />
<br />
===Overview of Full Backpropagation Algorithm ===<br />
The network weights are updated using the backpropagation algorithm when each training data point <math>\ x</math> is fed into the feed-forward neural network (FFNN). This update procedure uses the following steps: <br />
<br />
*First, choose some random initial weights (preferably close to zero) for your network.<br />
<br />
*Apply <math>\ x </math> to the FFNN's input layer, and calculate the outputs of all input neurons.<br />
<br />
*Propagate the outputs of each hidden layer forward, one hidden layer at a time, and calculate the outputs of all hidden neurons.<br />
<br />
*Once <math>\ x </math> reaches the output layer, calculate the output(s) of all output neuron(s) given the outputs of the previous hidden layer.<br />
<br />
*At the output layer, compute <math>\,\delta_k = -2(y_k - \hat{y}_k)</math> for each output neuron(s).<br />
<br />
*Compute each <math> \delta_i </math>, starting from <math>i=k-1</math> all the way to the first hidden layer, where <math>\delta_i= \sigma'(a_i)\sum_{j}\delta_j \cdot u_{ji}</math>.<br />
<br />
*Compute <math>\cfrac{\partial \left(y - \hat{y}\right)^2}{\partial u_{il}} = \delta_{i}z_l</math> for all weights <math>\,u_{il}</math>.<br />
<br />
*Then update <math>u_{il}^{\mathrm{new}} \leftarrow u_{il}^{\mathrm{old}} - \rho \cdot \cfrac{\partial \left(y - \hat{y}\right)^2}{\partial u_{il}} </math> for all weights <math>\,u_{il}</math>.<br />
<br />
*Continue for next data points and iterate on the training set until weights converge.<br />
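The steps above can be sketched for a tiny network with one sigmoid hidden layer and a linear output unit; the layer sizes, training pair, and omission of bias terms are illustrative assumptions:<br />

```python
import numpy as np

rng = np.random.default_rng(0)

def sigma(a):
    return 1.0 / (1.0 + np.exp(-a))

# Tiny network: 2 inputs -> 3 sigmoid hidden units -> 1 linear output.
U1 = rng.normal(scale=0.1, size=(3, 2))   # hidden-layer weights u_{il}
U2 = rng.normal(scale=0.1, size=(1, 3))   # output-layer weights

def train_step(x, y, rho=0.1):
    global U1, U2
    # Forward pass: a_i = sum_l z_l u_{il}, z_i = sigma(a_i)
    a1 = U1 @ x
    z1 = sigma(a1)
    y_hat = (U2 @ z1)[0]                  # no activation at the output: a_k = y_hat

    # Backward pass
    delta_k = -2.0 * (y - y_hat)          # delta at the output layer
    # delta_i = sigma'(a_i) * sum_j delta_j u_{ji}, with sigma' = sigma (1 - sigma)
    delta_1 = z1 * (1 - z1) * (U2[0] * delta_k)

    # Gradient d(y - y_hat)^2 / du_{il} = delta_i z_l, then u <- u - rho * gradient
    U2 -= rho * delta_k * z1[np.newaxis, :]
    U1 -= rho * np.outer(delta_1, x)
    return (y - y_hat) ** 2

# Repeatedly fit a single (assumed) training pair; the squared error shrinks.
x, y = np.array([0.5, -1.0]), 1.0
losses = [train_step(x, y) for _ in range(200)]
```

After a couple of hundred updates the squared error has shrunk by several orders of magnitude; on real data, the same loop would cycle over all training points in each epoch.<br />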
<br />
====Epochs====<br />
It is common to cycle through all of the data points multiple times in order to reach convergence. An epoch represents one cycle in which every data point is fed through the neural network. It is good practice to randomize the order in which the points are fed to the network within each epoch; this can prevent the weights from changing in cycles. The number of epochs required for convergence depends greatly on the learning rate and the convergence criteria used.<br />
<br />
===Limitations===<br />
*The convergence obtained from backpropagation learning is very slow.<br />
<br />
*The convergence in backpropagation learning is not guaranteed.<br />
<br />
*The result may converge to any local minimum on the error surface, since the error surface of a neural network is generally not convex.<br />
<br />
*Backpropagation learning requires input scaling or normalization; inputs are usually scaled into the range 0.1 to 0.9 for best performance.<ref>http://en.wikipedia.org/wiki/Backpropagation</ref><br />
<br />
*Numerical problems may be encountered when there are a large number of hidden layers, as the errors at each layer may become very small and vanish. <br />
<br />
===Deep Neural Network===<br />
<br />
Increasing the number of units within a hidden layer can increase the "flexibility" of the neural network, i.e. the network is able to fit to more complex functions. Increasing the number of hidden layers on the other hand can increase the "generalizability" of the neural network, i.e. the network is able to generalize well to new data points that it was not trained on. A deep neural network is a neural network with many hidden layers. Deep neural networks were introduced in recent years by the same researchers (Hinton et al. <ref name="HintonDeepNN"> G. E. Hinton, S. Osindero and Y. W. Teh, "A Fast Learning Algorithm for Deep Belief Nets", ''Neural Computation'', 2006. </ref>) that introduced the backpropagation algorithm to neural networks. The increased number of hidden layers in deep neural networks cannot be directly trained using backpropagation, because the errors at each layer will become very small and vanish as stated in the [[#Limitations | limitations]] section. To get around this problem, deep neural networks are trained a few layers at a time (i.e. two layers at a time). This process is still not straightforward as the target values for the hidden layers are not well defined (i.e. it is unknown what the correct target values are for the hidden layers given a data point and a label). ''Restricted Boltzmann Machines (RBM)'' and ''Greedy Learning Algorithms'' have been used to address this issue. For more information about how deep neural networks are trained, please refer to <ref name="HintonDeepNN"/>. A comparison of various neural network layouts including deep neural networks on a database of handwritten digits can be found at [http://yann.lecun.com/exdb/mnist/ THE MNIST DATABASE].<br />
<br />
One of the advantages of deep nets is that we can pre-train the network using unlabeled data (unsupervised learning) to obtain initial weights for the final training step using labeled data (fine-tuning). Since most available data are unlabeled, this method gives a better chance of finding good local optima than using only labeled data to train the parameters (weights) of the network. For more details on unsupervised pre-training and learning in deep nets, see <ref><br />
http://jmlr.csail.mit.edu/proceedings/papers/v9/erhan10a/erhan10a.pdf<br />
</ref> and <ref><br />
http://www.cs.toronto.edu/~hinton/absps/tics.pdf<br />
</ref><br />
<br />
An interesting structure of the deep neural network is where the number of nodes in each hidden layer decreases towards the "center" of the network and then increases again. See figure below for an illustration.<br />
<br />
[[File:DeepNNarchitecture.png|500px|thumb|center|A specific architecture for deep neural networks with a "bottleneck".]]<br />
<br />
The central part, with the fewest nodes in its hidden layer, can be seen as a reduced-dimensional representation of the input data features. It would be interesting to compare the dimensionality-reduction effect of this kind of deep neural network to a cascade of PCA.<br />
<br />
It is known that training DNNs is hard <ref>http://ecs.victoria.ac.nz/twiki/pub/Courses/COMP421_2010T1/Readings/TrainingDeepNNs.pdf</ref>, since randomly initializing the weights of the network and applying gradient descent can land in poor local minima. In order to train DNNs better, [http://ecs.victoria.ac.nz/twiki/pub/Courses/COMP421_2010T1/Readings/TrainingDeepNNs.pdf Exploring Strategies for Training Deep Neural Networks] looks at 3 principles:<br />
# Pre-training one layer at a time in a greedy way,<br />
# Using unsupervised learning at each layer,<br />
# Fine-tuning the whole network with respect to the ultimate criterion.<br />
Their experiments show that by providing hints at each layer for the representation, the weights can be initialized such that a better minimum can be reached.<br />
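The three principles above can be sketched in miniature. The following Python example (an illustrative sketch, not the course's code; the tied-weight autoencoder, the toy data, and all names are assumptions) greedily pre-trains a "bottleneck" stack one layer at a time, each layer minimizing an unsupervised reconstruction error on the previous layer's codes:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pretrain_layer(X, n_hidden, lr=0.1, epochs=200):
    """Greedily train one layer as a tied-weight autoencoder:
    encode h = sigmoid(X W + b), decode X' = h W^T + c,
    minimizing the reconstruction error |X - X'|^2."""
    n, d = X.shape
    W = rng.normal(0, 0.1, (d, n_hidden))
    b, c = np.zeros(n_hidden), np.zeros(d)
    for _ in range(epochs):
        H = sigmoid(X @ W + b)          # encode
        Xr = H @ W.T + c                # decode (tied weights)
        G = Xr - X                      # gradient of 0.5*|X - Xr|^2 wrt Xr
        dH = (G @ W) * H * (1 - H)      # backprop through the sigmoid
        W -= lr / n * (X.T @ dH + G.T @ H)
        b -= lr / n * dH.sum(0)
        c -= lr / n * G.sum(0)
    return W, b

# Stack layers: each one is pre-trained on the previous layer's codes.
X = rng.normal(size=(200, 8))
weights, H = [], X
for n_hidden in (6, 4, 2):              # a "bottleneck" stack
    W, b = pretrain_layer(H, n_hidden)
    weights.append((W, b))
    H = sigmoid(H @ W + b)              # codes fed to the next layer

print([w.shape for w, _ in weights])    # [(8, 6), (6, 4), (4, 2)]
```

After this unsupervised pass, the stacked weights would serve as the initialization for supervised fine-tuning of the whole network.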
<br />
===Applications of Neural Networks===<br />
* Sales forecasting<br />
* Industrial process control<br />
* Customer research<br />
* Data validation<br />
* Risk management<br />
* Target marketing<br />
<ref><br />
http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html#Applications of neural networks<br />
</ref><br />
<br />
==Model Selection (Complexity Control)==<br />
<br />
<br />
<br />
Selecting a proper statistical model for a given data set is a well-known problem in pattern recognition and machine learning. Systems with the optimal complexity have a good [http://www.csc.kth.se/~orre/snns-manual/UserManual/node16.html generalization] to yet unobserved data. In the complexity control problem, we are looking for an appropriate model order which gives us the best generalization capability for the unseen data points, while fitting the seen data well. Model complexity here can be defined in terms of over-fitting and under-fitting situations defined in the following section.<br />
<br />
== Over-fitting and Under-fitting ==<br />
[[File:overfitting-model.png|500px|thumb|right| Example of overfitting and underfitting situations. The blue line is a high-degree polynomial which goes through most of the training data points and gives a very low training error, however has a very poor generalization for the unseen data points. The red line, on the other hand, is underfitted to the training data samples.]]<br />
There are two situations which should be avoided in classification and pattern recognition systems:<br />
#[http://en.wikipedia.org/wiki/Overfitting Overfitting]<br />
#Underfitting<br />
<br />
In short, overfitting occurs when the model tries to capture every detail of the data, which can happen if the model has too many parameters relative to the number of observations. Overfitted models have small training error but large testing error. Underfitting, on the other hand, occurs when the model does not capture the complexity of the data, and shows up as a large training error.<br />
<br />
Suppose there is no noise in the training data, then we would face no problem with over-fitting, because in this case every training data point lies on the underlying function, and the only goal is to build a model that is as complex as needed to pass through every training data point. <br />
<br />
However, in the real-world, the training data are [http://en.wikipedia.org/wiki/Statistical_noise noisy], i.e. they tend to not lie exactly on the underlying function, instead they may be shifted to unpredictable locations by random noise. If the model is more complex than what it needs to be in order to accurately fit the underlying function, then it would end up fitting most or all of the training data. Consequently, it would be a poor approximation of the underlying function and have poor prediction ability on new, unseen data. <br />
<br />
The danger of overfitting is that the model becomes unreliable when predicting values outside the range of the training data; it can produce wild predictions in multilayer perceptrons, even with noise-free data. To avoid overfitting, techniques such as cross validation and model comparison may be necessary. The size of the training set is also important: it should contain a sufficient number of appropriately sampled data points, so that it is representative of the whole data space.<br />
<br />
In a Neural Network, if the number of hidden layers or nodes is too high, the network will have many degrees of freedom and will learn every characteristic of the training data set. That means it will fit the training set very precisely, but will not be able to generalize the commonality of the training set to predict the outcome of new cases.<br />
<br />
Underfitting occurs when the model we picked to describe the data is not complex enough, and has a high error rate on the training set.<br />
There is always a trade-off. If our model is too simple, underfitting could occur and if it is too complex, overfitting can occur.<br />
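This trade-off can be made concrete with a small Python sketch (illustrative toy data assumed, not part of the course material): fitting polynomials of increasing degree to noisy quadratic data, the training error shrinks monotonically with degree, while the underfit line suffers a large test error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy quadratic data: the true underlying function plus Gaussian noise
f = lambda x: 4 * (x - 0.5) ** 2
x_tr, x_te = rng.uniform(0, 1, 20), rng.uniform(0, 1, 200)
y_tr = f(x_tr) + rng.normal(0, 0.2, 20)
y_te = f(x_te) + rng.normal(0, 0.2, 200)

train_err, test_err = {}, {}
for degree in (1, 2, 5):
    coef = np.polyfit(x_tr, y_tr, degree)     # least-squares polynomial fit
    train_err[degree] = np.mean((np.polyval(coef, x_tr) - y_tr) ** 2)
    test_err[degree] = np.mean((np.polyval(coef, x_te) - y_te) ** 2)

# Training error always decreases with degree; test error need not.
print({d: round(e, 3) for d, e in train_err.items()})
print({d: round(e, 3) for d, e in test_err.items()})
```

Because the models are nested, the degree-5 training error can never exceed the degree-1 training error, yet its test error carries extra variance.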
<br />
=== Different Approaches for Complexity Control ===<br />
<br />
We would like to have a classifier that minimizes the true error rate <math>\ L(h)</math>:<br />
<br />
<math>\ L(h)=Pr\{h(x)\neq y\}</math><br />
<br />
<span id="prediction-error">[[File:Prediction_Error.jpg|240px|thumb|right| Model complexity]]</span><br />
<br />
Because the true error rate cannot be determined directly in practice, we can try using the empirical error rate (i.e. the training error rate): <br />
<br />
<math>\ \hat L(h)= \frac{1}{n} \sum_{i=1}^{n} I(h(x_{i}) \neq y_{i})</math><br />
<br />
However, the empirical (training) error rate is biased downward. Minimizing it does not find the best classifier model, but rather overfits to the training data. Thus, the training error rate cannot be used on its own for model selection.<br /><br />
<br />
The complexity of a fitted model depends on the degree of the fitting function. In the graph, the area to the left of the critical point corresponds to under-fitting; the inaccuracy there results from the low complexity of the model. The area to the right of the critical point corresponds to over-fitting, where the model does not generalize.<br /><br />
<br />
As illustrated in the figure to the right, the training error rate is always less than the true error rate, i.e. "biased downward". Also, the training error will always decrease with an increase in the complexity of the model used to fit the data. This does not reflect the behavior of the true error rate. The true error rate will have a unique minimum as the model complexity changes. <br />
<br />
So, if the training error rate is the only criteria used for picking a model, overfitting can occur. An overfitted model has low training error rate, but is not able to generalize well to new test data points. On the other hand, underfitting can occur when a model that is not complex enough is picked (e.g. using a first order model for data that follows a second order trend). Both training and test error rates will be high in that case. The best choice for the model complexity is where the true error rate reaches its minimum point. Thus, model selection involves ''controlling the complexity'' of the model. The true error rate can be approximated using the test error rate, i.e. the test error follows the same trend that the true error rate does when the model complexity is changed. <br />
In this case, we assume there is a test data set <math>\,x_1, . . . ,x_n</math> and these points follow some unknown distribution. To learn about this distribution, we can estimate some unknown quantities, such as <math>\,f</math>, the mean <math>\,E(x_i)</math>, the variance <math>\,var(x_i)</math> and more.<br />
<br />
To estimate <math>\,f</math>, we use an observation function as our estimator. <br />
<br />
<math>\hat{f}(x_1,...,x_n)</math>. <br />
<br />
<math>Bias (\hat{f}) = E(\hat{f}) - f</math><br />
<br />
<math>MSE (\hat{f}) = E[(\hat{f} - f)^2]=Variance (\hat f)+Bias^2(\hat f )</math><br />
<br />
<math>Variance (\hat{f}) = E[(\hat{f} - E(\hat{f}))^2]</math><br />
<br />
If the estimator is unbiased, i.e.<br />
<br />
<math>Bias (\hat{f}) = E(\hat{f}) - f=0,</math><br />
<br />
then <math>MSE (\hat{f})=Variance (\hat{f})</math> and we just need to minimize the variance.<br />
<br />
In general, the decomposition <math>MSE (\hat{f})=Variance (\hat{f})+Bias ^2(\hat{f}) </math> expresses the bias-variance trade-off: for a fixed MSE, a lower bias implies a higher variance and vice versa.<br />
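The decomposition can be verified numerically. In this Python sketch (illustrative; the shrunk estimator is a hypothetical biased example, all names assumed), the empirical MSE, variance, and squared bias are computed for two estimators of a mean, and MSE = Variance + Bias&sup2; holds exactly for the empirical moments:

```python
import numpy as np

rng = np.random.default_rng(1)
f = 2.0                        # the unknown quantity we estimate (true mean)
n, reps = 25, 10000

# Two estimators of f from a sample x_1..x_n ~ N(f, 1):
# the sample mean (unbiased) and a shrunk mean (biased, lower variance).
samples = rng.normal(f, 1.0, size=(reps, n))
mean_est = samples.mean(axis=1)
shrunk_est = 0.8 * mean_est    # a hypothetical biased estimator

for est in (mean_est, shrunk_est):
    mse = np.mean((est - f) ** 2)
    var = np.var(est)
    bias = est.mean() - f
    # Empirical check of MSE = Variance + Bias^2
    print(f"MSE={mse:.4f}  Var={var:.4f}  Bias^2={bias**2:.4f}")
```

The shrunk estimator trades a nonzero bias for a smaller variance, which is exactly the trade-off described above.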
<br />
<br />
<br />
<br />
In order to avoid overfitting, there are two main strategies:<br />
<br />
# ''Estimate the error rate''<br />
## Cross-validation<br />
## Computing an error bound (probability inequality)<br />
# ''Regularization''<br />
## We make the function (model) smooth by limiting its complexity or by limiting the size of the weights.<br />
<br />
===Cross Validation===<br />
<br />
[[File:k-fold.png|350px|thumb|right|Graphical illustration of 4-fold cross-validation. V is the part used for validation and T is used for training.]]<br />
<br />
Cross-validation is an approach for avoiding overfitting while modelling data that bases the choice of model parameters on a portion of the training set, while using the rest of the set for validation, i.e., some of the data is left out when fitting the model. One round of the process involves partitioning the data set into two complementary subsets, fitting the model to one subset (called the training set), and testing the model against the other subset (called the validation or testing subset). This is usually repeated several times using different partitions in order to reduce variability, and the validation results are then averaged over the rounds.<br />
<br />
====LOO: Leave-one-out cross-validation ====<br />
When the dataset is very small, leaving even one tenth of it out depletes the training data too much, while making the validation set smaller makes the estimate of the true error unstable (noisy). One solution is a kind of round-robin validation: for each complexity setting, learn a classifier on all the training data minus one example and evaluate its error on the remaining example. The leave-one-out error is defined as:<br />
<br />
'''LOO error''': <math>\frac {1}{n} \sum_{i=1}^{n} I (h(x_i; D_{-i})\neq y_i)</math><br />
where <math>D_{-i}</math> is the dataset minus the <math>i</math>th example and <math>h(x_i; D_{-i})</math> is the classifier learned on <math>D_{-i}</math>. The LOO error is an unbiased estimate of the error of our learning algorithm (for a given complexity setting) when given <math>n-1</math> examples.<br />
<br />
====K-Fold Cross Validation====<br />
<br />
Instead of minimizing the training error, here we minimize the validation error.<br /><br />
<br />
A common type of cross-validation that is used for relatively small data sets is K-fold cross-validation, the algorithm for which can be stated as follows:<br />
<br />
Let h denote a classification model to be fitted to a given data set.<br />
<br />
# Randomly partition the original data set into K subsets of approximately the same size. A common choice for K is K = 10.<br />
# For k = 1 to K do the following<br />
## Remove subset k from the data set<br />
## Estimate the parameters of each different classification model based only on the remaining data points. Denote the resulting function by h(k)<br />
## Use h(k) to predict the data points in subset k. Denote by <math>\begin{align}\hat L_k(h)\end{align}</math> the observed error rate.<br />
# Compute the average error <math>\hat L(h) = \frac{1}{K} \sum_{k=1}^{K} \hat L_k(h)</math><br />
<br />
The best classifier is the model that results in the lowest average error rate.<br />
<br />
A common variation of k-fold cross-validation uses a single observation from the original sample as the validation data, and the remaining observations as the training data. This is then repeated such that each sample is used once for validation. It is the same as a K-fold cross-validation with K being equal to the number of points in the data set, and is referred to as leave-one-out cross-validation. <ref> stat.psu.edu/~jiali/course/stat597e/notes2/percept.pdf</ref><br />
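The K-fold procedure above can be sketched in Python (illustrative toy polynomial data assumed; setting K equal to the number of data points would give leave-one-out):

```python
import numpy as np

rng = np.random.default_rng(2)

# Noisy quadratic data; we pick the polynomial degree by 10-fold CV.
n = 100
x = rng.uniform(-1, 1, n)
y = 1 + 2 * x - 3 * x**2 + rng.normal(0, 0.2, n)

def kfold_cv_error(x, y, degree, K=10):
    """Average validation MSE over K folds, following the steps above."""
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, K)                   # step 1: K subsets
    errs = []
    for k in range(K):
        val = folds[k]                               # step 2a: remove fold k
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        coef = np.polyfit(x[train], y[train], degree)  # step 2b: fit h^(k)
        pred = np.polyval(coef, x[val])                # step 2c: predict fold k
        errs.append(np.mean((pred - y[val]) ** 2))
    return np.mean(errs)                             # step 3: average error

cv_err = {d: kfold_cv_error(x, y, d) for d in range(1, 8)}
best = min(cv_err, key=cv_err.get)
print("degree -> CV error:", {d: round(e, 4) for d, e in cv_err.items()})
print("selected degree:", best)
```

The underfit linear model is heavily penalized by its validation error, so the selected degree is at least quadratic.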
<br />
====Alternatives to Cross Validation for model selection:====<br />
# Akaike Information Criterion (AIC): This approach ranks models by their AIC values. The model with the minimum AIC is chosen. The AIC value is: <math>AIC = 2k - 2\log(L_{max})</math>, where <math>k</math> is the number of parameters and <math>L_{max}</math> is the maximum value of the likelihood function of the model. This selection method penalizes the number of parameters.<ref>http://en.wikipedia.org/wiki/Akaike_information_criterion</ref><br />
# Bayesian Information Criterion (BIC): It is similar to AIC but penalizes the number of parameters more heavily. The BIC value is: <math>BIC = k\log(n) - 2\log(L_{max})</math>, where <math>n</math> is the sample size.<ref>http://en.wikipedia.org/wiki/Bayesian_information_criterion</ref><br />
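As a sketch of both criteria (Python; Gaussian-error polynomial regression is an assumption, for which the maximized log-likelihood has the closed form <math>-\tfrac{n}{2}(\log(2\pi\hat\sigma^2)+1)</math>):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = rng.uniform(-1, 1, n)
y = 1 + 2 * x - 3 * x**2 + rng.normal(0, 0.3, n)

def aic_bic(x, y, degree):
    """AIC/BIC for a degree-d polynomial fit with Gaussian errors.
    k counts the d+1 coefficients plus the noise variance."""
    coef = np.polyfit(x, y, degree)
    resid = y - np.polyval(coef, x)
    sigma2 = np.mean(resid**2)                  # MLE of the noise variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    k = degree + 2
    return 2 * k - 2 * loglik, k * np.log(n) - 2 * loglik

scores = {d: aic_bic(x, y, d) for d in range(1, 7)}
best_aic = min(scores, key=lambda d: scores[d][0])
best_bic = min(scores, key=lambda d: scores[d][1])
print("AIC choice:", best_aic, " BIC choice:", best_bic)
```

Since <math>\log(200) > 2</math>, the BIC score exceeds the AIC score for every candidate here, reflecting its heavier penalty.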
<br />
== Model Selection Continued (Lecture: Oct. 20, 2011) ==<br />
<br />
=== Error Bound Computation ===<br />
Apart from cross validation, another approach for estimating the error rates of different models is to find a bound to the error. This works well theoretically to compare different models, however, in practice the error bounds are not a good indication of which model to pick because the error bounds are not ''tight''. This means that the actual error observed in practice may be a lot better than what was indicated by the error bounds. This is because the error bounds indicate the worst case errors and by only comparing the error bounds of different models, the worst case performance of each model is compared, but not the overall performance under normal conditions. <br />
<br />
=== Penalty Function ===<br />
Another approach for model selection to avoid overfitting is to use ''regularization''. Regularization involves adding extra information or restrictions to the problem in order to prevent overfitting. This additional information can be in the form of a function penalizing high complexity (penalty function). So in regularization, instead of minimizing the squared error alone we attempt to minimize the squared error plus a penalty function. A common penalty function is the euclidean norm of the parameter vector multiplied by some scaling parameter. The scaling parameter allows for balancing the relative importance of the two terms. <br /> This means minimizing the following new objective function:<br /><br />
<math> \left|y-\hat{y}\right|^2+f(\theta)</math><br /><br />
where <math>\ \theta</math> denotes the model parameters and <math>\ f(\theta)</math> is the penalty function. The penalty function should increase as the model increases in complexity; this way it counteracts the downward bias of the training error rate. There is no single optimal choice for the penalty function, but it should increase as the complexity and the size of the estimates increase. <br />
<br />
There is no optimal choice for the penalty function but they all seek to solve the same problem. Suppose you have models of order 1,2,...,K such that the models of class k-1 are a subset of the models in class k. An example of this is linear regression where a model of order k is the model with the first k explanatory covariates. If you do not include a penalty term and minimize the squared error alone you will always choose the largest most complex model (K). But the problem with this is the gain from including more complexity might be incredibly small. The gain in accuracy may in fact be no better than you would expect from including a covariate drawn from a N(0,1) distribution. If this is the case then clearly we don't want to include such a covariate. And in general if the increase in accuracy is below a certain level then it is preferable to stay with the simpler model. By adding a penalty term, no matter how small it is, you know at least at some point these insignificant gains in accuracy will be outweighed by increase in penalty. By effectively choosing and scaling your penalty function you can have your objective function approximate the true error as opposed to the training error.<br />
<br /><br />
<br />
==== Example: Penalty Function in Neural Network Model Selection ====<br />
<br />
In MLP neural networks, the activation function is of the form of a logistic function, where the function behaves almost linearly when the input is close to zero (i.e., the weights of the neural network are close to zero), while the function behaves non-linearly as the magnitude of the input increases (i.e., the weights of the neural network become larger). In order to penalize additional model complexity (i.e., unnecessary non-linearities in the model), large weights will be penalized by the penalty function.<br />
<br />
The objective function to minimize with respect to the weights <math>\ u_{ji}</math> is:<br /><br />
<br />
<math>\ Reg=\left|y-\hat{y}\right|^2 + \lambda\sum_{i=1}^{n}(u_{ji})^2</math> <br />
If the weights start to grow, the penalty term <math>\sum_{i=1}^{n}(u_{ji})^2</math> becomes larger even as the fitting term <math>\left|y-\hat{y}\right|^2</math> becomes smaller, so the minimization balances fit against weight size.<br />
<br />
The derivative of the objective function with respect to the weights <math>\ u_{ji}</math> is:<br /><br />
<math>\cfrac{\partial Reg}{\partial u_{ji}} = \cfrac{\partial \left|y-\hat{y}\right|^2}{\partial u_{ji}}+2*\lambda*u_{ji}</math> <br />
<br />
This objective function is used during [http://en.wikipedia.org/wiki/Gradient_descent gradient descent]. In practice, cross validation is used to determine the value of <math>\ \lambda</math> in the objective function.<br /><br />
<br />
Cross validation can be used to choose <math>\lambda</math>: a large <math>\lambda</math> forces the weights toward zero, so the model stays close to the least "complex" (nearly linear) regime, while a smaller <math>\lambda</math> lets the complexity grow.<br />
<br />
We want a non-linear model, but not one that is too curvy.<br />
<br />
==== Penalty Functions in Practice ====<br />
In practice, we only apply the penalty function to the parametrized terms. That is, the bias term is not regularized, since it is simply the DC component and is not associated with a feature. Although this makes little difference, the concept is clear that the bias term should not be considered when determining the relative weights of the features.<br />
<br />
In particular, we update the weights as follows:<br />
<br />
<math><br />
u_{ji} := <br />
\begin{cases} <br />
u_{ji} - \alpha \cfrac{\partial \left|y-\hat{y}\right|^2}{\partial u_{ji}} &\mbox{bias term}\\<br />
u_{ji} - \alpha \left( \cfrac{\partial \left|y-\hat{y}\right|^2}{\partial u_{ji}}+2\lambda u_{ji} \right) &\mbox{otherwise}<br />
\end{cases}<br />
</math><br />
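As a sketch of this update in practice (Python; a toy linear model is assumed for simplicity, and all names are illustrative), gradient descent on the penalized squared error leaves the bias unregularized:

```python
import numpy as np

rng = np.random.default_rng(4)

# Gradient descent on |y - y_hat|^2 / n + lambda * sum(u_j^2), leaving
# the bias (intercept) unregularized, as in the update rule above.
X = np.c_[np.ones(50), rng.normal(size=(50, 3))]   # column 0 is the bias
true_u = np.array([0.5, 1.0, -2.0, 3.0])
y = X @ true_u + rng.normal(0, 0.1, 50)

def step(u, X, y, alpha=0.1, lam=0.01):
    grad = 2 * X.T @ (X @ u - y) / len(y)   # d(mean squared error)/du
    penalty = 2 * lam * u                   # d(lambda * |u|^2)/du
    penalty[0] = 0.0                        # do not regularize the bias term
    return u - alpha * (grad + penalty)

u = np.zeros(4)
for _ in range(2000):
    u = step(u, X, y)
print(np.round(u, 2))   # close to true_u, with slight shrinkage
```

With a small <math>\lambda</math> the recovered weights sit close to the true ones; increasing <math>\lambda</math> would shrink every component except the bias.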
<br />
== Radial Basis Function Neural Network (RBF NN) ==<br />
A [http://en.wikipedia.org/wiki/Radial_basis_function_network Radial Basis Function Network] (RBF NN) is a type of neural network with only one hidden layer in addition to an input and output layer. Each node within the hidden layer uses a radial basis activation function, hence the name of the RBF NN. A radial basis function is a real-valued function whose value depends only on the distance from a center. One of the most commonly used radial basis functions is the Gaussian. The weights from the input layer to the hidden layer are always "1" in an RBF NN, while the weights from the hidden layer to the output layer are adjusted during training. The output unit implements a weighted sum of the hidden unit outputs; thus the hidden layer applies a nonlinear transformation of the input while the output layer is linear. Due to their nonlinear approximation properties, RBF NNs are able to model complex mappings, which perceptron-based neural networks can only model by means of multiple hidden layers. An RBF NN can be trained without back propagation since it has a closed-form solution. RBF NNs have been successfully applied to a large diversity of applications including interpolation, chaotic time series modeling, system identification, control engineering, electronic device parameter modeling, channel equalization, speech recognition, image restoration, shape-from-shading, 3-D object modeling, motion estimation and moving object segmentation, data fusion, etc. <ref>www-users.cs.york.ac.uk/adrian/Papers/Others/OSEE01.pdf</ref><br />
<br />
====The Network System====<br />
<br />
1. Input: <br />n data points <math>\mathbf{x}_i\subset \mathbb{R}^d, \quad i=1,...,n</math><br /><br />
2. Basis function ('''the single hidden layer'''): <br /><br />
<math>\mathbf{\phi}_{n*m}</math>, where <math>m</math> is the number of the neurons/basis functions that project original data points into a new space. <br /><br />
There are many choices for the basis function. The commonly used is radial basis:<br /><br />
<math>\phi_j(\mathbf{x}_i)=e^{-|\mathbf{x}_i-\mathbf{\mu}_j|^2}</math><br /><br />
3. Weights associated with the last layer: <math>\mathbf{W}_{m*k}</math>, where k is the number of classes in the output <math>\mathbf{Y}</math>.<br /><br />
4. Output: <math>\mathbf{Y}</math>, where<br /><br />
<math>y_k(x)=\sum_{j=1}^{m}(W_{jk}*\phi_j(x))</math><br /><br />
Alternatively, the output <math>\mathbf{Y}</math> can be written as<br />
<math><br />
Y=\phi*W<br />
</math><br />
<br />
where<br />
<br />
:<math>\hat{Y}_{n,k} = \left[ \begin{matrix}<br />
\hat{y}_{1,1} & \hat{y}_{1,2} & \cdots & \hat{y}_{1,k} \\<br />
\hat{y}_{2,1} & \hat{y}_{2,2} & \cdots & \hat{y}_{2,k} \\<br />
\vdots &\vdots & \ddots & \vdots \\<br />
\hat{y}_{n,1} & \hat{y}_{n,2} & \cdots & \hat{y}_{n,k}<br />
\end{matrix}\right] </math> is the matrix of output variables. <br />
<br />
:<math>\Phi_{n,m} = \left[ \begin{matrix}<br />
\phi_{1}(\mathbf{x}_1) & \phi_{2}(\mathbf{x}_1) & \cdots & \phi_{m}(\mathbf{x}_1) \\<br />
\phi_{1}(\mathbf{x}_2) & \phi_{2}(\mathbf{x}_2) & \cdots & \phi_{m}(\mathbf{x}_2) \\<br />
\vdots & \vdots & \ddots & \vdots \\<br />
\phi_{1}(\mathbf{x}_n) & \phi_{2}(\mathbf{x}_n) & \cdots & \phi_{m}(\mathbf{x}_n)<br />
\end{matrix}\right] </math> is the matrix of Radial Basis Functions.<br />
<br />
:<math>W_{m,k} = \left[ \begin{matrix}<br />
w_{1,1} & w_{1,2} & \cdots & w_{1,k} \\<br />
w_{2,1} & w_{2,2} & \cdots & w_{2,k} \\<br />
\vdots & \vdots & \ddots & \vdots \\<br />
w_{m,1} & w_{m,2} & \cdots & w_{m,k}<br />
\end{matrix}\right] </math> is the matrix of weights.<br />
<br />
Here, <math>k</math> is the number of outputs, <math>n</math> is the number of data points, and <math>m</math> is the number of hidden units. If <math>k = 1</math>, <math>\hat Y</math> and <math>W</math> are column vectors. If m = n, then <math>\mathbf{\mu}_i = \mathbf{x}_i</math>, so <math>\phi_{i}</math> checks to see how similar the two data points are.<br />
<br />
<math>Y=\Phi W</math>, where <math>Y</math> and <math>\Phi</math> are known while <math>W</math> is unknown.<br />
The objective function is <math>\psi=|Y-\Phi W|^2 </math> and we want <math> \underset{W}{\mbox{min}} |Y-\Phi W|^2 </math>. Therefore, the optimal weights are <math>W=(\Phi^T \Phi)^{-1}\Phi^TY</math><br />
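A minimal Python sketch of this closed-form solve (the toy two-class data, the hand-picked centres, and the width parameter are all assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)

# RBF network in matrix form: Gaussian basis functions at m fixed centres,
# then the closed-form least-squares solve for the output weights W.
def design_matrix(X, centres, gamma=0.5):
    # Phi[i, j] = exp(-gamma * |x_i - mu_j|^2)
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# Two Gaussian blobs with one-hot class labels (k = 2)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
Y = np.vstack([np.tile([1.0, 0.0], (50, 1)), np.tile([0.0, 1.0], (50, 1))])

centres = np.vstack([X[:3], X[50:53]])    # m = 6 centres, 3 per blob
Phi = design_matrix(X, centres)
# W = (Phi^T Phi)^{-1} Phi^T Y, computed stably via least squares
W, *_ = np.linalg.lstsq(Phi, Y, rcond=None)

pred = (Phi @ W).argmax(axis=1)
accuracy = (pred == Y.argmax(axis=1)).mean()
print("training accuracy:", accuracy)
```

A single linear solve replaces backpropagation here, which is the practical appeal of the RBF architecture.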
<br />
==== Network Training====<br />
To construct m basis functions, first cluster data points into m groups. Then find the centre of each cluster <math>\mu_1</math> to <math>\mu_m</math>.<br /><br />
<br />
'''Clustering: the K-means algorithm''' <ref>This section is taken from Wikicourse notes stat441/841 fall 2010.</ref><br /><br />
K-means is a commonly applied technique in clustering observations into groups by minimizing the distance of individual observations from the center of the cluster it is in. The most common K-means algorithm used is referred to as [http://en.wikipedia.org/wiki/Lloyd%27s_algorithm Lloyd's algorithm]: <br /><br />
<br />
# Select the number of clusters m.<br />
# Randomly select m observations from the n observations, to be used as the m initial centers. (Alternatively, randomly assign all data points to clusters and use the means of those clusters as the initial centers.)<br />
# For each of the remaining observations, compute the distance to each of the current centers and assign the observation to the cluster with the minimum distance.<br />
# Obtain updated cluster centers by computing the mean of all the observations in the corresponding clusters.<br />
# Repeat Steps 3 and 4 until all of the differences between the old cluster centers and the new cluster centers are acceptable.<br />
<br />
Note: K means can be sensitive to the originally selected points, so it may be useful to run K-means repeatedly and use prior knowledge to select the best cluster.<br />
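The steps above, with the suggested random restarts, can be sketched in Python (illustrative toy data; all names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)

def kmeans(X, m, iters=50):
    """One run of Lloyd's algorithm following the steps above."""
    centres = X[rng.choice(len(X), m, replace=False)]   # step 2
    for _ in range(iters):
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)                      # step 3: nearest centre
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centres[j] for j in range(m)])
        if np.allclose(new, centres):                   # step 5: converged
            break
        centres = new                                   # step 4: update means
    sse = d2.min(axis=1).sum()                          # within-cluster error
    return centres, labels, sse

# Three well-separated 2-D blobs; restart K-means and keep the best run,
# since the result is sensitive to the initial centres.
X = np.vstack([rng.normal(c, 0.3, (40, 2)) for c in (-4.0, 0.0, 4.0)])
centres, labels, sse = min((kmeans(X, 3) for _ in range(20)),
                           key=lambda r: r[2])
print(np.round(np.sort(centres[:, 0]), 1))
```

Keeping the restart with the lowest within-cluster error is one simple way to act on the sensitivity noted above.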
<br />
Having constructed the basis functions, next minimize the objective function with respect to <math>\mathbf{W}</math>:<br /><br />
<math> \underset{W}{\mbox{min}} \; \left\| Y-\Phi W\right\|_2^{2}</math><br />
<br />
The solution to the problem is<br />
<math>\ <br />
W=(\Phi^T\Phi)^{-1}\Phi^TY <br />
</math><br />
<br />
Matlab example (note that this particular example trains a feed-forward MLP with ''newff'' rather than an RBF network):<br />
<br />
clear all;<br />
clc;<br />
load ionosphere.mat;<br />
P=ionosphere(:,1:(end-1)); % input features<br />
P=P'; % one column per sample, as the toolbox expects<br />
T=ionosphere(:,end); % class labels<br />
T=T';<br />
net=newff(minmax(P),[4,1],{'logsig','purelin'},'trainlm'); % 4 hidden units, Levenberg-Marquardt training<br />
net.trainParam.show=100;<br />
net.trainParam.mc=0.9; % momentum constant<br />
net.trainParam.mu=0.05; <br />
net.trainParam.mu_dec=0.1;<br />
net.trainParam.mu_inc=5;<br />
net.trainParam.lr=0.5; % learning rate<br />
net.trainParam.goal=0.01; % performance goal<br />
net.trainParam.epochs=5000; % maximum number of epochs<br />
net.trainParam.max_fail=10;<br />
net.trainParam.min_grad=1e-20; <br />
net.trainParam.mem_reduc=2;<br />
net.trainParam.alpha=0.1;<br />
net.trainParam.delt_inc=1;<br />
net.trainParam.delt_dec=0.1;<br />
net=init(net); % initialize the weights<br />
[net,tr]=train(net,P,T); % train the network<br />
A = sim(net,P); % network output on the training set<br />
E = T - A; % residuals<br />
disp('the training error:')<br />
MSE=mse(E)<br />
<br />
===Single Basis Function vs. Multiple Basis Functions===<br />
Suppose the data points belong to a mixture of Gaussian distributions.<br /><br />
<br />
Under the '''single basis''' function approach, every class in <math>\mathbf{Y}</math> is represented by a single basis function. This approach is similar to linear discriminant analysis. <br />
<br />
Compare <math>y_k(x)=\sum_{j=1}^{m}(W_{jk}*\phi_j(x))</math><br /><br />
with <math>P(Y|X)=\frac{P(X|Y)*P(Y)}{P(X)}</math>. <br /> Here, the basis function <math>\mathbf{\phi}_{j}</math> can be thought of as equivalent to <math>\frac{P(X|Y)}{P(X)}</math>.<br /><br />
<br />
Under the '''multiple basis''' function approach, a layer of <math>j</math> basis functions is placed between <math>\mathbf{Y}</math> and <math>\mathbf{X}</math>. The probability function of the joint distribution of <math>\mathbf{X}</math>, <math>\mathbf{J}</math> and <math>\mathbf{Y}</math> is<br />
<br />
<math>\,P(X,J,Y)=P(Y)*P(J|Y)*P(X|J)</math><br />
<br />
Here, instead of using a single Gaussian to represent each class, we use a mixture of Gaussians.<br /><br />
The probability function of <math>\mathbf{Y}</math> conditional on <math>\mathbf{X}</math> is<br />
<br />
<math>P(Y|X)=\frac{P(X,Y)}{P(X)}=\frac{\sum_{j}{P(X,J,Y)}}{P(X)}</math><br />
<br />
Multiplying and dividing by <math>\ P(J) </math> yields <br />
<br />
<math>\ P(Y|X)=\sum_{j}{P(J|X)*P(Y|J)}</math><br /><br />
where <math>\ P(J|X)</math> gives, for the given data <math>X</math>, how likely the data point is to come from Gaussian <math>J</math>, and <math>\ P(Y|J)</math> gives, for a given Gaussian <math>J</math>, how likely that Gaussian is to belong to class <math>k</math>.<br />
<br />
<br />
since<br /><br />
<math>\ P(J|X)=\frac{P(X|J)*P(J)}{P(X)}</math> <br />
and <math>\ P(Y|J)=\frac{P(J|Y)*P(Y)}{P(J)}</math><br />
<br />
If the weights in the radial basis neural network have the properties of a probability function, then the basis function <math>\mathbf{\phi}_j</math> can be thought of as <math>\ P(J|X)</math>, the probability that <math>\mathbf{x}</math> belongs to Gaussian <math>j</math>; and the weight matrix <math>W</math> can be thought of as <math>\ P(Y|J)</math>, the probability that a data point belongs to class <math>k</math> given that it comes from Gaussian <math>j</math>.<br /><br />
<br />
In conclusion, given a mixture of Gaussian distributions, the multiple basis function approach is better than the single basis function approach, since the former produces a non-linear boundary.<br />
<br />
== RBF Network Complexity Control (Lecture: Oct. 25, 2011) ==<br />
<br />
When performing model selection, overfitting is a common issue. As model complexity increases, there comes a point where the model becomes worse and worse at fitting new data even though it fits the training data better; it becomes too sensitive to small perturbations in the training data that should be treated as noise. In this section we will show that the training error (the empirical error on the training data) is a poor estimator of the true error, and that minimizing the training error increases complexity and results in overfitting. We will show that the test error (the empirical error on the test data) is a better estimator of the true error. This will be done by estimating a model <math> \hat f </math> given training data <math> T=\{(x_i,y_i)\}^n_{i=1}</math>.<br />
<br />
<br />
First, some notation is defined. <br />
<br />
The assumption for the training data set is that it consists of the true model values <math>\ f(x_i) </math> plus some additive Gaussian noise <math>\ \epsilon_i </math>:<br />
<br />
<math>\ y_i = f(x_i)+\epsilon_i</math> where <math>\ \epsilon \sim N(0,\sigma^2)</math><br />
<br />
<math>\ y_i = true\,model + noise</math><br />
<br />
===Important Notation===<br />
<br />
Let:<br />
*<math>\displaystyle f(x)</math> denote the ''true model''.<br />
*<math>\hat f(x)</math> denote the ''prediction/estimated model'', which is generated from a training data set <math>\displaystyle T = \{(x_i, y_i)\}^n_{i=1}</math> whose observations are noisy.<br /><br />
Remark: <math>\hat f(x_i) = \hat y_i</math>.<br /><br />
*<math>\displaystyle err</math> denote the ''empirical error'', computed from actual data points as <math>\sum (y-\hat{y})^2 </math>. This can be either the test error or the training error, depending on the data points used.<br />
*<math>\displaystyle Err </math> denote the ''true error'' or ''generalization error'', based on <math>\sum (f-\hat{f})^2 </math>; this is what we are trying to minimize.<br />
*<math>\displaystyle MSE=E[(\hat f(x)-f(x))^2]</math> denote the ''mean squared error''.<br />
<br />
We use the training data <math>T=\{(x_i,y_i)\}_{i=1}^n</math> to estimate our model parameters.<br />
<br />
<br />
For a given point <math>y_0</math>, the expectation of the empirical error is:<br />
<br />
<math> \begin{align}<br />
<br />
E[(\hat{y_0}- y_0)^2] &= E[(\hat{f_0}- f_0 -\epsilon_0)^2] \\<br />
&=E[(\hat{f_0}-f_0)^2 + \epsilon_0^2 - 2 \epsilon_0 (\hat{f_0}-f_0)] \\<br />
&=E[(\hat{f_0}-f_0)^2] + E[\epsilon_0^2] - 2 E [ \epsilon_0 (\hat{f_0}-f_0)] \\<br />
&=E[(\hat{f_0}-f_0)^2] + \sigma^2 - 2 E [ \epsilon_0 (\hat{f_0}-f_0)] <br />
\end{align}<br />
</math><br />
<br />
This formula partitions the empirical error into the true error and other error terms. Our goal is to select the model that minimizes the true error, so we must understand the effects of these other terms if we are to use the empirical error as an estimate of the true error. <br />
<br />
The first term is essentially the true error. The second term is a constant. The third term is problematic, since in general its expectation is not 0. We consider two cases to simplify the third term.<br />
<br />
=====Case 1: Estimating Error using Data Points from Test Set=====<br />
In Case 1, the empirical error is test error and the data points used to calculate test error are from the test set, not the training set. That is, <math>y_0 \notin T </math>.<br />
<br />
We can rewrite the third term in the following way, since both <math>y_0</math> and <math>\hat{f_0}</math> have expectation <math>f_0</math>, the true value, which is a constant and not random.<br />
<br />
<math> \begin{align} <br />
E [ \epsilon_0 (\hat{f_0}-f_0)] &= E [ (y_0-f_0) (\hat{f_0}-f_0)] \\<br />
& = cov{(y_0,\hat{f_0})}<br />
\end{align}<br />
</math><br />
<br />
(This expectation is a covariance because both factors are centred: <math>\displaystyle y_0</math> and <math>\hat f_0</math> both have mean <math>\displaystyle f_0</math>, as noted above.)<br />
<br />
Since <math>y_0</math> is not part of the training set, it is independent of the model <math>\hat{f_0}</math> generated by the training set. Therefore,<br />
<br />
<math>y_0 \notin T \to y_0 \perp \hat{f} </math><br />
<br />
<math>\ cov{(y_0,\hat{f}_0)}=0</math><br />
<br />
<br />
The equation for the expectation of empirical error simplifies to the following:<br />
<br />
<math>E[(y_0-\hat{y_0})^2] = E[(f_0-\hat{f_0})^2] + \sigma^2 </math><br />
<br />
<br />
This result applies to every output value in the test data set, so we can generalize this equation by summing over all m data points that have NOT been seen by the model:<br />
<br />
<math>\begin{align}<br />
\sum_{i=1}^m{(y_i-\hat{y_i})^2} &= \sum_{i=1}^m{(f_i-\hat{f_i})^2} + m \sigma^2 \\<br />
err &= Err + m \sigma^2 \\<br />
& = Err + constant\\<br />
\end{align}<br />
</math><br />
<br />
Rearranging to solve for true error, we get<br />
<br />
<math>\ Err = err - m \sigma^2</math><br />
<br />
We see that the test error is a good estimator of the true error up to an additive constant, since the two differ only by <math>\ m \sigma^2</math>. Minimizing the test error is therefore equivalent to minimizing the true error, and no term rewards unnecessary complexity. (Note also that the true error is smaller than the empirical error.) This is the justification for cross-validation.<br />
<br />
To avoid over-fitting or under-fitting when using cross-validation, the validation data set is selected so that it is independent of the estimated model.<br />
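The identity above is easy to check numerically. Below is a minimal Monte Carlo sketch (with a hypothetical true model <math>f(x)=\sin(2\pi x)</math>, a polynomial fit, and arbitrary sizes): averaged over many replications, the summed test error should match the summed true error plus <math>m\sigma^2</math>.<br />

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.3
n, m, reps = 50, 50, 2000      # training size, test size, replications

def f(x):                      # hypothetical true model
    return np.sin(2 * np.pi * x)

x_tr = np.linspace(0, 1, n)
x_te = rng.uniform(0, 1, m)
Phi_tr = np.vander(x_tr, 6)    # degree-5 polynomial features
Phi_te = np.vander(x_te, 6)

test_err, true_err = 0.0, 0.0
for _ in range(reps):
    y_tr = f(x_tr) + rng.normal(0, sigma, n)   # training noise
    y_te = f(x_te) + rng.normal(0, sigma, m)   # independent test noise
    w, *_ = np.linalg.lstsq(Phi_tr, y_tr, rcond=None)
    yhat_te = Phi_te @ w
    test_err += np.sum((y_te - yhat_te) ** 2)  # err on test points
    true_err += np.sum((f(x_te) - yhat_te) ** 2)  # Err (known here)

test_err /= reps
true_err /= reps
# E[err] = Err + m * sigma^2 for points unseen in training
print(test_err, true_err + m * sigma ** 2)
```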
<br />
=====Case 2: Estimating Error using Data Points from Training Set=====<br />
<br />
In Case 2, the data points used to calculate error are from the training set, so <math>\ y_0 \in T </math>, i.e. <math>\ (x_i, y_i)</math> is in the training set. We will show that this results in a worse estimator for true error.<br />
<br />
Now <math>\ y_0</math> has been used to estimate <math>\ \hat{f}</math> so they are not independent. We use [http://en.wikipedia.org/wiki/Stein's_lemma Stein's lemma] to simplify the term <math>\ E[\epsilon_0 (\hat{f_0} - f_0)]</math>.<br />
<br />
Stein's Lemma states that if <math>\ x \sim N(\theta,\sigma^2)</math> and <math>\ g(x)</math> is differentiable, then <br />
<br />
<math>E\left[g(x) (x - \theta)\right] = \sigma^2 E \left[ \frac{\partial g(x)}{\partial x} \right] </math><br />
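Stein's Lemma is easy to verify by simulation. The sketch below uses an arbitrary differentiable choice <math>g(x)=x^3</math>, so <math>g'(x)=3x^2</math>:<br />

```python
import numpy as np

# Numerical check of Stein's lemma: E[g(x)(x - theta)] = sigma^2 E[g'(x)]
# for x ~ N(theta, sigma^2).  g(x) = x^3 is an arbitrary choice.
rng = np.random.default_rng(1)
theta, sigma = 1.0, 0.5
x = rng.normal(theta, sigma, 1_000_000)

lhs = np.mean(x**3 * (x - theta))       # E[g(x)(x - theta)]
rhs = sigma**2 * np.mean(3 * x**2)      # sigma^2 E[g'(x)]
print(lhs, rhs)
```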
<br />
Substitute <math>\ \epsilon_0</math> for <math>\ x</math> and <math>\ (\hat{f_0}-f_0)</math> for <math>\ g(x)</math>. Note that <math>\ \hat{f_0}</math> is a function of the noise, since as noise changes, <math>\hat{f_0}</math> will change. Using Stein's Lemma, we get:<br />
<br />
<math><br />
\begin{align}<br />
E[\epsilon_0 (\hat{f_0}-f_0)] &= \sigma^2 E \left[ \frac{\partial (\hat{f_0}-f_0)}{\partial \epsilon_0} \right]\\<br />
&=\sigma^2 E\left[\frac{\partial \hat{f_0}}{\partial \epsilon_0}\right]\\<br />
&=\sigma^2 E\left[\frac{\partial \hat{f_0}}{\partial y_0}\right]\\<br />
&=\sigma^2 E\left[D_0\right]<br />
\end{align}<br />
</math><br />
<br />
<br />
Remark: Since <math>\ y_0 = f_0 + \epsilon_0</math> and <math>\ f_0</math> is a constant, <math> \frac{\partial y_0}{\partial \epsilon_0} = 1 </math>, so <math> \frac{\partial \hat{f_0}}{\partial \epsilon_0} = \frac{\partial \hat{f_0}}{\partial y_0} \frac{\partial y_0}{\partial \epsilon_0} = \frac{\partial \hat{f_0}}{\partial y_0} </math>.<br /><br />
<br />
Also, <math> \frac{\partial f_0}{\partial \epsilon_0} = 0 </math>, since <math>\ f_0</math> is a constant rather than a function of the noise; this is why <math> \frac{\partial (\hat{f_0}-f_0)}{\partial \epsilon_0} = \frac{\partial \hat{f_0}}{\partial \epsilon_0} </math>.<br />
<br />
<br />
We take <math>\ D_0 = \frac{\partial \hat{f_0}}{\partial y_0}</math>, where <math>\ D_0</math> represents the derivative of the fitted model with respect to the observations. The equation for the expectation of empirical error becomes:<br />
<br />
<math>E[(y_0-\hat{y_0})^2] = E[(f_0-\hat{f_0})^2] + \sigma^2 - 2 \sigma^2 E[D_0] </math><br />
<br />
Generalizing the equation for all n data points in the training set:<br />
<br />
<math><br />
\sum_{i=1}^n{(y_i-\hat{y_i})^2} = \sum_{i=1}^n{(f_i-\hat{f_i})^2} + n \sigma^2 - 2 \sigma^2 \sum_{i=1}^n{D_i}<br />
</math><br />
<br />
Based on the notation defined above, we then have:<br />
<br />
<math><br />
err = Err + n \sigma^2 - 2 \sigma^2 \sum_{i=1}^n{D_i}<br />
</math><br />
<br />
<math>Err = err - n \sigma^2 + 2 \sigma^2 \sum_{i=1}^n{D_i}</math><br />
<br />
This equation for the true error is called [http://www.reference.com/browse/Stein%27s+unbiased+risk+estimate Stein's unbiased risk estimator (SURE)]. It is an unbiased estimator of the mean-squared error of a given estimator, in a deterministic estimation scenario. In other words, it provides an indication of the accuracy of a given estimator. This is important since, in deterministic estimation, the true mean-squared error of an estimator generally depends on the value of the unknown parameter and thus cannot be determined completely. <br />
<br />
Note that <math>\ D_i</math> depends on complexity of the model. It measures how sensitive the model is to small perturbations in a single <math>\ y_i</math> in the training set. As complexity increases, the model will try to chase every little change and will be more sensitive to such perturbations. Minimizing training error without accounting for the impact of this term will result in overfitting. Thus, we need to know how to find <math>\ D_i</math>. Below we show an example, applying SURE to RBFs, where computing <math>\ D_i</math> is straightforward.<br />
<br />
=== SURE for RBF Network Complexity Control===<br />
Problem: Assuming we want to fit our data using a radial basis function network, how many radial basis functions should be used? The network size must balance approximation quality, which usually improves as the network grows, against the training effort, which increases with the network size. Moreover, overly complex models can generalize poorly (overfitting), which favours small networks. Furthermore, in terms of hardware or software realization, smaller networks occupy less area due to reduced memory needs. Hence, controlling the network size is a major task during training. For further information about RBF network complexity control, see [http://www.dice.ucl.ac.be/Proceedings/esann/esannpdf/es2007-13.pdf]<br />
<br />
We can use Stein's unbiased risk estimator (SURE) to give us an approximation for how many RBFs to use.<br />
<br />
The SURE equation is<br />
<br />
<math>\mbox{Err}=\mbox{err} - n\sigma^2 + 2\sigma^2\sum_{i=1}^n D_i</math><br />
<br />
where <math>\ Err </math> is the true error, <math>\ err </math> is the empirical error, <math>\ n</math> is the number of training samples, <math>\ \sigma^2</math> is the variance of the noise of the training samples, and <math>\ D_i</math> is the derivative of the model output with respect to the observed output, as shown below<br />
<br />
<math>D_i=\frac{\partial \hat{f_i}}{\partial y_i}</math><br />
<br />
Optimal Number of Basis Functions in an RBF Network<br />
<br />
The number of basis functions should be chosen so as to minimize the estimated true error <math>\ Err </math>.<br />
<br />
The formula for an RBF network is:<br />
<br />
<math>\hat{f}=\Phi W</math><br />
<br />
where <math>\ \hat{f}</math> is the vector of RBFN outputs for the training samples, <math>\ \Phi</math> is the matrix of neuron outputs for each training sample, and <math>\ W</math> is the weight vector between each neuron and the output. Suppose we have m + 1 neurons in the network, one of which computes the constant function (the intercept).<br />
<br />
Given the training labels <math>\ Y</math> we define the empirical error and minimize it<br />
<br />
<math>\underset{W}{\mbox{min}} |Y-\Phi W|^2</math><br />
<br />
<math>\, W=(\Phi^T \Phi)^{-1} \Phi^T Y</math><br />
<br />
<math>\hat{f}=\Phi(\Phi^T \Phi)^{-1} \Phi^T Y</math><br />
<br />
<br />
For simplification let <math>\ H</math> be the ''hat matrix'' defined as<br />
<br />
<math>\, H=\Phi(\Phi^T \Phi)^{-1} \Phi^T</math><br />
<br />
Our optimal output then becomes<br />
<br />
<math>\hat{f}=H Y</math><br />
<br />
It remains to calculate <math>\ D_i</math> for this model. Based on SURE, the optimum number of basis functions should be assigned so that the estimated true error <math>\displaystyle Err</math> is minimized. Setting <math>\frac{\partial err}{\partial W}</math> equal to zero gives the least squares solution <math>\ W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math>, so that the fitted values are <math>\hat{Y} = \hat{f} = \Phi W = HY</math>, with the hat matrix <math>\ H = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}</math> as defined above.<br />
<br />
<br />
Consider a single fitted value of the network. In this case we can write:<br />
<math>\hat f_i=\,H_{i1}y_1+\,H_{i2}y_2+\cdots+\,H_{ii}y_i+\cdots+\,H_{in}y_n</math>.<br />
<br />
Note here that <math>\,H</math> depends on the input vector <math>\displaystyle x_i</math> but not on the observation <math>\displaystyle y_i</math>. <br />
<br />
Since <math>\hat f_i=[HY]_i=\sum_{j}\,H_{ij}\,y_j</math>, taking the derivative of <math>\ \hat f_i</math> with respect to <math>\displaystyle y_i</math> readily gives:<br />
<br />
<math>D_i= \frac{\partial \hat f_i}{\partial y_i}= \,H_{ii}</math> , and hence <math>\sum_{i=1}^n \frac {\partial \hat f_i}{\partial y_i}=\sum_{i=1}^n \,H_{ii}</math><br />
<br />
<br />
Here we recall that <math>\sum_{i=1}^n\,D_{i}= \sum_{i=1}^n \,H_{ii}= \,Trace(H)</math>, the sum of the diagonal elements of <math>\,H</math>. Using the permutation property of the trace function we can further simplify the expression as follows:<br />
<math>\,Trace(H)= Trace(\Phi(\Phi^{T}\Phi)^{-1}\Phi^{T})= Trace(\Phi^{T}\Phi(\Phi^{T}\Phi)^{-1})=m</math>, by the trace cyclical permutation property, where <math>\displaystyle m</math> is the number of basis functions in the RBF network (and hence <math>\displaystyle \Phi</math> has dimension <math>\displaystyle n \times m</math>).<br><br />
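The trace identity can be checked numerically for a random design matrix (a hypothetical example with arbitrary dimensions):<br />

```python
import numpy as np

# Check Trace(H) = m for the hat matrix H = Phi (Phi^T Phi)^{-1} Phi^T
# using a random full-rank n x m design matrix.
rng = np.random.default_rng(2)
n, m = 40, 7
Phi = rng.normal(size=(n, m))

H = Phi @ np.linalg.inv(Phi.T @ Phi) @ Phi.T
print(np.trace(H))   # ~ m = 7 up to floating-point error
```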
<br />
====Sketch of Trace Cyclical Property Proof:====<br />
For <math>\, A_{mn}, B_{nm}, Tr(AB) = \sum_{i=1}^{n}\sum_{j=1}^{m}A_{ij}B_{ji} = \sum_{j=1}^{m}\sum_{i=1}^{n}B_{ji}A_{ij} = Tr(BA)</math>.<br><br />
With that in mind, for <math>\, A_{nn}, B_{nn} = CD, Tr(AB) = Tr(ACD) = Tr(BA)</math> (from above) <math>\, = Tr(CDA)</math>.<br><br><br />
<br />
Note that <math>\displaystyle \Phi</math> maps the input matrix <math>\,X</math> into the space spanned by the <math>\,m</math> basis functions. Sometimes an extra constant column <math>\displaystyle \Phi_0</math>, with no input dependence, is included to represent the intercept of the fitted model; in that case <math>\,Trace(H)= m+1</math>.<br />
<br />
<br />
The SURE equation then becomes<br />
<br />
<math>\, \mbox{Err}=\mbox{err} - n\sigma^2 + 2\sigma^2(m+1)</math><br />
<br />
As the number of RBFs <math>\ m</math> increases, the empirical error <math>\ err</math> decreases, but the penalty term <math>\ 2\sigma^2(m+1)</math> increases. An optimal true error <math>\ Err </math> can be found by increasing <math>\ m</math> until <math>\ Err </math> begins to grow; at that point the estimate of the minimum true error has been reached.<br />
<br />
The value of m that gives the minimum true error estimate is the optimal number of basis functions to be implemented in the RBF network, and hence is also the optimal degree of complexity of the model. <br />
<br />
One way to estimate the noise variance is<br />
<br />
<math>\hat{\sigma}^2=\frac{\sum (y-\hat{y})^2}{n-1}</math><br />
<br />
This application of SURE is straightforward because minimizing Radial Basis Function error reduces to a simple least squares estimator problem with a linear solution. This makes computing <math>\ D_i</math> quite simple. In general, <math>\ D_i</math> can be much more difficult to solve for.<br />
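The whole procedure can be sketched in a few lines (hypothetical 1-D data with a known noise variance; the true function, the RBF centres and widths, and the candidate range of <math>\ m</math> are all arbitrary choices):<br />

```python
import numpy as np

# Sketch of SURE-based selection of the number of Gaussian RBFs.
rng = np.random.default_rng(3)
n, sigma = 100, 0.2
x = np.linspace(0, 1, n)
y = np.sin(2 * np.pi * x) + rng.normal(0, sigma, n)

def design(x, m, width=0.1):
    """n x (m+1) design matrix: constant column plus m Gaussian RBFs."""
    centers = np.linspace(0, 1, m)
    Phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * width ** 2))
    return np.column_stack([np.ones_like(x), Phi])

sure = {}
for m in range(1, 16):
    Phi = design(x, m)
    W, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # least squares weights
    err = np.sum((y - Phi @ W) ** 2)              # empirical error
    # Err estimate: err - n sigma^2 + 2 sigma^2 (m+1), Trace(H) = m+1
    sure[m] = err - n * sigma ** 2 + 2 * sigma ** 2 * (m + 1)

best_m = min(sure, key=sure.get)                  # m minimizing the estimate
print(best_m)
```

As the notes describe, the empirical error alone would keep decreasing with <math>\ m</math>; the penalty term makes the SURE estimate turn back up once extra basis functions stop paying for themselves.<br />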
<br />
=== RBF Network Complexity Control (Alternate Approach) ===<br />
<br />
An alternate approach (not covered in class) to tackling RBF Network complexity control is controlling the complexity by similarity <ref name="Eickhoff">R. Eickhoff and U. Rueckert, "Controlling complexity of RBF networks by similarity," ''Proceedings of European Symposium on Artificial Neural Networks'', 2007</ref>. In <ref name="Eickhoff" />, the authors suggest looking at the similarity between the basis functions multiplied by their weight by determining the cross-correlations between the functions. The cross-correlation is calculated as follows:<br />
<br />
<math>\ \rho_{ij} = \frac{E[g_i(x)g_j(x)]}{\sqrt{E[g^2_i(x)]\,E[g^2_j(x)]}} </math><br />
<br />
where <math>\ E[] </math> denotes the expectation and <math>\ g_i(x) </math> and <math>\ g_j(x) </math> would denote two of the basis functions multiplied by their respective weights.<br />
<br />
If the cross-correlation between two functions is high, <ref name="Eickhoff" /> suggests that the two basis functions be replaced with one basis function that covers the same region of both basis functions and that the corresponding weight of this new basis function be the average of the weights of the two basis functions. For the case of Gaussian radial basis functions, the equations for finding the new weight (<math>\ w_{new} </math>), mean (<math>\ c_{new} </math>) and variance (<math>\ \sigma_{new} </math>) are as follows:<br />
<br />
<math>\ w_{new} = \frac{w_i + w_j}{2} </math><br />
<br />
<math>\ c_{new} = \frac{1}{w_i \sigma^n_i + w_j \sigma^n_j}(w_i \sigma^n_i c_i + w_j \sigma^n_j c_j)</math><br />
<br />
<math>\ \sigma^2_{new} = \left(\frac{\sigma_i + \sigma_j}{2}+ \frac{min(||m-c_i||,||m-c_j||)}{2}\right)^2</math><br />
<br />
where <math>\ n </math> denotes the input dimension and <math>\ m </math> denotes the total number of radial basis functions.<br />
<br />
This process is repeated until the cross-correlation between the basis functions falls below a certain threshold, which is a tunable parameter. <br />
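The merge step can be transcribed directly from the formulas above. This is only a sketch: the symbol <math>\ m</math> inside the width formula is treated here as a reference point supplied by the caller, since its role is ambiguous in the text.<br />

```python
import numpy as np

def merge_gaussians(w_i, c_i, s_i, w_j, c_j, s_j, n, m_ref):
    """Merge two weighted Gaussian RBFs per the formulas above.
    n is the input dimension; m_ref is the reference point appearing
    in the width formula (its interpretation is an assumption here)."""
    w_new = (w_i + w_j) / 2
    a, b = w_i * s_i ** n, w_j * s_j ** n
    c_new = (a * c_i + b * c_j) / (a + b)          # weighted centre
    s2_new = ((s_i + s_j) / 2
              + min(np.linalg.norm(m_ref - c_i),
                    np.linalg.norm(m_ref - c_j)) / 2) ** 2
    return w_new, c_new, s2_new

# Sanity check: merging two identical basis functions returns them unchanged.
w, c, s2 = merge_gaussians(1.0, np.array([0.5]), 0.2,
                           1.0, np.array([0.5]), 0.2, n=1,
                           m_ref=np.array([0.5]))
print(w, c, s2)
```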
<br />
Note 1) Though not extensively discussed in <ref name="Eickhoff" />, this approach to RBF Network complexity control presumably requires a starting RBF Network with a large number of basis functions.<br />
<br />
Note 2) This approach does not require the repeated implementation of differently sized RBF Networks to determine the empirical error, unlike the approach using SURE. However, the SURE approach is backed up by theory to find the number of radial basis functions that optimizes the true error and does not rely on some tunable threshold. It would be interesting to compare the results of both approaches (in terms of the resulting RBF Network obtained and the test error).<br />
<br />
<br />
===Generalized SURE for Exponential Families===<br />
As presented above, Stein’s unbiased risk estimate (SURE) applies to the independent, identically distributed (i.i.d.) Gaussian model. In recent work, researchers have derived a SURE counterpart for general exponential families and applied it to regularized estimators, extending the technique to a much wider class of models. <br />
<br />
You may look at Yonina C. Eldar, Generalized SURE for Exponential Families: Applications to Regularization, IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 57, NO. 2, FEBRUARY 2009 for more information.<br />
<br />
===Further Reading===<br />
Fully Tuned Radial Basis Function Neural Networks for Flight Control<br />
<ref><br />
http://www.springer.com/physics/complexity/book/978-0-7923-7518-0;jsessionid=985F21372AC7AE1B654F1EADD11B296F.node3<br />
</ref><br />
<br />
Paper about the RBFN for multi-task learning <ref>http://books.nips.cc/papers/files/nips18/NIPS2005_0628.pdf</ref><br />
<br />
Radial Basis Function (RBF) Networks <ref>http://documents.wolfram.com/applications/neuralnetworks/index6.html</ref> <br />
<br />
An Example of RBF Networks <ref>http://reference.wolfram.com/applications/neuralnetworks/ApplicationExamples/12.1.2.html</ref><br />
<br />
This paper suggests an objective approach in determining proper samples to find good RBF networks with respect to accuracy <ref>http://www.wseas.us/e-library/conferences/2009/hangzhou/MUSP/MUSP41.pdf</ref>.<br />
<br />
== Support Vector Machines (Lecture: Oct. 27, 2011) ==<br />
<br />
[[Image:SVM.png|right|thumb|A series of linear classifiers, H2 represents a SVM, where the SVM attempts to maximize the margin, the distance between the closest point in each data set and the linear classifier.]]<br />
<br />
[http://en.wikipedia.org/wiki/Support_vector_machine Support vector machines] (SVMs), also referred to as max-margin classifiers, are learning systems that use a hypothesis space of linear functions in a high dimensional feature space, trained with a learning algorithm from optimization theory that implements a learning bias derived from statistical learning theory. SVMs are kernel machines based on the principle of structural risk minimization, which are used in applications of regression and classification; however, they are mostly used as binary classifiers. Although the subject can be said to have started in the late seventies (Vapnik, 1979), it has recently been receiving increasing attention from researchers. It is such a powerful method that, in the few years since its introduction, it has outperformed most other systems in a wide variety of applications, especially in pattern recognition.<br />
<br />
The current standard incarnation of SVM is known as "soft margin" and was proposed by Corinna Cortes and Vladimir Vapnik [http://en.wikipedia.org/wiki/Vladimir_Vapnik]. In practice the data is not usually linearly separable. Although theoretically we can make the data linearly separable by mapping it into higher dimensions, the issues of how to obtain the mapping and how to avoid overfitting are still of concern. A more practical approach to classifying non-linearly separable data is to add some error tolerance to the separating hyperplane between the two classes, meaning that a data point in class A can cross the separating hyperplane into class B by a certain specified distance. This more generalized version of SVM is the so-called "soft margin" support vector machine and is generally accepted as the standard form of SVM over the hard margin case in practice today. [http://en.wikipedia.org/wiki/Support_vector_machine#Soft_margin]<br />
<br />
Support Vector Machines are motivated by the idea of training linear machines with margins. It involves preprocessing the data to represent patterns in a high dimension (generally much higher than the original feature space). Note that using a suitable non-linear mapping to a sufficiently high dimensional space, the data will always be separable. (p. 263) <ref>[Duda, Richard O., Hart, Peter E., Stork, David G. "Pattern Classification". Second Edition. John Wiley & Sons, 2001.]</ref><br />
<br />
A suitable way to describe the interest in SVM can be seen in the following quote. "The problem which drove the initial development of SVMs occurs in several guises - the bias variance tradeoff (Geman, Bienenstock and Doursat, 1992), capacity control (Guyon et al., 1992), overfitting (Montgomery and Peck, 1992) - but the basic idea is the same. Roughly speaking, for a given learning task, with a given finite amount of training data, the best generalization performance will be achieved if the right balance is struck between the accuracy attained on that particular training set, and the “capacity” of the machine, that is, the ability of the machine to learn any training set without error. A machine with too much capacity is like a botanist with a photographic memory who, when presented with a new tree, concludes that it is not a tree because it has a different number of leaves from anything she has seen before; a machine with too little capacity is like the botanist’s lazy brother, who declares that if it’s green, it’s a tree. Neither can generalize well. The exploration and formalization of these concepts has resulted in one of the shining peaks of the theory of statistical learning (Vapnik, 1979)." [http://research.microsoft.com/pubs/67119/svmtutorial.pdf A Tutorial on Support Vector Machines for Pattern Recognition]<br />
<br />
===== Support Vector Method Solving Real-world Problems=====<br />
<br />
No matter whether the training data are linearly-separable or not, the linear boundary produced by any of the versions of SVM is calculated using only a small fraction of the training data rather than using all of the training data points. This is much like the difference between the median and the mean. <br />
<br />
SVM can also be considered a special case of [http://en.wikipedia.org/wiki/Tikhonov_regularization Tikhonov regularization]. A special property is that they simultaneously minimize the empirical classification error and maximize the geometric margin; hence they are also known as maximum margin classifiers. The key features of SVM are the use of kernels, the absence of local minima, the sparseness of the solution (i.e. few training data points are needed to construct the linear decision boundary) and the capacity control obtained by optimizing the margin.(Shawe-Taylor and Cristianini (2004)). <br />
<br />
Another key feature of SVM, as discussed below, is the use of [http://en.wikipedia.org/wiki/Slack_variable slack variables] to control the amount of tolerable misclassification on the training data, which form the soft margin SVM. This key feature can serve to improve the generalization of SVM to new data. SVM has been used successfully in many real-world problems:<br />
<br />
- Pattern Recognition, such as Face Detection , Face Verification, Object Recognition, Handwritten Character/Digit Recognition, Speaker/Speech Recognition, Image Retrieval , Prediction;<br />
<br />
- Text and Hypertext categorization;<br />
<br />
- Image classification;<br />
<br />
- Bioinformatics, such as Protein classification, Cancer classification;<br />
<br />
Please refer to [http://www.clopinet.com/isabelle/Projects/SVM/applist.html here] for more applications.<br />
<br />
===== Structural Risk Minimization and VC Dimension =====<br />
<br />
Linear learning machines are the fundamental formulations of SVMs. The objective of the linear learning machine is to find the linear function that minimizes the generalization error from a set of functions which can approximate the underlying mapping between the input and output data. Consider a learning machine that implements linear functions in the plane as decision rules<br />
<br />
<math>f(\mathbf{x},\boldsymbol{\beta}, \beta_0)=sign (\boldsymbol{\beta}^T\mathbf{x}+\beta_0)</math><br />
<br />
<br />
With ''n'' given training data with input values <math>\mathbf{x}_i \in \mathbb{R}^d</math> and output values <math>y_i\in\{-1,+1\}</math>. The empirical error is defined as<br />
<br />
<math>\Re_{emp} (\boldsymbol{\theta}) = \frac{1}{n}\sum_{i=1}^n |y_i-f(\mathbf{x}_i,\boldsymbol{\beta}, \beta_0)|= \frac{1}{n}\sum_{i=1}^n |y_i-sign (\boldsymbol{\beta}^T\mathbf{x}_i+\beta_0)|</math><br />
<br />
<br />
where <math>\boldsymbol{\theta}=(\mathbf{x},\boldsymbol{\beta})</math><br />
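As a quick sketch (hypothetical data): since <math>y_i\in\{-1,+1\}</math>, each term <math>|y_i-sign (\boldsymbol{\beta}^T\mathbf{x}_i+\beta_0)|</math> is either 0 or 2, so <math>\Re_{emp}</math> is twice the misclassification rate.<br />

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 200, 2
X = rng.normal(size=(n, d))
beta, beta0 = np.array([1.0, -1.0]), 0.1
y = np.sign(X @ beta + beta0)           # noiseless labels in {-1, +1}

def emp_error(X, y, beta, beta0):
    # (1/n) sum |y_i - sign(beta^T x_i + beta_0)| -- each term is 0 or 2,
    # so this equals twice the misclassification rate
    return np.mean(np.abs(y - np.sign(X @ beta + beta0)))

print(emp_error(X, y, beta, beta0))     # the true (beta, beta0) gives 0
```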
<br />
The generalization error can be expressed as<br />
<br />
<math> \Re (\boldsymbol{\theta}) = \int|y-f(\mathbf{x},\boldsymbol{\theta})|p(\mathbf{x},y)dxdy</math><br />
<br />
which measures the error for all input/output patterns that are generated from the underlying generator of the data characterized by the probability distribution <math>p(\mathbf{x},y)</math> which is considered to be unknown.<br />
According to statistical learning theory, the generalization (test) error can be upper bounded in terms of training error and a confidence term as shown in<br />
<br />
<math>\Re (\boldsymbol{\theta})\leq \Re_{emp} (\boldsymbol{\theta}) +\sqrt{\frac{h(ln(2n/h)+1)-ln(\eta/4)}{n}}</math><br />
<br />
<br />
The term on the left side represents the generalization error. The first term on the right-hand side is the empirical error calculated from the training data, and the second term is called the ''VC confidence'', which is associated with the ''VC dimension'' h of the learning machine. [http://en.wikipedia.org/wiki/Vc_dimension VC dimension] is used to describe the complexity of the learning system. The relationship between these three items is illustrated in the figure below:<br />
<br />
<br />
[[File:risk.png|400px|thumb|centre| The relation between expected risk, empirical risk and VC confidence in SVMs.]]<br />
<br />
<br />
Thus, even though we don’t know the underlying distribution based on which the data points are generated, it is possible to minimize the upper bound of the generalization error in place of minimizing the generalization error. That means one can minimize the expression in the right hand side of the inequality above.<br />
<br />
Unlike the principle of Empirical Risk Minimization (ERM) applied in Neural Networks which aims to minimize the training error, SVMs implement Structural Risk Minimization (SRM) in their formulations. SRM principle takes both the training error and the complexity of the model into account and intends to find the minimum of the sum of these two terms as a trade-off solution (as shown in figure above) by searching a nested set of functions of increasing complexity.<br />
<br />
=====Introduction=====<br />
<br />
[http://en.wikipedia.org/wiki/Support_vector_machine Support Vector Machine] is a popular linear classifier. Suppose that we have a data set with two classes that can be separated by a hyper-plane. The Support Vector Machine (SVM) is a method that gives us the "best" such hyper-plane. Other classifiers, such as the Perceptron, also find a separating hyper-plane; however, the output of the Perceptron and many other algorithms depends on the input parameters, so every run of Perceptron can give a different output. The SVM, in contrast, finds the separating hyper-plane that is farthest from the data points. This is also known as the Max-Margin hyper-plane.<br />
<br />
<br />
<gallery><br />
Image:KwebsterIntroDiagram.png|Infinitely many Perceptron solutions<br />
Image:CorrectChoice.png|Out of many how do we choose?<br />
</gallery><br />
<br />
<br />
With Perceptron, there can be infinitely many separating hyperplanes such that the training error will be zero. The question is which of these possible solutions is the best. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training data points of any class (the so-called functional margin), since in general the larger the margin, the lower the generalization error of the classifier. This makes sense because at test time more points will be observed, and some may lie closer to the other class, so the safest choice of hyperplane is the one farthest from both classes.<br />
<br />
One of the great things about SVM is that not only does it have solid theoretical guarantees, but it also works very well in practice. <br />
<br />
'''To summarize'''<br />
<br />
[[Image:Margin.png|right|thumb|What we mean by margin is the distance between the hyperplane and the closest point in a class.]]<br />
<br />
If the data is linearly separable, then there exist infinitely many separating hyperplanes, and we must choose the best among them. The best choice is the hyperplane that is furthest from both classes; that is, our goal is to find, among all possible hyperplanes, the one with maximum margin. If such a hyperplane exists, it is known as the maximum-margin hyperplane, and the linear classifier it defines is known as a maximum margin classifier, or equivalently, the perceptron of optimal stability.<br />
<br />
What we mean by margin is the distance between the hyperplane and the closest point in a class.<br />
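The margin of a candidate separating hyperplane is <math>\min_i y_i(\boldsymbol\beta^T\mathbf{x}_i+\beta_0)/\|\boldsymbol\beta\|</math>. Comparing two hand-picked hyperplanes on toy data (a hypothetical example) illustrates why the max-margin choice is preferred:<br />

```python
import numpy as np

# Two classes in the plane; both hyperplanes below separate them,
# but one leaves more room to the closest points.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

def margin(beta, beta0):
    # signed distances of all points from the hyperplane
    d = y * (X @ beta + beta0) / np.linalg.norm(beta)
    return d.min() if (d > 0).all() else -np.inf   # -inf if not separating

b1 = margin(np.array([1.0, 1.0]), 0.0)   # diagonal separating hyperplane
b2 = margin(np.array([1.0, 0.0]), 0.0)   # vertical separating hyperplane
print(b1, b2)
```

Both margins are positive (both hyperplanes separate the data), but the first is larger, so the max-margin criterion prefers it.<br />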
<br />
<!--<br />
If the mean value were to be used instead of the closest point, then an outlier may pull the hyperplane into the data which would incorrectly classify the known data points<br />
<gallery><br />
Image:NotMean.png|This is the reason why we use the closest point instead of the expected value.<br />
</gallery><br />
--><br />
[[Image:NotMean.png|right|thumb|If the mean value were to be used instead of the closest point, then an outlier may pull the hyperplane into the data which would incorrectly classify the known data points. This is the reason why we use the closest point instead of the expected value.]]<br />
<br />
===== Setting=====<br />
<br />
[[Image:Thedis.png|right|thumb|What is <math> d_i </math>]]<br />
<br />
* We assume that the data is linearly separable<br />
* Our classifier will be of the form <math> \boldsymbol\beta^T\mathbf{x} + \beta_0 </math><br />
* We will assume that our labels are <math> y_i \in \{-1,1\} </math><br />
<br />
<br />
<br />
The goal is to classify the point <math> \mathbf{x_i} </math> based on the <math>sign \{d_i\}</math> where <math>d_i</math> is the signed distance between <math> \mathbf{x_i}</math> and the hyperplane.<br />
<br />
<!-- Comments --><br />
<!--<br />
<gallery><br />
Image:Thedis.png|What is <math> d_i </math><br />
</gallery><br />
--><br />
<br />
Now we check how far the point is from the hyperplane: points on one side of the hyperplane get a negative signed distance and points on the other side a positive one. A point is classified by the sign of this distance, so <math>\mathbf{x_i}</math> is classified using the sign of <math>d_i</math>.<br />
<br />
===Side Note: A memory from the past of Dr. Ali Ghodsi===<br />
When the aforementioned professor was a small child in grade 2, he was often careless with the accuracy of certain curly brackets when writing what one can only assume were math proofs. One day, his teacher grew impatient and demanded that a page of perfect curly brackets be produced by the young Dr. (he may or may not have been a doctor at the time). And now, whenever Dr. Ghodsi writes a tidy curly bracket, he is reminded of this and it always brings a smile to his face. <br />
<br />
From memories of the past.<br />
<br />
(the number 20 was involved in the story, either the number of pages or the number of lines)<br />
<br />
===== Case 1: Linearly Separable (Hard Margin) =====<br />
<br />
In this case, the classifier will be <math>\boldsymbol {\beta^T} \boldsymbol {x} + \beta_0 </math> and <math>\ y \in \{-1, 1\} </math>.<br />
The point <math>\boldsymbol {x_i}</math> is classified based on the sign of <math>\ d_i</math>, where <math>\ d_i </math> is the signed distance between <math>\boldsymbol {x_i}</math> and the hyperplane.<br />
<br />
===== Objective Function =====<br />
[[Image:X1X2perpBeta.png|right|thumb|<math>\boldsymbol\beta</math> is perpendicular to the hyperplane]]<br />
'''Observation 1:''' <math>\boldsymbol\beta</math> is orthogonal to the hyper-plane, because for any two arbitrary points <math>\mathbf{x_1, x_2}</math> on the plane we have:<br />
<br />
<math> \boldsymbol\beta^T\mathbf{x_1} + \beta_0 = 0 </math><br />
<br />
<math> \boldsymbol\beta^T\mathbf{x_2} + \beta_0 = 0 </math><br />
<br />
So <math>\boldsymbol\beta^T (\boldsymbol{x_1}-\boldsymbol{x_2}) = 0</math>. Thus, <math> \boldsymbol\beta \perp (\boldsymbol{x_1} - \boldsymbol{x_2}) </math>, which implies that <math>\boldsymbol \beta</math> is a normal vector to the hyper-plane.<br />
<br />
<br />
'''Observation 2:''' If <math>\boldsymbol x_0</math> is a point on the hyper-plane, then there exists a <math>\ \beta_0 </math> such that, <math>\boldsymbol\beta^T\boldsymbol{x_0}+\beta_0 = 0</math>. So <math>\boldsymbol\beta^T\boldsymbol{x_0} = - \beta_0</math>. This along with observation 1 imply there exists a <math>\ \beta_0 </math> such that, <math>\boldsymbol\beta^T\boldsymbol{x} = - \beta_0</math> for all <math> \boldsymbol{x} </math> on the hyperplane.<br />
<br />
<br />
'''Observation 3:''' Let <math>\ d_i</math> be the signed distance of point <math>\boldsymbol{x_i}</math> from the plane. Then <math>\ d_i</math> is the projection of <math>(\boldsymbol{x_i} - \boldsymbol{x_0})</math> onto the direction of <math>\boldsymbol\beta</math>. In other words, <math> d_i \propto \boldsymbol\beta^T(\mathbf{x_i - x_0}) </math> (after normalizing <math>\boldsymbol\beta</math>).<br />
<br />
<math><br />
\begin{align}<br />
\displaystyle d_i &= \frac{\boldsymbol\beta^T(\boldsymbol{x_i} - \boldsymbol{x_0})}{\vert \boldsymbol\beta\vert}\\ <br />
& = \frac{\boldsymbol{\beta^Tx_i}- \boldsymbol{\beta^Tx_0}}{\vert \boldsymbol\beta\vert}\\<br />
& = \frac{\boldsymbol{\beta^Tx_i}+ \beta_0}{\vert \boldsymbol\beta\vert}<br />
\end{align}<br />
</math><br />
<br />
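This signed-distance formula is easy to check numerically. A minimal pure-Python sketch (the hyperplane and the two points are made-up illustrative values):

```python
import math

def signed_distance(beta, beta0, x):
    """Signed distance from point x to the hyperplane beta^T x + beta0 = 0."""
    norm = math.sqrt(sum(b * b for b in beta))
    return (sum(b * xi for b, xi in zip(beta, x)) + beta0) / norm

# Illustrative hyperplane x1 + x2 - 1 = 0 and two points on opposite sides.
beta, beta0 = [1.0, 1.0], -1.0
d_pos = signed_distance(beta, beta0, [2.0, 2.0])  # positive side
d_neg = signed_distance(beta, beta0, [0.0, 0.0])  # negative side

print(d_pos > 0, d_neg < 0)  # the sign of d_i classifies the point
```

The sign of the returned value plays the role of <math>\text{sign}(d_i)</math> in the classification rule above.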
<br />
'''Observation 4:''' Let margin be the distance between the hyper-plane and the closest point. Since <math> d_i </math> is the signed distance between the hyperplane and point <math>\boldsymbol{x_i} </math>, we can define the positive distance of point <math>\boldsymbol{x_i} </math> from the hyper-plane as <math>(y_id_i)</math>.<br />
<br />
<math><br />
\begin{align}<br />
\displaystyle \text{Margin} &= \min\{y_i d_i\}\\<br />
&= \min\{ \frac{y_i(\boldsymbol\beta^T\mathbf{x_i} + \beta_0)}{|\boldsymbol\beta|} \}<br />
\end{align}<br />
</math><br />
<br />
Our goal is to maximize the margin. This is also known as the Max/Min problem in optimization. When defining the hyperplane, what matters is the direction of <math>\boldsymbol\beta</math>; the value of <math>\beta_0</math> does not change the direction of the hyper-plane, it only shifts the plane's distance from the origin. Note that if we assume that no points lie on the hyper-plane, then the margin is positive:<br />
<br />
<math><br />
\begin{align}<br />
\displaystyle &y_i(\boldsymbol\beta^T\mathbf{x_i} + \beta_0) \geq 0 &&\\<br />
&y_i(\boldsymbol\beta^T\mathbf{x_i} + \beta_0) \geq C &&\mbox{ for some positive C } \\<br />
&y_i(\frac{\boldsymbol\beta^T}{C}\mathbf{x_i} + \frac{\beta_0}{C}) \geq 1 &&\mbox{ Divide by C}\\<br />
&y_i(\boldsymbol\beta^{*T}\mathbf{x_i} + \beta^*_0) \geq 1 && \mbox{ By setting }\boldsymbol\beta^* = \frac{\boldsymbol\beta}{C}, \boldsymbol\beta_0^* = \frac{\boldsymbol\beta_0}{C}\\<br />
&y_i(\boldsymbol\beta^{T}\mathbf{x_i} + \beta_0) \geq 1 && \mbox{ By setting }\boldsymbol\beta\gets\boldsymbol\beta^*, \boldsymbol\beta_0\gets\boldsymbol\beta_0^*\\<br />
\end{align}<br />
</math><br />
<br />
<br />
So with a bit of abuse of notation we can assume that<br />
<br />
<math> y_i(\boldsymbol\beta^T\mathbf{x_i} + \beta_0) \geq 1 </math><br />
<br />
Therefore, the problem translates to:<br />
: <math>\, \max\{\frac{1}{||\boldsymbol\beta||}\}</math><br />
<br />
So, it is possible to re-interpret the problem as:<br />
<br />
: <math>\, \min \frac 12 \vert \boldsymbol\beta \vert^2 \quad</math> s.t. <math>\quad \,y_i (\boldsymbol\beta^{T} \boldsymbol{x_i}+ \beta_0) \geq 1 </math><br />
<br />
<math>\, \vert \boldsymbol\beta \vert </math> could be any norm, but for simplicity we use the L2 norm. We use <math>\frac 12 \vert \boldsymbol\beta \vert^2</math> instead of <math>|\boldsymbol\beta|</math> to make the objective differentiable. To solve the above optimization problem we can use '''Lagrange multipliers''', as developed below.<br />
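A candidate pair <math>(\boldsymbol\beta, \beta_0)</math> is feasible for this problem exactly when every training point satisfies the constraint. A small pure-Python feasibility check (the data and the candidate hyperplanes are made-up):

```python
def feasible(beta, beta0, xs, ys):
    """True iff y_i * (beta^T x_i + beta_0) >= 1 for every training point."""
    return all(y * (sum(b * xi for b, xi in zip(beta, x)) + beta0) >= 1.0
               for x, y in zip(xs, ys))

xs = [(2.0, 2.0), (-2.0, -2.0)]
ys = [1.0, -1.0]

print(feasible([0.5, 0.5], 0.0, xs, ys))  # True: margin constraints hold
print(feasible([0.1, 0.1], 0.0, xs, ys))  # False: this beta is too short to satisfy them
```

Among all feasible candidates, the optimization picks the one with the smallest <math>\frac 12 \vert \boldsymbol\beta \vert^2</math>.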
<br />
=====Support Vectors=====<br />
<br />
Support vectors are the training points that determine the optimal separating hyperplane that we seek. Also, they are the most difficult points to classify and at the same time the most informative for classification.<br />
<br />
=====Visualizing the Cost Function=====<br />
Recall the cost function for a single example in the logistic regression model:<br />
<br />
<math>-\left( y \log \frac{1}{1+e^{-\beta^T \boldsymbol{x}}} + (1-y)\log \frac{e^{-\beta^T\boldsymbol{x}}}{1+e^{-\beta^T \boldsymbol{x}}} \right)</math><br />
<br />
where <math>y \in \{0,1\}</math>. Looking at the plot of the cost term (for y=1), if <math>y=1</math> (i.e. the target class is 1), then we want our <math>\beta</math> to be such that <math>\beta^T \boldsymbol{x} \gg 0</math>. This will ensure very accurate classification.<br />
<br />
[[Image:logreg_cost.jpg|450px]]<br />
<br />
Now for SVM, consider the generic cost function as follows:<br />
<br />
<math>-\left( y \cdot \text{cost}_1(\beta^T \boldsymbol{x}) + (1-y)\cdot \text{cost}_0(\beta^T \boldsymbol{x}) \right)</math><br />
<br />
We can visualize <math>\text{cost}_1</math> compared with the sigmoid cost term in logistic regression as follows:<br />
<br />
[[Image:svm_cost.jpg|450px]]<br />
<br />
What you should take away from this is for y=1, we want <math>\beta^T \boldsymbol{x}\ge 1</math>. In our notes, we have <math>y \in \{-1, 1\}</math>, so that's why we write <math>y_i (\beta^T \boldsymbol{x} + \beta_0) \ge 1</math>.<br />
<br />
The same rationale can be applied for y=0, using <math>(1-y)\log \frac{1}{1+e^{-\beta^T \boldsymbol{x}}}</math><br />
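The two cost terms can be compared numerically. A small sketch, assuming the standard logistic cost <math>-\log\sigma(z)</math> for a <math>y=1</math> example and a hinge-style <math>\text{cost}_1(z)=\max(0, 1-z)</math>; the exact <math>\text{cost}_1</math> in the figure may differ by a scaling constant:

```python
import math

def logistic_cost1(z):
    """Logistic cost for a y = 1 example: -log(sigmoid(z))."""
    return -math.log(1.0 / (1.0 + math.exp(-z)))

def svm_cost1(z):
    """Hinge-style surrogate: zero once beta^T x >= 1, linear before that."""
    return max(0.0, 1.0 - z)

for z in [-2, 0, 1, 2]:
    print(z, round(logistic_cost1(z), 4), svm_cost1(z))
# For z >= 1 the SVM cost is exactly 0, while the logistic cost only tends to 0.
```

This makes the takeaway concrete: the SVM cost is satisfied exactly once <math>\beta^T \boldsymbol{x} \ge 1</math>, which is where the constraint <math>y_i(\beta^T \boldsymbol{x} + \beta_0) \ge 1</math> comes from.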
<br />
=====Writing Lagrangian Form of Support Vector Machine =====<br />
<br />
The Lagrangian form using [http://en.wikipedia.org/wiki/Lagrange_multipliers Lagrange multipliers] and constraints that are discussed below is introduced to ensure that the optimization conditions are satisfied, as well as finding an optimal solution (the optimal saddle point of the Lagrangian for the [http://en.wikipedia.org/wiki/Quadratic_programming classic quadratic optimization]). The problem will be solved in dual space by introducing <math>\,\alpha_i</math> as dual constraints, this is in contrast to solving the problem in primal space as function of the betas. A [http://www.cs.wisc.edu/dmi/lsvm/ simple algorithm] for iteratively solving the Lagrangian has been found to run well on very large data sets, making SVM more usable. Note that this algorithm is intended to solve Support Vector Machines with some tolerance for errors - not all points are necessarily classified correctly. Several papers by Mangasarian explore different algorithms for solving SVM.<br />
<br />
The Lagrangian function of the above optimization problem is:<br />
<br />
<math><br />
\begin{align}<br />
\displaystyle L(\boldsymbol\beta, \beta_0, \boldsymbol\alpha) &= \frac 12 \vert \boldsymbol\beta \vert^2 - \sum_{i=1}^n \alpha_i \left[ y_i (\boldsymbol{\beta^T x_i}+\beta_0) -1 \right]\\<br />
&= \frac 12 \vert \boldsymbol\beta \vert^2 - \boldsymbol\beta^T \sum_{i=1}^n \alpha_i y_i \boldsymbol{x_i} - \sum_{i=1}^n \alpha_i y_i \beta_0 + \sum_{i=1}^n \alpha_i<br />
\end{align}<br />
</math><br />
<br />
where <math>\boldsymbol\alpha = (\alpha_1 ,\dots ,\alpha_n) </math> are Lagrange multipliers, with <math> \alpha_{i} \geq 0, \; i=1,\dots,n </math>.<br />
<br />
To find the optimal value, we set the derivatives equal to zero: <math>\,\frac{\partial L}{\partial \boldsymbol{\beta}} = 0</math> and <math>\,\frac{\partial L}{\partial \beta_0} = 0</math>.<br />
<br />
<math><br />
\begin{align}<br />
\displaystyle &\frac{\partial L}{\partial \boldsymbol{\beta}} = \boldsymbol\beta - \sum_{i=1}^n \alpha_i y_i \boldsymbol{x_i} = 0 &\Longrightarrow& \boldsymbol\beta = \sum_{i=1}^n \alpha_i y_i\boldsymbol{x_i}\\<br />
&\frac{\partial L}{\partial \beta_0} = - \sum_{i=1}^n \alpha_i y_i = 0 &\Longrightarrow& \sum_{i=1}^n \alpha_i y_i = 0 <br />
\end{align}<br />
</math><br />
<br />
To get the dual form of the optimization problem we replace the above two equations in definition of <math>L(\boldsymbol\beta, \beta_0, \boldsymbol\alpha)</math>. <br />
<br />
We have:<br />
<math><br />
\begin{align}<br />
\displaystyle L(\boldsymbol\beta, \beta_0, \boldsymbol\alpha) &= \frac 12 \boldsymbol\beta^T\boldsymbol\beta - \boldsymbol\beta^T \sum_{i=1}^n \alpha_i y_i \boldsymbol{x_i} - \sum_{i=1}^n \alpha_i y_i \beta_0 + \sum_{i=1}^n \alpha_i\\<br />
&= \frac 12 \boldsymbol\beta^T \sum_{i=1}^n \alpha_i y_i\boldsymbol{x_i} - \boldsymbol\beta^T \sum_{i=1}^n \alpha_i y_i\boldsymbol{x_i} - 0 + \sum_{i=1}^n \alpha_i\\<br />
&= - \frac 12 \boldsymbol\beta^T \sum_{i=1}^n \alpha_i y_i\boldsymbol{x_i} + \sum_{i=1}^n \alpha_i\\<br />
&= - \frac 12 \sum_{i=1}^n \alpha_i y_i\boldsymbol{x_i}^T \sum_{j=1}^n \alpha_j y_j\boldsymbol{x_j} + \sum_{i=1}^n \alpha_i\\<br />
&= \sum_{i=1}^n \alpha_i - \frac 12 \sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_jy_iy_j\boldsymbol{x_i}^T\boldsymbol{x_j}<br />
\end{align}<br />
</math><br />
<br />
The above function is the dual objective, which we maximize:<br />
<br />
<math><br />
\begin{align}<br />
\displaystyle \max_\alpha &\sum_{i=1}^n \alpha_i - \frac 12 \sum_{i=1}^n\sum_{j=1}^n \alpha_i \alpha_j y_i y_j \boldsymbol{x_i}^T \boldsymbol{x_j}\\<br />
s.t.\; & \alpha_i \geq 0\\<br />
& \sum_{i=1}^n \alpha_i y_i = 0<br />
\end{align}<br />
</math><br />
<br />
The dual is a quadratic function of several variables subject to linear constraints. This optimization problem is called quadratic programming, and it is much easier to solve than the primal. It is also possible to write the dual form using matrices:<br />
<br />
<math><br />
\begin{align}<br />
\displaystyle \max_\alpha \,& \boldsymbol\alpha^T\boldsymbol{1} - \frac 12 \boldsymbol\alpha^T S \boldsymbol\alpha\\<br />
s.t.\; & \boldsymbol\alpha \geq 0\\<br />
& \boldsymbol\alpha^Ty = 0\\<br />
& S = ([y_1,\dots, y_n]\odot X)^T ([y_1,\dots, y_n]\odot X)<br />
\end{align}<br />
</math><br />
<br />
<br />
Since <math> S = ([y_1,\dots, y_n]\odot X)^T ([y_1,\dots, y_n]\odot X) </math>, S is a positive semi-definite matrix. This means the dual objective is concave [http://en.wikipedia.org/wiki/Convex_function] (equivalently, its negative is convex), so it has no local maximum that is not global. It is therefore relatively easy to find the global maximum.<br />
<br />
This is a much simpler optimization problem and we can solve it by [http://en.wikipedia.org/wiki/Quadratic_programming Quadratic programming]. Quadratic programming (QP) is a special type of mathematical optimization problem. It is the problem of optimizing (minimizing or maximizing) a quadratic function of several variables subject to linear constraints on these variables.<br />
The general form of such a problem is minimize with respect to <math>\,x</math><br />
: <math>f(x) = \frac{1}{2}x^TQx + c^Tx</math><br />
subject to one or more constraints of the form:<br />
<br />
<math>\,Ax\le b</math>, <math>\,Ex=d</math>.<br />
<br />
A good description of the general QP problem formulation and solution can be found [http://www.me.utexas.edu/~jensen/ORMM/supplements/methods/nlpmethod/S2_quadratic.pdf here].<br />
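For intuition, the dual can be maximized by brute force on a tiny made-up data set (a real implementation would call a QP solver such as those linked above). With the two symmetric points below, the constraint <math>\sum_i \alpha_i y_i = 0</math> forces <math>\alpha_1 = \alpha_2</math>, so the dual reduces to a one-variable problem:

```python
# Toy data: x1 = (1, 1) with y1 = +1, x2 = (-1, -1) with y2 = -1.
xs = [(1.0, 1.0), (-1.0, -1.0)]
ys = [1.0, -1.0]

def dual(alpha):
    """Dual objective with alpha_1 = alpha_2 = alpha (forced by sum alpha_i y_i = 0)."""
    a = [alpha, alpha]
    quad = sum(a[i] * a[j] * ys[i] * ys[j] *
               (xs[i][0] * xs[j][0] + xs[i][1] * xs[j][1])
               for i in range(2) for j in range(2))
    return sum(a) - 0.5 * quad

# Brute-force maximization over a grid (a QP solver would do this in general).
alpha = max((k / 1000.0 for k in range(1001)), key=dual)

# Recover beta = sum_i alpha_i y_i x_i, then beta_0 from y_1 (beta^T x_1 + beta_0) = 1.
beta = [sum(alpha * ys[i] * xs[i][d] for i in range(2)) for d in range(2)]
beta0 = ys[0] - (beta[0] * xs[0][0] + beta[1] * xs[0][1])

print(alpha, beta, beta0)  # alpha = 0.25, beta = (0.5, 0.5), beta0 = 0
```

Here the dual is <math>2\alpha - 4\alpha^2</math>, maximized at <math>\alpha = 1/4</math>; both points are support vectors, and the recovered hyperplane passes through the origin midway between them.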
<br />
===== Discussion on the Dual of the Lagrangian =====<br />
As mentioned in the previous section, solving the dual form of the Lagrangian requires quadratic programming. Quadratic programming can be used to minimize a quadratic function subject to a set of constraints. In general, for a problem with N variables, the quadratic programming solution has a computational complexity of <math>\ O(N^3) </math> <br />
<ref name="CMBishop" />. The original problem formulation only has (d+1) variables that need to be found (i.e. the values of <math>\ \beta </math> and <math>\ \beta_0 </math>), where d is the dimensionality of the data points. However, the dual form of the Lagrangian has n variables that need to be found (i.e. all the <math>\ \alpha </math> values), where n is the number of data points. It is likely that n is larger than (d+1) (i.e. the number of data points is larger than the dimensionality of the data plus 1), which makes the dual form of the Lagrangian seem computationally inefficient <ref name="CMBishop" />. However, the dual of the Lagrangian allows the inner product <math>\ x_i^T x_j </math> to be expressed using a kernel formulation which allows the data to be transformed into higher feature spaces and thus allowing seemingly non-linearly separable data points to be separated, which is a highly useful feature described in more detail in the next class <ref name="CMBishop" />.<br />
<br />
===== Support Vector Method Packages=====<br />
<br />
One of the popular Matlab toolboxes for SVM is [http://www.csie.ntu.edu.tw/~cjlin/libsvm/ LIBSVM], developed in the Department of Computer Science and Information Engineering, National Taiwan University, under the supervision of Chih-Chung Chang and Chih-Jen Lin. The page provides many different interfaces to LIBSVM, such as Matlab, C++, Python, Perl, and many other languages, each developed at a different institute by a variety of engineers and mathematicians. The page also includes a thorough introduction to the package and its various parameters.<br />
<br />
A very helpful tool on the [http://www.csie.ntu.edu.tw/~cjlin/libsvm/ LIBSVM] page is a graphical interface for SVM: an applet in which we can draw points for each of the two classes of a classification problem and, by adjusting the SVM parameters, observe the resulting solution.<br />
<br />
If you found LIBSVM helpful and wanted to use it for your research, [http://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html#f203 please cite the toolbox].<br />
<br />
A fairly long list of other SVM packages, with a comparison in terms of language, execution platform, and multiclass and regression capabilities, can be found [http://www.cs.ubc.ca/~murphyk/Software/svm.htm here].<br />
<br />
The top 3 SVM software packages are:<br />
<br />
1. LIBSVM<br />
<br />
2. SVMlight<br />
<br />
3. SVMTorch<br />
<br />
More information which introduces SVM software and their comparison can be found [http://www.svms.org/software.html here] and [http://www.support-vector-machines.org/SVM_soft.html here].<br />
<br />
== Support Vector Machine Continued (Lecture: Nov. 1, 2011) ==<br />
<br />
In the previous lecture we considered the case when data is linearly separable. The goal of the Support Vector Machine classifier is to find the hyperplane that maximizes the margin distance from the hyperplane to each of the two classes. We derived the following optimization problem based on the SVM methodology:<br />
<br />
<math>\, \min_{\beta} \frac{1}{2}{|\boldsymbol{\beta}|}^2</math><br />
<br />
Subject to the constraint: <br />
<br />
<math>\,y_i(\boldsymbol{\beta}^T\mathbf{x}_i+\beta_0)\geq1, \quad y_i \in \{-1,1\} \quad \forall{i} =1, \ldots , n</math><br /><br />
<br />
Notice that SVM is inherently a two-class classifier; extending it to problems with more classes requires additional work (for example, one-vs-rest or one-vs-one schemes). <br />
<br />
This is the primal form of the optimization problem. Then we derived the dual of this problem:<br />
<br />
<math>\, \max_\alpha \quad \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j </math><br />
<br />
Subject to constraints: <br />
<br />
<math>\,\alpha_i\geq 0 </math><br />
<br />
<math>\,\sum_i \alpha_i y_i =0</math><br />
<br />
<br />
This is a quadratic programming problem. QP problems have been thoroughly studied and can be solved efficiently. This particular problem has a concave objective (equivalently, a convex minimization) with convex constraints. This guarantees a global optimum, even if we use local search algorithms (e.g. gradient methods). These properties are of significant importance for classifiers and are among the most important strengths of the SVM classifier. <br />
<br />
For an easy implementation of SVM, and for solving the above quadratic optimization problem in R, see<ref><br />
http://cbio.ensmp.fr/~thocking/mines-course/2011-04-01-svm/svm-qp.pdf<br />
</ref><br />
<br />
We are able to find <math>\,\beta</math> when <math>\,\alpha</math> is found: <br />
<br />
<math>\, \boldsymbol{\beta} = \sum_i \alpha_i y_i \mathbf{x}_i </math><br />
<br />
But in order to find the hyper-plane uniquely we also need to find <math>\,\beta_0</math>. <br />
<br />
When finding the dual objective function, there is a set of conditions called '''KKT''' that should be satisfied.<br />
<br />
=== Examining KKT Conditions ===<br />
KKT stands for [http://en.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions Karush-Kuhn-Tucker] (initially named after Kuhn and Tucker's work in the 1950's, however, it was later discovered that Karush had stated the conditions back in the late 1930's) <ref name="CMBishop" /><br />
<br />
The K.K.T. conditions are as follows: stationarity, primal feasibility, dual feasibility, and complementary slackness.<br />
<br />
It gives us a closer look into the Lagrangian equation and the associated conditions. <br />
<br />
Suppose we want to find <math>\, \min_x f(x)</math> subject to the constraint <math>\, g_i(x)\geq 0 , \forall{x} </math>. The Lagrangian is then computed as:<br />
<br />
<math>\, \mathcal{L} (x,\alpha_i)=f(x)-\sum_i \alpha_i g_i(x) </math><br />
<br />
If <math> \, x^* </math> is a local minimum of the constrained problem, the necessary conditions for <math> \, x^* </math> are:<br />
<br />
1) '''Stationarity''': <math>\, \frac{\partial \mathcal{L}}{\partial x} (x^*) = 0 </math> that is <math>\, f'(x^*) - \Sigma_i{\alpha_ig'_i(x^*)}=0</math><br />
<br />
2) '''Dual Feasibility''': <math>\, \alpha_i\geq 0 , </math><br />
<br />
3) '''Complementary Slackness''': <math>\, \alpha_i g_i(x^*)=0 , </math><br />
<br />
4) '''Primal Feasibility''': <math>\, g_i(x^*)\geq 0 , </math><br />
<br />
<br />
If any of the above four conditions is not satisfied, then <math> \, x^* </math> is not an optimal solution. <br />
<br />
=====Support Vectors=====<br />
Support vectors are the training points that determine the optimal separating hyperplane that we seek i.e. the margin is calculated as the distance from the hyperplane to the support vectors. Also, they are the most difficult points to classify and at the same time the most informative for classification.<br />
<br />
In our case, the <math>g_i({x})</math> function is:<br />
:<math>\,g_i(x) = y_i(\beta^Tx_i+\beta_0)-1</math><br />
<br />
Substituting <math>\,g_i</math> into KKT condition 3, we get <math>\,\alpha_i[y_i(\beta^Tx_i+\beta_0)-1] = 0</math>. <br />In order for this condition to be satisfied, either <br /><math>\,\alpha_i= 0</math> or <br /><math>\,y_i(\beta^Tx_i+\beta_0)=1</math><br />
<br />
All points <math>\,x_i</math> satisfy <math>y_i(\beta^T \boldsymbol{x_i} + \beta_0) \geq 1</math>, i.e. they lie on or beyond the margin, since this quantity is the scaled signed distance of the point in the direction of its target value.<br />
<br />
'''Case 1: a point away from the margin'''<br />
<br />
If <math>\,y_i(\beta^Tx_i+\beta_0) > 1 \Rightarrow \alpha_i = 0</math>.<br />
<br />
In other words, if point <math>\, x_i</math> is not on the margin (i.e. <math>\boldsymbol{x_i}</math> is not a support vector), then the corresponding <math>\,\alpha_i=0</math>.<br />
<br />
'''Case 2: a point on the margin'''<br />
<br />
If <math>\,y_i(\beta^Tx_i+\beta_0) = 1 \Rightarrow \alpha_i > 0 </math>.<br />
<br />If point <math>\, x_i</math> is on the margin (i.e. <math>\boldsymbol{x_i}</math> is a support vector), then the corresponding <math>\,\alpha_i>0</math>.<br />
<br />
<br />
Points on the margin, with corresponding <math>\,\alpha_i > 0</math>, are called '''''support vectors'''''.<br />
<br />
Since it is impossible for us to know a priori which of the training data points would end up as the support vectors, it is necessary for us to work with the entire training set to find the optimal hyperplane. It is usually the case that we only use a small number of support vectors, which makes the SVM model very robust to new data.<br />
<br />
<br />
To compute <math>\ \beta_0</math>, we need to choose any <math>\,\alpha_i > 0</math>, this will satisfy:<br />
<br />
<math>\,y_i(\beta^Tx_i+\beta_0) = 1</math>.<br />
<br />
We can compute <math>\,\beta = \sum_i \alpha_i y_i x_i </math>, substitute <math>\ \beta</math> in <math>\,y_i(\beta^Tx_i+\beta_0) = 1</math> and solve for <math>\ \beta_0</math>.<br />
<br />
Everything we derived so far was based on the assumption that the data is linearly separable (termed '''Hard Margin SVM'''), but there are many cases in practical applications that the data is not linearly separable.<br />
<br />
=== Kernel Trick ===<br />
<br />
[[File:Kerneltrick.JPG|500px|thumb|right|An example of mapping 2D space into 3D such that the inseparable red o's and the blue +'s in 2D space can be separated when mapped into 3D space <ref><br />
Jordan (2004). ''The Kernel Trick.'' [Lecture]. Available: [http://www.cs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec3.pdf.]|</ref>]]<br />
<br />
We talked about the curse of dimensionality at the beginning of this course. However, we now turn to the power of high dimensions in order to find a hyperplane between two classes of data points that can linearly separate the transformed (mapped) data in a space that has a higher dimension than the space in which the training data points reside. <br />
<br />
To understand this, imagine a two dimensional prison where a two dimensional person is constrained. Suppose magically we give the person a third dimension, then he can escape from the prison. In other words, the prison and the person are linearly separable now with respect to the third dimension. The intuition behind the [http://www.cs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec3.pdf kernel trick] is basically to map data to a higher dimension in which the mapped data are linearly separable by a hyperplane, even if the original data are not linearly separable.<br />
<br />
The original optimal hyperplane algorithm proposed by [http://en.wikipedia.org/wiki/Vladimir_Vapnik Vladimir Vapnik] in 1963 was a linear classifier. However, in 1992, Bernhard Boser, Isabelle Guyon and Vapnik suggested a way to create non-linear classifiers by applying the kernel trick to maximum-margin hyperplanes. The algorithm is very similar, except that every dot product is replaced by a non-linear kernel function as below. This allows the algorithm to fit the maximum-margin hyperplane in a transformed feature space. We have seen SVM as a linear classification problem that finds the maximum margin hyperplane in the given input space. However, for many real world problems a more complex decision boundary is required. The following simple method was devised in order to solve the same linear classification problem but in a higher dimensional space, a [http://en.wikipedia.org/wiki/Feature_space feature space], under which the maximum margin hyperplane is better suited.<br />
<br />
In Machine Learning, the kernel trick is a way of mapping points into an inner product space, hoping that the new space is more suitable for classification. <br />
<math>\phi</math> is a function that maps m-dimensional data to a higher-dimensional space, so that data that are not linearly separable in the original space may become linearly separable after the mapping.<br />
Example:<br />
<br />
<math> \left[\begin{matrix}<br />
\,x \\<br />
\,y \\<br />
\end{matrix}\right] \rightarrow\ \left[\begin{matrix}<br />
\,x^2 \\<br />
\,y^2 \\<br />
\, \sqrt{2}xy \\<br />
\end{matrix}\right]</math><br />
<br />
<math>k(x,y)=\phi^{T}(x)\phi(y)</math><br />
<br />
<math> \left[\begin{matrix}<br />
\,x_1 \\<br />
\,y_1 \\<br />
\end{matrix}\right] \rightarrow\ \left[\begin{matrix}<br />
\,x_1^2 \\<br />
\,y_1^2 \\<br />
\, \sqrt{2}x_1y_1 \\<br />
\end{matrix}\right]</math><br />
<br />
<math> \left[\begin{matrix}<br />
\,x_2 \\<br />
\,y_2 \\<br />
\end{matrix}\right] \rightarrow\ \left[\begin{matrix}<br />
\,x_2^2 \\<br />
\,y_2^2 \\<br />
\, \sqrt{2}x_2y_2 \\<br />
\end{matrix}\right]</math><br />
<br />
<br />
<br />
<math> \left[\begin{matrix}<br />
\,x_1^2 \\<br />
\,y_1^2 \\<br />
\, \sqrt{2}x_1y_1 \\<br />
\end{matrix}\right] ^{T} * \left[\begin{matrix}<br />
\,x_2^2 \\<br />
\,y_2^2 \\<br />
\, \sqrt{2}x_2y_2 \\<br />
\end{matrix}\right] = K(\left[\begin{matrix}<br />
\,x_1 \\<br />
\,y_1 \\<br />
\end{matrix}\right],\left[\begin{matrix}<br />
\,x_2 \\<br />
\,y_2 \\<br />
\end{matrix}\right] ) </math><br />
<br />
Recall our objective function: <math>\sum_i \alpha_i - \frac{1}{2} \sum_{ij} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j</math><br />
We can replace <math> \mathbf{x}_i^T\mathbf{x}_j </math> by <math> \mathbf{\phi^{T}(x_i)}\mathbf{\phi(x_j)}= k(x_i,x_j) </math><br />
<br />
<br />
<math> \left[\begin{matrix}<br />
\,k(x_1, x_1)& \,k(x_1, x_2)& \cdots &\,k(x_1, x_n) \\<br />
\vdots& \vdots& \vdots& \vdots\\<br />
\,k(x_n, x_1)& \,k(x_n, x_2)& \cdots &\,k(x_n, x_n) \\<br />
\end{matrix}\right] </math><br />
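The 2-D example above can be verified directly: for <math>\phi(x, y) = (x^2, y^2, \sqrt{2}\,xy)</math>, the inner product <math>\phi(a)^T\phi(b)</math> equals <math>(a^Tb)^2</math>, so the kernel can be evaluated without ever forming <math>\phi</math>. A minimal check in Python:

```python
import math

def phi(v):
    """Explicit feature map from 2-D to 3-D, as in the example above."""
    x, y = v
    return (x * x, y * y, math.sqrt(2) * x * y)

def k(a, b):
    """Kernel k(a, b) = (a^T b)^2, evaluated without computing phi."""
    return (a[0] * b[0] + a[1] * b[1]) ** 2

a, b = (1.0, 2.0), (3.0, -1.0)
lhs = sum(p * q for p, q in zip(phi(a), phi(b)))  # phi(a)^T phi(b)
print(abs(lhs - k(a, b)) < 1e-9)  # True: the two agree
```

This is exactly the substitution used in the dual: each <math>\mathbf{x}_i^T\mathbf{x}_j</math> is replaced by <math>k(x_i, x_j)</math>, and the 3-D map is never computed.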
<br />
<br />
In most real-world cases the data points are not linearly separable. How can the above methods be generalized to the case where the decision function is not a linear function of the data? Boser, Guyon and Vapnik (1992) showed that a rather old trick (Aizerman, 1964) can be used to accomplish this in an astonishingly straightforward way. First notice that the only way in which the data appear in the dual-form optimization problem is in the form of dot products <math>\mathbf{x}_i^T\mathbf{x}_j</math>. Now suppose we first use a non-linear operator <math> \Phi(\mathbf{x}) </math> to map the data points to some other, higher-dimensional space (possibly infinite-dimensional) <math> \mathcal{H} </math> (called a Hilbert space or feature space), where they can be classified linearly. The figure below illustrates this concept:<br />
<br />
<br />
[[File:kernell trick.jpg|500px|thumb|centre|Mapping of not-linearly separable data points in a two-dimensional space to a three-dimensional space where they can be linearly separable by means of a kernel function.]]<br />
<br />
<br />
In other words, a linear learning machine can be employed in the higher-dimensional feature space to solve the original non-linear problem. The training algorithm would then depend on the data only through dot products in <math> \mathcal{H} </math>, i.e. through functions of the form <math>\langle\Phi (\mathbf{x}_i),\Phi (\mathbf{x}_j)\rangle </math>. Note that the actual mapping <math> \Phi(\mathbf{x}) </math> does not need to be known; only the inner product of the mapping is needed to modify the support vector machine so that it can separate non-linearly separable data. Avoiding the explicit mapping to the higher-dimensional space is preferable, because higher-dimensional spaces may suffer from the ''curse of dimensionality''.<br />
<br />
So the hypothesis in this case would be<br />
<br />
<math>f(\mathbf{x}) = \boldsymbol{\beta}^T \Phi (\mathbf{x}) + \beta_0</math><br />
<br />
which is linear in terms of the new space that <math> \Phi (\mathbf{x}) </math> maps the data to, but non-linear in the original space. Now we can extend all the presented optimization problems for the linear case, for the transformed data in the feature space. If we define the kernel function as<br />
<br />
<math> K (\mathbf{x}_i,\mathbf{x}_j) = \langle\Phi (\mathbf{x}_i),\Phi (\mathbf{x}_j)\rangle = \Phi(\mathbf{x}_i)^T \Phi (\mathbf{x}_j)</math><br />
<br />
where <math>\ \Phi </math> is a mapping from input space to an (inner product) feature space. Then the corresponding dual form is<br />
<br />
<br />
<math>L(\boldsymbol{\alpha}) =\sum_{i=1}^n \alpha_i - \frac 12 \sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_jy_iy_j K (\mathbf{x}_i,\mathbf{x}_j)</math><br />
<br />
subject to <math>\sum_{i=1}^n \alpha_i y_i=0 \quad \quad \alpha_i \geq 0,\quad i=1, \cdots, n</math><br />
<br />
<br />
The cost function <math> L(\boldsymbol{\alpha}) </math> is convex and quadratic in terms of the unknown parameters. This problem is solved through quadratic programming. The [http://en.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions KKT] conditions for this equation lead to the following final decision rule:<br />
<br />
<math> L(\mathbf{x}, \boldsymbol{\alpha}^{\ast}, \beta_0) =\sum_{i=1}^{N_{sv}} y_i \alpha_i^{\ast} K (\mathbf{x}_i,\mathbf{x}) + \beta_0</math><br />
<br />
<br />
where <math>\ N_{sv} </math> and <math>\ \alpha_i^{\ast}</math> denote the number of support vectors and the non-zero Lagrange multipliers corresponding to the support vectors, respectively. <br />
<br />
Several typical choices of kernels are linear, polynomial, Sigmoid or Multi-Layer Perceptron (MLP) and Gaussian or Radial Basis Function (RBF) kernel. Their expressions are as following:<br />
<br />
Linear kernel: <math> K (\mathbf{x}_i,\mathbf{x}_j) = \mathbf{x}_i^T\mathbf{x}_j</math> <br />
<br />
Polynomial kernel: <math> K (\mathbf{x}_i,\mathbf{x}_j) = (1 + \mathbf{x}_i^T\mathbf{x}_j)^p</math> <br />
<br />
Sigmoid (MLP) kernel: <math> K (\mathbf{x}_i,\mathbf{x}_j) = \tanh (k_1\mathbf{x}_i^T\mathbf{x}_j +k_2)</math> <br />
<br />
Gaussian (RBF) kernel: <math>\ K(\mathbf{x}_i,\mathbf{x}_j) = \exp\left[\frac{-(\mathbf{x}_i - \mathbf{x}_j)^T (\mathbf{x}_i - \mathbf{x}_j)}{2\sigma^2 }\right]</math> <br />
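As a concrete sketch, the Gaussian (RBF) kernel above can be implemented and its Gram matrix formed directly (the data points and <math>\sigma</math> here are made-up illustrative values):

```python
import math

def rbf(a, b, sigma=1.0):
    """Gaussian (RBF) kernel: exp(-||a - b||^2 / (2 sigma^2))."""
    sq = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq / (2.0 * sigma ** 2))

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
gram = [[rbf(p, q) for q in points] for p in points]

# K(x, x) = 1 and the matrix is symmetric, as a Mercer kernel must be.
print(gram[0][0], gram[0][1] == gram[1][0])
```

This Gram matrix is exactly what replaces the matrix of dot products <math>\mathbf{x}_i^T\mathbf{x}_j</math> in the kernelized dual.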
<br />
<br />
Kernel functions satisfying [http://en.wikipedia.org/wiki/Mercer%27s_condition Mercer's conditions] not only enable implicit mapping of the data from input space to feature space but also ensure convexity of the cost function, which leads to a unique optimum. Mercer's condition states that a continuous symmetric function <math> K(\mathbf{x},\mathbf{y}) </math> must be positive semi-definite in order to be a kernel function that can be written as an inner product between data pairs. Note that we only need to use K in the training algorithm, and never need to know <math>\ \Phi </math> explicitly. <br />
<br />
Furthermore, one can construct new kernels from previously defined kernels.[http://www.cc.gatech.edu/~ninamf/ML10/lect0309.pdf] Given two kernels <math>K_1 (\mathbf{x}_i,\mathbf{x}_j)</math> and <math>K_2 (\mathbf{x}_i,\mathbf{x}_j)</math>, properties include:<br />
<br />
1. <math>K (\mathbf{x}_i,\mathbf{x}_j) = \alpha K_1 (\mathbf{x}_i,\mathbf{x}_j) + \beta K_2 (\mathbf{x}_i,\mathbf{x}_j) </math> for <math> \alpha , \beta \geq 0 </math><br />
<br />
2. <math>K (\mathbf{x}_i,\mathbf{x}_j) = K_1 (\mathbf{x}_i,\mathbf{x}_j) K_2 (\mathbf{x}_i,\mathbf{x}_j) </math> <br />
<br />
3. <math>K (\mathbf{x}_i,\mathbf{x}_j) = K_1 (f ( \mathbf{x}_i ) ,f ( \mathbf{x}_j ) ) </math> where <math>\, f \colon X \rightarrow X </math><br />
<br />
4. <math>K (\mathbf{x}_i,\mathbf{x}_j) = f ( K_1 ( \mathbf{x}_i , \mathbf{x}_j ) ) </math> where <math>\, f </math> is a polynomial with positive coefficients.<br />
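These closure properties can be checked numerically: a valid kernel must produce a positive semi-definite Gram matrix, so the minimum eigenvalue of any Gram matrix built with properties 1 or 2 should be non-negative (up to round-off). A minimal NumPy sketch, using randomly generated points:<br />

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))           # 6 points in R^3

G1 = X @ X.T                          # linear-kernel Gram matrix
G2 = (1 + X @ X.T) ** 2               # polynomial-kernel Gram matrix

def min_eig(G):
    return np.linalg.eigvalsh(G).min()

# Property 1: nonnegative linear combination of kernels
G_sum = 2.0 * G1 + 3.0 * G2
# Property 2: product of kernels (elementwise/Schur product of Gram matrices)
G_prod = G1 * G2

for G in (G1, G2, G_sum, G_prod):
    assert min_eig(G) > -1e-8         # positive semi-definite up to round-off
```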
<br />
<br />
In the case of the Gaussian or RBF kernel, for example, <math> \mathcal{H} </math> is infinite dimensional, so it would not be very easy to work with <math> \Phi </math> explicitly. However, if one replaces <math> \Phi(\mathbf{x}_i)^T \Phi(\mathbf{x}_j) </math> by <math> K (\mathbf{x}_i,\mathbf{x}_j) </math> everywhere in the training algorithm, the algorithm will happily produce a support vector machine which lives in an infinite dimensional space, and furthermore do so in roughly the same amount of time it would take to train on the un-mapped data. All the considerations of the previous sections hold, since we are still doing a linear separation, but in a different space.<br />
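To see concretely that a kernel computes an inner product in feature space without ever forming <math>\Phi</math>, consider the polynomial kernel with <math>p=2</math> in two dimensions, whose explicit feature map is the 6-dimensional <math>\Phi(\mathbf{x}) = (1, \sqrt{2}x_1, \sqrt{2}x_2, x_1^2, x_2^2, \sqrt{2}x_1x_2)</math>. The sketch below verifies <math>K(\mathbf{x},\mathbf{z}) = \Phi(\mathbf{x})^T\Phi(\mathbf{z})</math> numerically:<br />

```python
import numpy as np

def K(x, z):                          # polynomial kernel, p = 2
    return (1 + x @ z) ** 2

def phi(x):                           # explicit feature map for p = 2, d = 2
    x1, x2 = x
    s = np.sqrt(2)
    return np.array([1, s * x1, s * x2, x1 ** 2, x2 ** 2, s * x1 * x2])

rng = np.random.default_rng(1)
for _ in range(100):
    x = rng.normal(size=2)
    z = rng.normal(size=2)
    # kernel value equals the inner product in the 6-dimensional feature space
    assert np.isclose(K(x, z), phi(x) @ phi(z))
```

For the RBF kernel no such finite-dimensional map exists, which is exactly why the kernel trick is essential there.<br />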
<br />
<br />
The choice of which kernel is best for a particular application usually has to be determined through trial and error, typically via cross-validation. The Gaussian (RBF) kernel is a common default for SVM classification tasks.<br />
<br />
<br />
The video below gives a graphical illustration of how a polynomial kernel works, to get a better sense of the kernel concept:<br />
<br />
[http://www.youtube.com/watch?v=3liCbRZPrZA Mapping data points to a higher dimensional space using a polynomial kernel]<br />
<br />
====Kernel Properties====<br />
Kernel functions must be continuous, symmetric, and preferably should have a positive (semi-) definite Gram matrix. The Gram matrix is the matrix whose elements are <math>\ g_{ij} = K(x_i,x_j) </math>. Kernels which satisfy Mercer's theorem are positive semi-definite, meaning their kernel matrices have no negative eigenvalues. The use of a positive definite kernel ensures that the optimization problem will be convex and the solution will be unique. <ref> Reference:http://crsouza.blogspot.com/2010/03/kernel-functions-for-machine-learning.html#kernel_properties</ref><br />
<br />
<br />
Furthermore, kernels can be categorized into classes based on their properties <ref name="Genton"> M. G. Genton, "Classes of Kernels for Machine Learning: A Statistics Perspective," ''Journal of Machine Learning Research 2'', 2001</ref>:<br />
* ''Nonstationary kernels'' are explicitly dependent on both inputs (e.g., the polynomial kernel).<br />
* ''Stationary kernels'' are invariant to translation (e.g., the Gaussian kernel which only looks at the distance between the inputs).<br />
* ''Reducible kernels'' are nonstationary kernels that can be reduced to stationary kernels via a bijective deformation (for more detailed information see <ref name = "Genton" />).<br />
<br />
====Further Information of Kernel Functions====<br />
<br />
In class we have studied 3 kernel functions: the linear, polynomial and Gaussian kernels. The following are some properties of each:<br />
# '''Linear Kernel''' is the simplest kernel. Algorithms using this kernel are often equivalent to non-kernel algorithms such as standard PCA.<br />
# '''Polynomial Kernel''' is a non-stationary kernel, well suited when the training data is normalized.<br />
# '''Gaussian Kernel''' is an example of a radial basis function kernel.<br />
<br />
When choosing a kernel we need to take into account the data we are trying to model. For example, data that clusters in circles (or hyperspheres) is better classified with a Gaussian kernel.<br />
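A toy illustration of the point about circular clusters: two concentric rings cannot be separated by any line in the original coordinates, but a single quadratic feature <math>\|\mathbf{x}\|^2</math> (exactly the kind of feature a polynomial or RBF-style map supplies) makes them separable by one threshold. A sketch, with arbitrarily chosen radii 1 and 3:<br />

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
theta = rng.uniform(0, 2 * np.pi, size=(2, n))
# class -1: ring of radius 1; class +1: ring of radius 3
inner = 1.0 * np.stack([np.cos(theta[0]), np.sin(theta[0])], axis=1)
outer = 3.0 * np.stack([np.cos(theta[1]), np.sin(theta[1])], axis=1)
X = np.vstack([inner, outer])
y = np.r_[-np.ones(n), np.ones(n)]

# The extra feature ||x||^2 makes the rings linearly separable:
# a single threshold on r2 classifies every point correctly.
r2 = (X ** 2).sum(axis=1)
pred = np.where(r2 > 4.0, 1.0, -1.0)   # threshold between 1^2 and 3^2
assert (pred == y).all()
```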
<br />
Beyond the kernel functions we discussed in class, such as Linear Kernel, Polynomial Kernel and Gaussian Kernel functions, many more kernel functions can be used in the application of kernel methods for machine learning. <br />
<br />
Some examples are: Exponential Kernel, Laplacian Kernel, ANOVA Kernel, Hyperbolic Tangent (Sigmoid) Kernel, Rational Quadratic Kernel, Multiquadric Kernel, Inverse Multiquadric Kernel, Circular Kernel, Spherical Kernel, Wave Kernel, Power Kernel, Log Kernel, Spline Kernel, B-Spline Kernel, Bessel Kernel, Cauchy Kernel, Chi-Square Kernel, Histogram Intersection Kernel, Generalized Histogram Intersection Kernel, Generalized T-Student Kernel, Bayesian Kernel, Wavelet Kernel, etc. <br />
<br />
You may visit http://crsouza.blogspot.com/2010/03/kernel-functions-for-machine-learning.html#kernel_functions for more information.<br />
<br />
=== Case 2: Linearly Non-Separable Data (Soft Margin) ===<br />
<br />
The original SVM was designed for separable data. Since this is a very strong requirement, Corinna Cortes and Vladimir Vapnik later proposed removing it; the result is the soft-margin support vector machine. One of the advantages of SVM is that it is relatively easy to generalize to the case where the data is not linearly separable.<br />
<br />
When the two data sets are not linearly separable, it is impossible to have a hyperplane that completely separates the two classes. In this case the idea is to minimize the number of points that cross the margin and are misclassified. So we allow some points to violate the constraint:<br />
<br />
<math>\, y_i(\beta^T x_i + \beta_0) \geq 1</math><br />
<br />
Hence we allow some of the points to cross the margin (or equivalently violate our constraint), but on the other hand we penalize our objective function (so that the violations of the original constraint remain small):<br />
<br />
<math>\, \min \left(\frac{1}{2} |\beta|^2 +\gamma \sum_i \zeta_i\right) </math><br />
<br />
And now our constraint is as follows: <br />
<br />
<math>\, y_i(\beta^T x_i + \beta_0) \geq 1-\zeta_i</math><br />
<br />
<math>\, \zeta_i \geq 0</math><br />
<br />
We have to check that all '''KKT''' conditions are satisfied: <br />
<br />
<math>\, \mathcal{L}(\beta,\beta_0,\zeta_i,\alpha_i,\lambda_i)=\frac{1}{2}|\beta|^2+\gamma \sum_i \zeta_i -\sum_i \alpha_i[y_i(\beta^T x_i +\beta_0)-(1-\zeta_i)] - \sum_i \lambda_i \zeta_i</math><br />
<br />
<math>\, 1) \frac{\partial\mathcal{L}}{\partial \beta}=\beta-\sum_i \alpha_i y_i x_i = 0 \Rightarrow \beta=\sum_i \alpha_i y_i x_i</math><br />
<br />
<math>\, 2) \frac{\partial\mathcal{L}}{\partial \beta_0}=\sum_i \alpha_i y_i =0</math><br />
<br />
<br />
<math>\, 3) \frac{\partial\mathcal{L}}{\partial \zeta_i}=\gamma - \alpha_i - \lambda_i = 0 </math><br />
<br />
Now we substitute these conditions back into the Lagrangian to obtain the dual form.<br />
<br />
== Support Vector Machine Continued (Lecture: Nov. 3, 2011) ==<br />
<br />
=== Case 2: Linearly Non-Separable Data (Soft Margin [http://fourier.eng.hmc.edu/e161/lectures/svm/node5.html]) Continued ===<br />
<br />
Recall from last time that soft margins are used instead of hard margins when we are using SVM to classify data points that are '''not''' linearly separable. <br />
<br />
===== Soft Margin SVM Derivation of Dual =====<br />
<br />
The soft-margin SVM optimization problem is defined as:<br />
<br />
<math>\min \{\frac{1}{2}|\boldsymbol{\beta}|^2 + \gamma\sum_i \zeta_i\}</math> <br />
<br />
subject to the constraints<br />
<math>y_i(\boldsymbol{\beta}^T \boldsymbol{x_i} + \beta_0) \ge 1-\zeta_i \quad ,\quad \zeta_i \ge 0</math>,<br />
<br />
where <math>\gamma \sum_i \zeta_i</math> is the penalty term for the slack variables <math>\zeta_i</math> (a point with <math>\zeta_i > 0</math> lies across the margin). Note that if all <math>\zeta_i=0</math>, we recover the hard-margin SVM classifier.<br />
<br />
In other words, we have relaxed the constraint for each <math>\boldsymbol{x_i}</math> so that it can violate the margin by an amount <math>\zeta_i</math>.<br />
As such, we want to make sure that all <math>\zeta_i</math> values are as small as possible. So, we penalize them in the objective function by a factor of some chosen <math>\gamma</math>.<br />
<br />
=====Forming the Lagrangian=====<br />
<br />
In this case we have two sets of constraints in the Lagrangian primal form (the margin constraint and <math>\zeta_i \geq 0</math>), and therefore we optimize with respect to two sets of dual variables <math>\, \alpha</math> and <math>\,\lambda</math>,<br />
<br />
<math><br />
L(\boldsymbol{\beta},\beta_0,\zeta_i,\alpha_i,\lambda_i) = \frac{1}{2} |\boldsymbol{\beta}|^2 + \gamma \sum_i \zeta_i - \sum_i \alpha_i [y_i(\boldsymbol{\beta}^T \boldsymbol{x_i} + \beta_0)-1+\zeta_i] - \sum_i \lambda_i \zeta_i<br />
</math> <br />
<br />
Note the following simplification:<br />
<br />
<math>- \sum_i \alpha_i [y_i(\boldsymbol{\beta}^T \boldsymbol{x_i} + \beta_0)-1+\zeta_i] = -\boldsymbol{\beta}^T\sum_i\alpha_i y_i x_i-\beta_0\sum_i\alpha_iy_i+\sum_i\alpha_i-\sum_i\alpha_i\zeta_i</math><br />
<br />
=====Apply KKT conditions=====<br />
<br />
<math><br />
\begin{align}<br />
1) &\frac{\partial \mathcal{L}}{\partial \boldsymbol{\beta}} = \boldsymbol{\beta}-\sum_i \alpha_i y_i \boldsymbol{x_i} = 0 \\<br />
& \rightarrow \boldsymbol{\beta} = \sum_i \alpha_i y_i \boldsymbol{x_i} \\<br />
&\frac{\partial \mathcal{L}}{\partial \beta_0} = \sum_i \alpha_i y_i = 0 \\<br />
&\frac{\partial \mathcal{L}}{\partial \zeta_i} = \gamma - \alpha_i - \lambda_i = 0 \\<br />
& \rightarrow \boldsymbol{\gamma} = \alpha_i + \lambda_i \\<br />
2) &\text{dual feasibility: } \alpha_i \ge 0, \lambda_i \ge 0 \\<br />
3) &\alpha_i [y_i(\boldsymbol{\beta}^T \boldsymbol{x_i} + \beta_0)-1+\zeta_i] = 0, \text{ and } \lambda_i \zeta_i = 0 \\<br />
4) &y_i(\boldsymbol{\beta}^T \boldsymbol{x_i} + \beta_0) \ge 1-\zeta_i \quad,\quad \zeta_i \ge 0 \\<br />
\end{align}<br />
</math><br />
<br />
=====Objective Function=====<br />
Simplifying the Lagrangian the same way we did with the hard margin case, we get the following:<br />
<br />
<math><br />
\begin{align}<br />
L &= \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \boldsymbol{x_i}^T \boldsymbol{x_j} + \gamma \sum_i \zeta_i - \sum_{i,j} \alpha_i \alpha_j y_i y_j \boldsymbol{x_i}^T \boldsymbol{x_j} - \beta_0 \sum_i \alpha_i y_i + \sum_i \alpha_i - \sum_i \alpha_i \zeta_i - \sum_i \lambda_i \zeta_i \\<br />
&= -\frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \boldsymbol{x_i}^T \boldsymbol{x_j} + \sum_i \alpha_i - 0 + (\sum_i \gamma \zeta_i - \sum_i \alpha_i \zeta_i - \sum_i \lambda_i \zeta_i) \\<br />
&= -\frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \boldsymbol{x_i}^T \boldsymbol{x_j} + \sum_i \alpha_i + \sum_i (\gamma - \alpha_i - \lambda_i) \zeta_i \\<br />
&= -\frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \boldsymbol{x_i}^T \boldsymbol{x_j} + \sum_i \alpha_i<br />
\end{align}<br />
</math><br />
<br />
subject to the constraints:<br />
<br />
<math><br />
\begin{align}<br />
\alpha_i &\ge 0 \\<br />
\sum_i \alpha_i y_i &= 0 \\<br />
\lambda_i &\ge 0<br />
\end{align}<br />
</math><br />
<br />
Notice that the simplified Lagrangian is exactly the same as in the hard margin case. The only difference in the soft margin case is the additional constraint <math>\lambda_i \ge 0</math>. Although <math>\gamma</math> does not appear directly in the objective function, from <math>\gamma = \alpha_i + \lambda_i</math> we can discern the following:<br />
<br />
<math>\lambda_i = 0 \implies \alpha_i = \gamma</math><br />
<br />
<math>\lambda_i > 0 \implies \alpha_i < \gamma</math><br />
<br />
Thus, we can derive that the only difference with the soft margin case is the constraint <math>0 \le \alpha_i \le \gamma</math>. This problem can be solved with quadratic programming.<br />
<br />
===== Soft Margin SVM Formulation Summary =====<br />
<br />
In summary, the primal form of the soft-margin SVM is given by:<br />
<br />
<math><br />
\begin{align}<br />
\min_{\boldsymbol{\beta}, \boldsymbol{\zeta}} \quad & \frac{1}{2}|\boldsymbol{\beta}|^2 + \gamma\sum_i \zeta_i \\<br />
\text{s.t. } & y_i(\boldsymbol{\beta}^T \boldsymbol{x_i} + \beta_0) \ge 1-\zeta_i \quad, \quad \zeta_i \ge 0 \qquad i=1,...,M<br />
\end{align}<br />
</math><br />
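As an alternative to quadratic programming, this primal objective can also be minimized directly by subgradient descent on the hinge-loss form <math>\frac{1}{2}|\boldsymbol{\beta}|^2 + \gamma\sum_i \max(0,\ 1-y_i(\boldsymbol{\beta}^T \boldsymbol{x_i}+\beta_0))</math>. A minimal NumPy sketch on toy data (the step size, iteration count, and <math>\gamma=1</math> are arbitrary choices):<br />

```python
import numpy as np

def train_soft_margin(X, y, gam=1.0, lr=0.01, iters=2000):
    """Subgradient descent on the primal:
       min 0.5*||beta||^2 + gam * sum_i max(0, 1 - y_i*(beta^T x_i + beta0))."""
    n, d = X.shape
    beta, beta0 = np.zeros(d), 0.0
    for _ in range(iters):
        margins = y * (X @ beta + beta0)
        viol = margins < 1                    # points with zeta_i > 0 (or on margin)
        g_beta = beta - gam * (y[viol][:, None] * X[viol]).sum(axis=0)
        g_beta0 = -gam * y[viol].sum()
        beta -= lr * g_beta
        beta0 -= lr * g_beta0
    return beta, beta0

# Toy well-separated data: class determined by the sign of the first coordinate.
rng = np.random.default_rng(3)
X = rng.normal(size=(40, 2))
X[:, 0] += np.where(np.arange(40) < 20, 5.0, -5.0)
y = np.where(np.arange(40) < 20, 1.0, -1.0)

beta, beta0 = train_soft_margin(X, y)
assert (np.sign(X @ beta + beta0) == y).all()
```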
<br />
<br />
The corresponding dual form which we derived above is:<br />
<br />
<math><br />
\begin{align}<br />
\max_{\boldsymbol{\alpha}} \quad & \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \boldsymbol{x_i}^T \boldsymbol{x_j} \\<br />
\text{s.t. } & \sum_i \alpha_i y_i = 0 \\<br />
& 0 \le \alpha_i \le \gamma, \qquad i=1,...,M<br />
\end{align}<br />
</math><br />
<br />
Note that the soft-margin dual objective is identical to the hard-margin dual objective! The only difference is that the <math>\,\alpha_i</math> variables are no longer unbounded: they are restricted to be at most <math>\,\gamma</math>. This restriction makes the optimization problem feasible even when the data is non-separable. In the hard-margin case, when <math>\,\alpha_i</math> is unbounded there may be no finite maximum for the objective and we would not be able to converge to a solution. <br />
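To see the box constraint <math>0 \le \alpha_i \le \gamma</math> in action, here is a minimal sketch that maximizes the dual by projected gradient ascent. For simplicity it drops the bias term <math>\beta_0</math>, which removes the equality constraint <math>\sum_i \alpha_i y_i = 0</math> and leaves only the box to project onto; a proper solver would use quadratic programming as described below:<br />

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 2))
X[:, 0] += np.where(np.arange(30) < 15, 4.0, -4.0)
y = np.where(np.arange(30) < 15, 1.0, -1.0)

gam = 1.0
H = (y[:, None] * y[None, :]) * (X @ X.T)    # H_ij = y_i y_j x_i^T x_j

# Projected gradient ascent on the box-constrained dual (bias term dropped).
alpha = np.zeros(30)
lr = 0.001
for _ in range(5000):
    alpha += lr * (1.0 - H @ alpha)          # gradient of the dual objective
    alpha = np.clip(alpha, 0.0, gam)         # project onto 0 <= alpha_i <= gamma

beta = (alpha * y) @ X                        # beta = sum_i alpha_i y_i x_i
assert (alpha >= 0).all() and (alpha <= gam).all()
assert (np.sign(X @ beta) == y).all()
```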
<br />
Also note that <math>\,\gamma</math> is a model parameter and must be chosen as a fixed constant. It controls the trade-off between the size of the margin and the violations. In a data set with a lot of noise (or non-separability) you may want to choose a smaller <math>\,\gamma</math> to ensure a large margin. In practice, <math>\,\gamma</math> is chosen by cross-validation, which tests the model on a held-out sample to determine which <math>\,\gamma</math> gives the best result. However, it may be troublesome to work with <math>\,\gamma</math> since <math>\,\gamma \in (0, \infty)</math>. So often a variant formulation, known as <math>\,\nu</math>-SVM, is used, which uses a better-scaled parameter <math>\,\nu \in (0,1)</math> instead of <math>\,\gamma</math> to balance margin versus separability. <br />
<br />
Finally note that as <math>\,\gamma \rightarrow \infty</math>, the soft-margin SVM converges to hard-margin, as we do not allow any violation.<br />
<br />
=====Soft Margin SVM Problem Interpretation =====<br />
<br />
As in the hard-margin case, the dual formulation for the soft margin given above allows us to interpret the role of certain points as support vectors. <br />
<br />
We consider three cases:<br />
<br />
'''Case 1:''' <math>\,\alpha_i=\gamma</math><br />
<br />
From KKT condition 1 (third part), <math>\,\gamma - \alpha_i - \lambda_i = 0</math> implies <math>\,\lambda_i = 0</math>.<br />
<br />
From KKT condition 3 (second part), <math>\,\lambda_i \zeta_i = 0</math> is now satisfied automatically, so <math>\,\zeta_i</math> is free to be positive. <br />
<br />
Thus this is a point that violates the margin, and we say <math>\,x_i</math> is inside the margin.<br />
<br />
'''Case 2:''' <math>\,\alpha_i=0</math><br />
<br />
From KKT condition 1 (third part), <math>\,\gamma - \alpha_i - \lambda_i = 0</math> implies <math>\,\lambda_i > 0</math>.<br />
<br />
From KKT condition 3 (second part) <math>\,\lambda_i \zeta_i = 0</math> this now implies <math>\,\zeta_i = 0</math>. <br />
<br />
Finally, from KKT condition 3 (first part), <math>y_i(\boldsymbol{\beta}^T \boldsymbol{x_i} + \beta_0) > 1-\zeta_i</math>, and since <math>\,\zeta_i = 0</math>, the point is classified correctly and we say <math>\,x_i</math> is outside the margin. In particular, <math>\,x_i</math> does not play a role in determining the classifier and if we ignored it, we would get the same result.<br />
<br />
'''Case 3:''' <math>\,0 < \alpha_i < \gamma</math><br />
<br />
From KKT condition 1 (third part), <math>\,\gamma - \alpha_i - \lambda_i = 0</math> implies <math>\,\lambda_i > 0</math>.<br />
<br />
From KKT condition 3 (second part) <math>\,\lambda_i \zeta_i = 0</math> this now implies <math>\,\zeta_i = 0</math>. <br />
<br />
Finally, from KKT condition 3 (first part), <math>y_i(\boldsymbol{\beta}^T \boldsymbol{x_i} + \beta_0) = 1-\zeta_i</math>, and since <math>\,\zeta_i = 0</math>, the point is on the margin and we call it a support vector.<br />
<br />
These three scenarios are depicted in Fig..<br />
<br />
'''Case 4:''' <math>\,\zeta_i > 0</math><br />
<br />
From KKT condition 3 (second part), <math>\,\lambda_i \zeta_i = 0</math> implies <math>\,\lambda_i=0</math>, and then KKT condition 1 (third part) implies <math>\,\alpha_i=\gamma </math>. Since <math>y_i(\boldsymbol{\beta}^T \boldsymbol{x_i}+\beta_0)\ge 1-\zeta_i </math> with <math>\,1-\zeta_i<1</math>, the point is closer to the boundary than the margin, so <math>\,x_i</math> is inside the margin (this is the converse of Case 1).<br />
<br />
=====Soft Margin SVM with Kernel =====<br />
<br />
Like hard-margin SVM, we can use the kernel trick to find a non-linear classifier using the dual formulation.<br />
<br />
In particular, we define a non-linear mapping for <math> \boldsymbol{x_i} </math> as <math> \Phi(\boldsymbol{x_i}) </math>, then in dual objective we compute <math> \Phi^T(\boldsymbol{x_i}) \Phi(\boldsymbol{x_j}) </math> instead of <math> \boldsymbol{x_i}^T \boldsymbol{x_j} </math>. Using a kernel function <math> K(\boldsymbol{x_i}, \boldsymbol{x_j}) = \Phi^T(\boldsymbol{x_i}) \Phi(\boldsymbol{x_j}) </math> from the list provided in the previous lecture notes, we then do not need to explicitly map <math> \Phi(\boldsymbol{x_i}) </math>.<br />
<br />
The dual problem we solve is:<br />
<br />
<math><br />
\begin{align}<br />
\max_{\boldsymbol{\alpha}} \quad & \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(\boldsymbol{x_i}, \boldsymbol{x_j}) \\<br />
\text{s.t. } & \sum_i \alpha_i y_i = 0 \\<br />
& 0 \le \alpha_i \le \gamma, \qquad i=1,...,M<br />
\end{align}<br />
</math><br />
<br />
where <math>\, K(\boldsymbol{x_i}, \boldsymbol{x_j}) </math> is an appropriate kernel function.<br />
<br />
To make it clear why we do not need to explicitly map <math> \Phi(\boldsymbol{x_i}) </math>: If we use the kernel trick, both hard- and soft-margin SVMs find the following value for the optimum <math> \boldsymbol{\beta} </math>:<br />
<br />
<math> \boldsymbol{\beta} = \sum_i \alpha_i y_i \Phi(\boldsymbol{x_i}) </math><br />
<br />
From the definition of the classifier, the class labels for points are given by:<br />
<br />
<math> \boldsymbol{\beta}^T \Phi(\boldsymbol{x}) + \beta_0 </math><br />
<br />
Plugging the formula for <math> \boldsymbol{\beta} </math> in the expression above we get:<br />
<br />
<math> \sum_i \alpha_i y_i \Phi^T(\boldsymbol{x_i}) \Phi(\boldsymbol{x}) + \beta_0 </math><br />
<br />
which, from the properties of kernel functions, is equal to:<br />
<br />
<math> \sum_i \alpha_i y_i K(\boldsymbol{x_i}, \boldsymbol{x}) + \beta_0 </math><br />
<br />
Thus, we do not need to explicitly map <math> \boldsymbol{x_i} </math> to a higher dimension.<br />
<br />
=====Soft Margin SVM Implementation =====<br />
<br />
The SVM optimization problem is a quadratic program and we can use any quadratic solver to solve it. For example, MATLAB's optimization toolbox provides <code>quadprog</code>. Alternatively, CVX (by Stephen Boyd) is an excellent free optimization toolbox that integrates with MATLAB and allows one to enter convex optimization problems almost as though they were written on paper. <br />
<br />
We prefer to solve the dual since it is an easier problem (and it also allows us to use a kernel). Using CVX this would be coded as<br />
<br />
<pre><br />
K = X*X';                    % Gram matrix for the linear kernel<br />
H = (y*y') .* K;<br />
cvx_begin<br />
    variable alpha(M,1)<br />
    maximize( sum(alpha) - 0.5*alpha'*H*alpha )<br />
    subject to<br />
        y'*alpha == 0;<br />
        alpha >= 0;<br />
        alpha <= gamma;<br />
cvx_end<br />
</pre><br />
<br />
which provides us with optimal <math>\,\boldsymbol{\alpha}</math>. <br />
<br />
Now we can obtain <math>\,\beta_0</math> by using any point on the margin (i.e. <math>\,0 < \alpha_i < \gamma</math>), and solving<br />
<br />
<math><br />
y_i \left(\sum_j y_j \alpha_j K(\boldsymbol{x_j}, \boldsymbol{x_i}) + \beta_0 \right) = 1<br />
</math><br />
<br />
Note, here <math>\,K(\boldsymbol{x_i}, \boldsymbol{x_j})</math> can simply be the linear kernel <math>\,\boldsymbol{x_i}^T \boldsymbol{x_j}</math>. <br />
<br />
Finally, we can classify a new data point <math>\,\boldsymbol{x}</math>, according to<br />
<br />
<math>h(\boldsymbol{x}) = <br />
\begin{cases} <br />
+1, \ \ \text{if } \sum_j y_j \alpha_j K(\boldsymbol{x_j}, \boldsymbol{x}) + \beta_0 > 0\\<br />
-1, \ \ \text{if } \sum_j y_j \alpha_j K(\boldsymbol{x_j}, \boldsymbol{x}) + \beta_0 < 0<br />
\end{cases}<br />
</math><br />
<br />
Alternatively, using plain MATLAB, the following code finds <math>\,\beta</math> and <math>\,\beta_0</math>. <br />
<br />
<pre><br />
ell = size(X, 1);<br />
% Note: the (1/gamma)*eye(ell) term regularizes H (a 2-norm soft-margin variant)<br />
H = (y * y') .* (X * X' + (1/gamma) * eye(ell));<br />
f = -ones(1, ell);<br />
LB = zeros(ell, 1);<br />
UB = gamma * ones(ell, 1);<br />
alpha = quadprog(H, f, [], [], y', 0, LB, UB);<br />
b = X' * (alpha .* y);      % beta = sum_i alpha_i y_i x_i<br />
% Select a point near the margin (alpha_i > 0, y_i = 1) to recover b0<br />
i = min(find((alpha > 0.1) & (y == 1)));<br />
G = X * X';<br />
b0 = 1 - G(i, :) * (alpha .* y);<br />
</pre><br />
<br />
===== Intuitive Connection to Hard Margin Case =====<br />
The forms of the dual in the hard-margin and soft-margin cases are exceedingly similar; the only difference is a further restriction (<math>\ \alpha_i \leq \gamma</math>) on the dual variables. You could even use the soft-margin formulation to solve a case where the hard-margin problem is feasible. This is not typically done, but doing so can give considerable insight into how the soft-margin problem reacts to changes in <math>\ \gamma </math>. If we let <math>\ \gamma \to +\infty</math> we see that the soft-margin problem approaches the hard-margin problem. If we examine the primal problem this matches our intuitive expectation: as <math>\ \gamma \to +\infty</math> the penalty for being inside the margin increases to infinity, and thus the optimal solution will place paramount importance on having a hard margin. <br />
<br />
When choosing <math>\ \gamma </math> one needs to be careful and understand the implications. Values of <math>\ \gamma </math> that are too large will result in slavish dedication to getting as close to a hard margin as possible. This can result in poor decisions, especially if there are outliers involved. Values of <math>\ \gamma </math> that are too small do not adequately penalize misclassified points. It is important both to test different values of <math>\ \gamma </math> and to exercise discretion when selecting the values of <math>\ \gamma </math> to test. It is also important to examine the impact of outliers, as they can be extremely destructive to the usefulness of the SVM classifier.<br />
<br />
<br />
===Multiclass Support Vector Machines===<br />
<br />
Support vector machines were originally designed for binary classification; therefore we need a methodology to adapt binary SVMs to a multi-class problem. How to effectively extend SVMs to multi-class classification is still an ongoing research issue. Currently the most popular approach for multi-category SVM is to construct and combine several binary classifiers. Different coding and decoding strategies can be used for this purpose, among which one-against-all and one-against-one (pairwise) are the most popular <ref name="CMBishop" />.<br />
<br />
====One-Against-All method====<br />
Assume that we have <math>\ k </math> discrete classes. For a one-against-all SVM, we determine <math>\ k </math> decision functions that each separate one class from the remaining classes. Let the <math>\ i^{th} </math> decision function, which separates class <math>\ i </math> from the remaining classes with maximum margin, be:<br />
<br />
<br />
<math>D_i(\mathbf{x})=\mathbf{w}_i^Tf(\mathbf{x})+b_i</math><br />
<br />
<br />
The hyperplane <math>\ D_i(\mathbf{x})=0 </math> forms the optimal separating hyperplane, and if the classification problem is separable, the training data <math>\mathbf{x}</math> satisfy <br />
<br />
<math>\begin{cases}<br />
D_i(\mathbf{x})\geq1 &,\ \mathbf{x}\text{ belongs to class }i\\<br />
D_i(\mathbf{x})\leq-1 &,\ \mathbf{x}\text{ belongs to the remaining classes}\\<br />
\end{cases}<br />
</math><br />
<br />
In other words, the decision for class <math>\ i </math> is the sign of <math>\ D_i(\mathbf{x})</math>, and therefore it is a discrete function. If the above condition is satisfied for more than one <math>\ i </math>, or for no <math>\ i </math> at all, then <math>\mathbf{x}</math> is unclassifiable. The figure below demonstrates the one-against-all multi-class scheme, where the pink area is the unclassifiable region.<br />
<br />
[[File:one-vs-all multiclass.jpg|400px|thumb|centre|one-against-all multi-class scheme]]<br />
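A minimal sketch of one-against-all prediction, with <math>k=3</math> hypothetical linear decision functions (taking the largest <math>D_i(\mathbf{x})</math>, rather than the sign alone, also resolves the unclassifiable regions):<br />

```python
import numpy as np

def ova_predict(X, W, b):
    """One-against-all: evaluate all k decision functions D_i(x) = w_i^T x + b_i
    and assign each point to the class whose D_i is largest."""
    D = X @ W.T + b              # shape (n, k): one column per class
    return D.argmax(axis=1)

# Three hypothetical linear decision functions for 3 classes in R^2,
# each roughly "pointing at" its class centre.
W = np.array([[ 1.0,  0.0],
              [-1.0,  0.0],
              [ 0.0,  1.0]])
b = np.zeros(3)

X = np.array([[ 2.0, 0.1],     # should be class 0
              [-3.0, 0.2],     # should be class 1
              [ 0.1, 4.0]])    # should be class 2
assert (ova_predict(X, W, b) == [0, 1, 2]).all()
```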
<br />
====One-Against-One (Pairwise) method====<br />
<br />
In this method we construct a binary classifier for each possible pair of classes; therefore for <math>\ k </math> classes we will have <math>\frac{k(k-1)}{2} </math> decision functions. The decision function for the pair of classes <math>i</math> and <math>j</math> is given by<br />
<br />
<math>D_{ij}=\mathbf{w}_{ij}^Tf(\mathbf{x})+b_{ij}</math><br />
<br />
<br />
where <math>D_{ji}(\mathbf{x})=-D_{ij}(\mathbf{x})</math>.<br />
<br />
<br />
The final decision is achieved by a maximum voting scheme. That is, for the datum <math>\mathbf{x}</math> we calculate<br />
<br />
<br />
<math>D_i(\mathbf{x})=\sum_{j\neq i,\, j=1}^{k}\operatorname{sign}(D_{ij}(\mathbf{x}))</math><br />
<br />
<br />
And <math>\mathbf{x}</math> is classified into the class <math>\arg\max_i\ D_i({\mathbf{x}})</math>.<br />
<br />
<br />
The figure below demonstrates the one-against-one multi-class scheme, where the pink area is the unclassifiable region.<br />
<br />
<br />
<br />
[[File:one-vs-one multiclass.jpg|400px|thumb|centre|one-vs-one multi-class scheme]]<br />
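A minimal sketch of the pairwise voting scheme, using three hypothetical pairwise decision functions for classes separated along the first coordinate:<br />

```python
import numpy as np

def ovo_predict(x, D, k=3):
    """One-against-one voting.  D[(i, j)] is a function whose sign is +1 for
    class i and -1 for class j; each pairwise classifier casts one vote."""
    votes = np.zeros(k)
    for (i, j), d in D.items():
        s = np.sign(d(x))
        votes[i] += (s > 0)
        votes[j] += (s < 0)
    return votes.argmax()

# Hypothetical pairwise decision functions for 3 classes laid out along the
# x-axis: class 0 near x=-2, class 1 near x=0, class 2 near x=+2.
D = {
    (0, 1): lambda x: -x[0] - 1.0,   # positive (votes class 0) when x0 < -1
    (0, 2): lambda x: -x[0],         # positive (votes class 0) when x0 < 0
    (1, 2): lambda x: -x[0] + 1.0,   # positive (votes class 1) when x0 < 1
}

assert ovo_predict(np.array([-2.0, 0.0]), D) == 0
assert ovo_predict(np.array([ 0.0, 0.0]), D) == 1
assert ovo_predict(np.array([ 2.0, 0.0]), D) == 2
```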
<br />
===Advantages of Support Vector Machines===<br />
<br />
* SVMs provide good out-of-sample generalization. This means that, by choosing an appropriate regularization parameter, SVMs can be robust even when the training sample has some bias. This is mainly due to the selection of the optimal hyperplane.<br />
* SVMs deliver a unique solution, since the optimization problem is convex. This is an advantage compared to neural networks, which have multiple solutions associated with local minima and for this reason may not be robust over different samples.<br />
*State-of-the-art accuracy on many problems. <br />
*SVMs can handle many different data types by changing the kernel.<br />
<br />
===Disadvantages of Support Vector Machines===<br />
<br />
*Difficulties in the choice of the kernel (which we study further later on).<br />
<br />
* Limitations in speed and size, both in training and testing. <br />
<br />
*Discrete data presents another problem, although with suitable rescaling excellent results have nevertheless been obtained.<br />
<br />
*The optimal design for multiclass SVM classifiers is a further area for research.<br />
<br />
*A problem with SVMs is the high algorithmic complexity and extensive memory requirements of the required quadratic programming in large-scale tasks.<br />
<br />
===Comparison with Neural Networks <ref>www.cs.toronto.edu/~ruiyan/csc411/Tutorial11.ppt</ref>===<br />
<br />
#Neural Networks:<br />
##Hidden Layers map to lower dimensional spaces<br />
##Search space has multiple local minima<br />
##Training is expensive<br />
##Classification extremely efficient<br />
##Requires number of hidden units and layers<br />
##Very good accuracy in typical domains<br />
#SVMs<br />
##Kernel maps to a very-high dimensional space<br />
##Search space has a unique minimum<br />
##Training is extremely efficient<br />
##Classification extremely efficient<br />
##Kernel and cost the two parameters to select<br />
##Very good accuracy in typical domains<br />
##Extremely robust<br />
<br />
=== The Naive Bayes Classifier ===<br />
<br />
The naive Bayes classifier is a very simple (and often effective) classifier based on Bayes rule. <br />
For further reading check [http://www.saylor.org/site/wp-content/uploads/2011/02/Wikipedia-Naive-Bayes-Classifier.pdf]<br />
<br />
The naive Bayes assumption is that all the features are conditionally independent given the class label. Even though this is usually false (since features are usually dependent), the resulting model is easy to fit and works surprisingly well.<br />
<br />
That is, the features <math>\,x_{ij}</math>, <math>\,j = 1, ..., d</math>, are assumed conditionally independent given the class, where <math>\, \mathbf{x}_i \in \mathbb{R}^d</math>.<br />
<br />
Thus the Bayes classifier is<br />
<math> h(\mathbf{x}) = \arg\max_k \quad \pi_k f_k(\mathbf{x})</math><br />
<br />
where <math>\hat{f}_k(\mathbf{x}) = \hat{f}_k(x_1, x_2, \ldots, x_d)= \prod_{j=1}^d \hat{f}_{kj}(x_j)</math>.<br />
<br />
We can see this is a direct application of Bayes rule<br />
<math> P(Y=k|X=\mathbf{x}) =\frac{P(X=\mathbf{x}|Y=k) P(Y=k)} {P(X=\mathbf{x})} = \frac{f_k(\mathbf{x}) \pi_k} {\sum_k f_k(\mathbf{x}) \pi_k}</math>,<br />
<br />
with <math>\, f_k(\mathbf{x})=\prod_{j=1}^d f_{kj}(x_j)</math> and <math>\ \mathbf{x} \in \mathbb{R}^d</math>.<br />
<br />
Note, earlier we assumed class-conditional densities which were multivariate normal with a dense covariance matrix. In this case we are forcing the covariance matrix to be diagonal. This simplification, while not always realistic, can provide a more robust model.<br />
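A minimal NumPy sketch of this diagonal-covariance (Gaussian naive Bayes) model, fitting one independent normal per feature per class and classifying by the MAP rule:<br />

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Gaussian naive Bayes: per class, fit an independent normal to each
    feature (equivalent to a diagonal covariance matrix)."""
    classes = np.unique(y)
    priors = np.array([(y == c).mean() for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    vars_ = np.array([X[y == c].var(axis=0) for c in classes])
    return classes, priors, means, vars_

def predict_gaussian_nb(X, classes, priors, means, vars_):
    # log pi_k + sum_j log N(x_j | mu_kj, sigma_kj^2), then argmax over k
    ll = -0.5 * (np.log(2 * np.pi * vars_)[None] +
                 (X[:, None, :] - means[None]) ** 2 / vars_[None]).sum(axis=2)
    return classes[(ll + np.log(priors)[None]).argmax(axis=1)]

# Two well-separated synthetic classes in R^2
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(5.0, 1.0, size=(50, 2))])
y = np.r_[np.zeros(50), np.ones(50)]

model = fit_gaussian_nb(X, y)
acc = (predict_gaussian_nb(X, *model) == y).mean()
assert acc > 0.95
```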
<br />
As another example, consider the 'iris' dataset in R. We would like to use known data (sepal length, sepal width, petal length, and petal width) to predict species of iris. As is typically done, we will use the maximum a posteriori (MAP) rule to decide the class to which each observation belongs. The code for using a built-in function in R to classify is:<br />
<br />
<pre style="align:left; width: 75%; padding: 2% 2%"><br />
#If you were to use a built-in function for Naive Bayes Classification, <br />
#this is how it would work:<br />
<br />
library(lattice) #these are the libraries from which packages are needed<br />
library(class)<br />
library(e1071)<br />
<br />
count = 0 #This will keep track of properly classified objects<br />
attach(iris)<br />
model <- (Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width)<br />
m <- naiveBayes(model, data = iris)<br />
p <- predict(m, iris) #You could also use a table here<br />
for(i in 1:length(Species)) {<br />
if (p[i] == Species[i]) {<br />
count = count + 1<br />
}}<br />
misclass = (length(Species)-count)/length(Species)<br />
misclass<br />
#So we get that 4% of the points are misclassified.<br />
</pre><br />
<br />
In this particular dataset, we would not expect naïve Bayes to be the best approach for classification, since the assumption of independent predictor variables is violated (sepal length and sepal width are related, for example). However, misclassification rate is low, which indicates that naïve Bayes does a good job of classifying these data.<br />
<br />
=== [http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm K-Nearest-Neighbors(k-NN)] ===<br />
<br />
[[File:KNN.jpg|250px|thumb|right|Classifying x by assigning it the label most frequently represented among k nearest samples and use a voting scheme.]]<br />
<br />
Given a data point x, find the k nearest data points to x and classify x using the majority vote of these k neighbors (k is a positive<br />
integer, typically small.) If k=1, then the object is simply assigned to the class of its nearest neighbor.<br />
<br />
<br />
# Ties can be broken randomly.<br />
# k can be chosen by cross-validation<br />
# k-nearest neighbor algorithm is sensitive to the local structure of the data<ref><br />
http://www.saylor.org/site/wp-content/uploads/2011/02/Wikipedia-k-Nearest-Neighbor-Algorithm.pdf</ref>.<br />
# Nearest neighbor rules in effect compute the decision boundary in an implicit manner.<br />
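A minimal sketch of the k-NN rule itself (here ties are broken by label order rather than randomly):<br />

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean "closeness" metric
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[counts.argmax()]                # ties broken by label order here

# Two tight clusters of labeled training examples
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [5.0, 5.0], [5.1, 4.9], [4.9, 5.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])

assert knn_predict(X_train, y_train, np.array([0.05, 0.05])) == 0
assert knn_predict(X_train, y_train, np.array([5.05, 5.05])) == 1
```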
<br />
=====Requirements of k-NN:=====<br />
<ref>http://courses.cs.tamu.edu/rgutier/cs790_w02/l8.pdf</ref><br />
# An integer k<br />
# A set of labeled examples (training data)<br />
# A metric to measure “closeness”<br />
<br />
=====Advantages:=====<br />
# Approaches an optimal solution as the sample size grows large.<br />
# Simple implementation<br />
# There are some noise reduction techniques that work only for k-NN to improve the efficiency and accuracy of the classifier.<br />
<br />
=====Disadvantages:=====<br />
# If the training set is too large, it may have poor run-time performance.<br />
# k-NN is very sensitive to irrelevant features since all features contribute to the similarity and thus to classification.<ref><br />
http://www.google.ca/url?sa=t&rct=j&q=k%20nearest%20neighbors%20disadvantages&source=web&cd=1&ved=0CCIQFjAA&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.100.1131%26rep%3Drep1%26type%3Dpdf&ei=3feyToHMG8Xj0QGOoMDKBA&usg=AFQjCNFF1XsYgZy1W2YLQMNTq_7s07mfqg&sig2=qflY4MffEHwP9n-WpnWMdg</ref><br />
# A small training set can lead to a high misclassification rate.<br />
# kNN suffers from the curse of dimensionality. As the number of dimensions of the feature space increases, points become further apart from each other, making it harder to classify new points. In 10 dimensions, each point needs to cover an area of approximately 80% the value of each coordinate to capture 10% of the data. (See textbook page 23). Algorithms to solve this problem include approximate nearest neighbour. <ref>P. Indyk and R. Motwani, Approximate nearest neighbors: towards removing the curse of dimensionality. STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing. pg 604-613.</ref><br />
<br />
=====Extensions and Applications=====<br />
<br />
In order to improve the obtained results, we can do the following:<br />
# Preprocessing: smoothing the training data (remove any outliers and isolated points)<br />
# Adapt metric to data<br />
<br />
Besides classification, k-nearest-neighbours is useful for other tasks as well. For example, k-NN has been used in regression and in product recommendation systems<ref><br />
http://www.cs.ucc.ie/~dgb/courses/tai/notes/handout4.pdf</ref>.<br />
<br />
In 1996 Support Vector Regression <ref>"Support Vector Regression Machines". Advances in Neural Information Processing Systems 9, NIPS 1996, 155–161, MIT Press.</ref> was proposed. SVR depends only on a subset of the training data, since the cost function ignores training points that lie within a threshold of the prediction.<br />
<br />
SVM is commonly used in bioinformatics. Common uses include classification of DNA sequences, promoter recognition, and identifying disease-related microRNAs. Promoters are short sequences of DNA that act as a signal for gene expression. In one paper, Robertas Damaševičius tries using a power series kernel function and 11 classification rules for data projection to classify these sequences, to aid active gene location.<ref>Damaševičius, Robertas. "Analysis of Binary Feature Mapping Rules for Promoter Recognition in Imbalanced DNA Sequence Datasets using Support Vector Machine". Proceedings from 4th International IEEE Conference "Intelligent Systems". 2008.</ref> MicroRNAs are non-coding RNAs that target mRNAs for cleavage in protein synthesis. There is growing evidence suggesting that microRNAs "play important roles in human disease development, progression, prognosis, diagnosis and evaluation of treatment response". Therefore, there is increasing research into the role of microRNAs underlying human diseases. SVM has been proposed as a method of separating positive microRNA disease-associations from negative ones.<ref>Jiang, Qinghua; Wang, Guohua; Zhang, Tianjiao; Wang, Yadong. "Predicting Human microRNA-disease Associations Based on Support Vector Machine". Proceedings from IEEE International Conference on Bioinformatics and Biomedicine. 2010.</ref><br />
<br />
=====Selecting k=====<br />
Generally speaking, a large k classifies data more precisely than a smaller k, as it reduces the overall noise; but as k increases, so does the cost of computation. To determine an optimal k, cross-validation can be used.<ref>http://chem-eng.utoronto.ca/~datamining/dmc/k_nearest_neighbors_reg.htm</ref> Traditionally, k is fixed for each test example. Another approach, the adaptive k-nearest neighbor algorithm, was proposed to improve the selection of k. In this algorithm, k is not a fixed number but depends on the nearest neighbour of the data point. In the training phase, the algorithm calculates the optimal k for each training data point, which is the minimum number of neighbors required to get the correct class label. In the testing phase, it finds the nearest neighbor of the testing data point and its corresponding optimal k, then performs the k-NN algorithm using that k to classify the data point. <ref>Shiliang Sun, Rongqing Huang, "An adaptive k-nearest neighbor algorithm", 2010 Seventh International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), 2010.</ref><br />
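Selecting k by cross-validation can be sketched as follows (a minimal illustration using leave-one-out error on synthetic two-cluster data; all names and data are my own):<br />

```python
import numpy as np

def loocv_error(X, y, k):
    """Leave-one-out misclassification rate of a k-NN classifier."""
    errors = 0
    for i in range(len(X)):
        dists = np.linalg.norm(X - X[i], axis=1)
        dists[i] = np.inf                      # exclude the held-out point itself
        nearest = np.argsort(dists)[:k]
        vote = np.bincount(y[nearest]).argmax()
        errors += vote != y[i]
    return errors / len(X)

# pick the candidate k with the lowest estimated error
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
best_k = min([1, 3, 5, 7], key=lambda k: loocv_error(X, y, k))
```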
<br />
=====Further Readings=====<br />
1- SVM-KNN: Discriminative Nearest Neighbor Classification for Visual Category Recognition [http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1641014 here]<br />
<br />
2- SVM application list[http://www.clopinet.com/isabelle/Projects/SVM/applist.html here]<br />
<br />
3- The kernel trick for distances [http://74.125.155.132/scholar?q=cache:AfKdFY6a1cMJ:scholar.google.com/&hl=en&as_sdt=2000 here]<br />
<br />
4- Exploiting the kernel trick to correlate fragment ions for peptide identification via tandem mass spectrometry [http://bioinformatics.oxfordjournals.org/content/20/12/1948.short here]<br />
<br />
5- General overview of SVM and Kernel Methods. Easy to understand presentation. [http://www.support-vector.net/icml-tutorial.pdf here]<br />
<br />
== Supervised Principal Component Analysis (Lecture: Nov. 8, 2011) ==<br />
<br />
Recall that '''PCA''' finds the direction of maximum variation of <math>d</math>-dimensional data, and may be used as a dimensionality reduction pre-processing operation for classification. '''FDA''' is a form of supervised dimensionality reduction or feature extraction that finds the best direction to project the data so that the data points can be easily separated into their respective classes, by considering inter- and intra-class distances (i.e. minimize intra-class distance and variance, maximize inter-class distance and variance). PCA differs from FDA in that PCA is unsupervised, whereas FDA is supervised. Thus, FDA is better at finding directions that separate the data points for classification in a supervised problem. <br />
<br />
'''Supervised PCA (SPCA)''' is a generalization of PCA. SPCA can use label information for classification tasks and it has some advantages over FDA. For example, FDA will only project onto <math>\ k-1 </math> dimensional space regardless of the dimensionality of the data where <math>\ k </math> is the number of classes. This is not always desirable for dimensionality reduction.<br />
<br />
SPCA estimates the sequence of principal components having the maximum dependency on the response variable. It can be solved in closed form, has a dual formulation that reduces the computational complexity when the dimension of the data is significantly greater than the number of data points, and it can be kernelized. <ref>Elnaz Barshan, Ali Ghodsi, Zohreh Azimifar, and Mansoor Zolghadri. Supervised Principal Component Analysis: Visualization, Classification and Regression on Subspaces and Submanifolds , Journal of Pattern Recognition, to appear 2011</ref><br />
<br />
===SPCA Problem Statement===<br />
Suppose we are given a set of data <math>\ \{x_i, y_i\}_{i=1}^n , x_i \in R^{p}, y_i \in R^{l}</math>. Note that <math>\ y_i</math> is not restricted to binary classes. So the assumption of having only discrete values for labels is relaxed here, which means this model can be used for regression as well. Target values (<math>\ y </math>) don't have to be in a one dimensional space. Just as for PCA, we are looking for a lower dimensional subspace <math>\ S = U^T X </math>, where <math>\ U </math> is an orthogonal projection. However, instead of finding the direction of maximum variation (as is the case in regular PCA), we are looking for the subspace that contains as much predictive information about <math>\ Y </math> as the original covariate <math>\ X </math>, i.e. we are trying to determine a projection matrix <math>\ U</math> such that <math>\ P(Y|X)=P(Y|U^TX) </math>. We know that predictive information must exist between the original covariate <math>\ X </math> and <math>\ Y </math>, which are assumed to be drawn iid from a joint distribution, because if they were completely independent there would be no way of doing classification or regression.<br />
<br />
===Warning===<br />
If we project our data into a high enough dimension, we can fit any data - even noise. In his book "The God gene: how faith is hardwired into our genes", Dean H. Hamer discusses how factor analysis (a model that "uses regression modelling techniques to test hypotheses producing error terms") was used to find a correlation between a gene (VMAT2) and a person's belief in God. The full book is available at: <ref>http://books.google.ca/books?id=TmR6uAAHEssC&pg=PA33&lpg=PA33&dq=god+gene+statistics&source=bl&ots=8q-jSwKZ8O&sig=O8OBe2YaPbE0vMp9A6PxEC9DwL0&hl=en&ei=lWO8Tp_nN4H40gGA2uXjBA&sa=X&oi=book_result&ct=result&resnum=2&ved=0CCEQ6AEwAQ#v=onepage&q&f=false </ref>. <br />
<br />
It appears as though finding a correlation between seemingly uncorrelated data is sometimes statistically trivial. One study found correlations between people's shopping habits and their genetics. Family members were shown to have far more similar consumer habits than those who did not share DNA. This was then used to explain "fondness for specific products such as chocolate, science-fiction movies, jazz, hybrid cars and mustard." <ref>http://www.businessnewsdaily.com/genetics-incluence-shopping-habits-0593/</ref>.<br />
<br />
The main idea is that when we are in a highly dimensional space <math>\ \mathbb{R}^d</math>, if we do not have enough data (i.e. <math>n \approx d</math>), then it is easy to find a classifier that separates the data across its many dimensions.<br />
<br />
===Different Techniques for Dimensionality Reduction===<br />
* Classical '''Fisher's Discriminant Analysis (FDA)'''<br />
<br />
The goal of FDA is to reduce the dimensionality of data in <math>\ \mathbb{R}^d</math> in order to have separable data points in a new space of at most <math>\ k-1</math> dimensions, where <math>\ k</math> is the number of classes.<br />
<br />
* '''Metric Learning (ML)'''<br />
<br />
This is a large family of methods.<br />
<br />
* '''Sufficient Dimensionality Reduction (SDR)'''<br />
<br />
This is also a family of methods. In recent years SDR has been used to denote a body of new ideas and methods for dimension reduction. Like Fisher's classical notion of a sufficient statistic, SDR strives for reduction without loss of information. But unlike sufficient statistics, sufficient reductions may contain unknown parameters and thus need to be estimated.<br />
<br />
* '''Supervised Principal Components (BSPC)'''<br />
<br />
A method proposed by Bair et al. This is a different method from the SPCA method discussed in class despite having a similar name.<br />
<br />
===Metric Learning ===<br />
First define a new metric as:<br />
<br />
<math>\ d_A(\mathbf{x}_i, \mathbf{x}_j)=||\mathbf{x}_i -\mathbf{x}_j||_A = \sqrt{(\mathbf{x}_i - \mathbf{x}_j)^TA(\mathbf{x}_i - \mathbf{x}_j)}</math> <br />
<br />
This metric satisfies the requisite properties of a metric only if <math>\ A </math> is a positive definite matrix. <br />
This restriction is often relaxed to positive semi-definite. Relaxing the condition may be required if we wish to disregard uninformative covariates. <br />
<br />
''Note 1:'' <math>\ A </math> being positive semi-definite ensures that this metric respects non-negativity and the triangle inequality, but allows <math>\ d_A(\mathbf{x}_i,\mathbf{x}_j)=0</math> to not imply <math>\ \mathbf{x}_i=\mathbf{x}_j</math> <ref name="Xing">Xing, EP. Distance metric learning with application to clustering with side-information. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.67.7952&rep=rep1&type=pdf]</ref>. <br />
<br />
''Common choices for A'' <br />
<br />
1)<math>\ A=I</math> This represents Euclidean distance. <br />
<br />
2)<math>\ A=D</math> where <math>\ D</math> is a diagonal matrix. The diagonal values can be thought of as reweighting the importance of each covariate, and these weights can be learned from training data.<br />
<br />
3)<math>\ A=D</math> where <math>\ D</math> is a diagonal matrix with <math>\ D_{ii} = Var(i^{th} covariate)^{-1} </math>. This represents scaling down each covariate so that they all have equal variance and thus equal impact on the distance. This metric is consistent with, and works very well for, covariates that are independent and normally distributed.<br />
<br />
4)<math>\ A=\Sigma^{-1} </math> where <math>\ \Sigma </math> is the covariance matrix of the covariates. This metric is consistent with, and works very well for, covariates that are normally distributed. The corresponding metric is called the Mahalanobis distance.<br />
<br />
When dealing with data that are on different measurement scales, choices 3 and 4 are vastly preferable to Euclidean distance, as they prevent covariates with large measurement scales from dominating the metric. <br />
<br />
<br />
For metric learning, construct the Mahalanobis distance over the input space and use it instead of the Euclidean distance. This is equivalent to transforming the data points using a linear transformation and then computing the Euclidean distance in the new transformed space. To see that this is true, suppose we project each data point onto a subspace <math>\ S </math> using <math>\ \mathbf{x}' = U^T\mathbf{x}</math> and calculate the Euclidean distance: <br />
<br />
<math>\ ||\mathbf{x}_i' - \mathbf{x}_j'||_2^2= (U^T\mathbf{x}_i -U^T\mathbf{x}_j)^T(U^T\mathbf{x}_i -U^T\mathbf{x}_j) = (\mathbf{x}_i -\mathbf{x}_j)^TUU^T(\mathbf{x}_i -\mathbf{x}_j)</math> <br />
<br />
This is the same as Mahalanobis distance in the new space for <math>\ A=UU^T</math>.<br />
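This equivalence is easy to check numerically (a small sketch; the projection matrix and points are randomly generated purely for illustration):<br />

```python
import numpy as np

rng = np.random.default_rng(1)
U = rng.normal(size=(3, 2))      # projection onto a 2-d subspace
A = U @ U.T                      # the induced (semi-)metric matrix
xi, xj = rng.normal(size=3), rng.normal(size=3)

d_A = np.sqrt((xi - xj) @ A @ (xi - xj))        # metric-learning form d_A(x_i, x_j)
d_euclid = np.linalg.norm(U.T @ xi - U.T @ xj)  # Euclidean distance after projecting
```

The two distances agree to machine precision, as the derivation above predicts.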
<br />
One way to find <math>\ A</math> is to consider the set of similar pairs <math>\ (\mathbf{x}_i,\mathbf{x}_j) \in S</math> and the set of dissimilar pairs <math>\ (\mathbf{x}_i,\mathbf{x}_j) \in D</math>. Then we can solve the convex optimization problem below <ref name="Xing" />.<br />
<br />
<math> \min_A \sum_{(\mathbf{x}_i,\mathbf{x}_j)\in S} (\mathbf{x}_i - \mathbf{x}_j)^TA(\mathbf{x}_i - \mathbf{x}_j) </math><br />
<br />
s.t. <math> \sum_{(\mathbf{x}_i,\mathbf{x}_j)\in D} (\mathbf{x}_i - \mathbf{x}_j)^TA(\mathbf{x}_i - \mathbf{x}_j)\ge 1 </math> and <math>\ A</math> positive semi-definite.<br />
<br />
<br />
Overall, the metric learning technique will attempt to minimize the squared induced distance between similar points while maximizing the squared induced distance between dissimilar points and search for a metric which allows points from the same class to be near one another and points from different classes to be far from one another.<br />
<br />
===Sufficient Dimensionality Reduction (SDR)===<br />
<br />
The goal of dimensionality reduction is to find a function <math>\ S(\mathbf{x}) </math> that maps <math>\ \mathbf{x} </math> from <math>\ \mathbb{R}^n </math> to a proper subspace, which means that the dimension of <math>\ \mathbf{x} </math> is being reduced. An example of <math>\ S(\mathbf{x}) </math> would be a function that uses several linear combinations of <math>\ \mathbf{x} </math>.<br />
<br />
For a dimensionality reduction to be sufficient the following condition must hold:<br />
<br />
::<math>\ P_{Y|X}(y|x) = P_{Y|S(X)}(y|S(x)) </math><br />
<br />
Which is equivalent to saying that the distribution of <math>\ y|S(\mathbf{x})</math> is the same as <math>\ y |\mathbf{x} </math> [http://rsta.royalsocietypublishing.org/content/367/1906/4385.full]<br />
<br />
This method aims to find a linear subspace <math>\ R </math> such that the projection onto this subspace preserves <math>\ P_{Y|X}(y|x) </math>.<br />
<br />
Suppose that <math>\ S(\mathbf{x}) = U^T\mathbf{x} </math> is a sufficient dimensional reduction, then<br />
<br />
<math>\ P_{Y|X}(y|x) = P_{Y|U^TX}(y|U^T x) </math><br />
<br />
for all <math>\ x \in X </math>, and <math>\ y \in Y </math>, where <math>\ U^T X </math> is the orthogonal projection of <math>\ X </math> onto <math>\ R </math>.<br />
<br />
====Graphical Motivation====<br />
In a regression setting, it is often useful to summarize the distribution of <math>y|\textbf{x}</math> graphically. For instance, one may consider a scatter plot of <math>y</math> versus one or more of the predictors. A scatter plot that contains all available regression information is called a sufficient summary plot.<br />
<br />
When <math>\textbf{x}</math> is high-dimensional, particularly when the number of features of <math>\ X </math> exceed 3, it becomes increasingly challenging to construct and visually interpret sufficiency summary plots without reducing the data. Even three-dimensional scatter plots must be viewed via a computer program, and the third dimension can only be visualized by rotating the coordinate axes. However, if there exists a sufficient dimension reduction <math>R(\textbf{x})</math> with small enough dimension, a sufficient summary plot of <math>y</math> versus <math>R(\textbf{x})</math> may be constructed and visually interpreted with relative ease.<br />
<br />
Hence sufficient dimension reduction allows for graphical intuition about the distribution of <math>y|\textbf{x}</math>, which might not have otherwise been available for high-dimensional data.<br />
<br />
Most graphical methodology focuses primarily on dimension reduction involving linear combinations of <math>\textbf{x}</math>. The rest of this article deals only with such reductions.[http://en.wikipedia.org/wiki/Sufficient_dimension_reduction#Graphical_motivation]<br />
<br />
====Other Methods for Reduction====<br />
Two very common examples of SDR are Sliced Inverse Regression (SIR) and Sliced Average Variance Estimation (SAVE). More information on SIR can be found here [http://en.wikipedia.org/wiki/Sliced_inverse_regression]. In addition [http://mars.wiwi.hu-berlin.de/mediawiki/teachwiki/index.php/Sliced_Inverse_Regression] also provides some examples for SIR.<br />
<br />
===Supervised Principal Components (BSPC)===<br />
<br />
BSPC algorithm:<br />
<br />
1. Compute (univariate) standard regression coefficients for each feature j using the following formula:<br />
<br />
<math>\ s_j=\frac{{X_j}^TY}{\sqrt{X_j^T X_j}} </math><br />
<br />
2. Form the reduced data matrix <math>X_\theta </math> consisting of the columns of <math>X</math> with <math>\ |s_j|>\theta</math>. Find <math>\ \theta</math> by cross-validation. <br />
<br />
3. Compute the first principal component of the reduced data matrix <math>X_\theta </math><br />
<br />
4. Use the principal component calculated in step (3) in a regression model or a classification algorithm to produce the outcome<br />
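The screening and PCA steps above can be sketched as follows (an illustrative implementation; here the threshold is passed in directly rather than chosen by cross-validation as in step 2, and all names and data are my own):<br />

```python
import numpy as np

def bspc_component(X, y, theta):
    """Steps 1-3 of the BSPC algorithm: screen features by their univariate
    regression coefficient, then take the first PC of the retained columns.
    X is n x p, y has length n, theta is the screening threshold."""
    s = (X.T @ y) / np.sqrt(np.einsum('ij,ij->j', X, X))  # s_j = X_j^T y / sqrt(X_j^T X_j)
    X_reduced = X[:, np.abs(s) > theta]                   # keep columns with |s_j| > theta
    Xc = X_reduced - X_reduced.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[0]                                     # scores on the first principal component

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = X[:, 0] + 0.1 * rng.normal(size=50)   # outcome driven by feature 0
pc = bspc_component(X, y, theta=1.0)      # step 4 would feed pc into a regression model
```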
<br />
<br />
Bair's SPCA is consistent. In normal PCA, as the number of data points increases, the estimated components can keep taking different directions; the direction of the first component of Bair's SPCA, however, remains consistent as the number of points increases <ref>Bair E., Prediction by supervised principal components. [http://stat.stanford.edu/~tibs/ftp/spca.pdf]</ref>.<br />
<br />
===Hilbert-Schmidt Independence Criterion (HSIC)===<br />
"Hilbert-Schmidt Norm of the Cross-Covariance operator" is proposed as an independence criterion in reproducing kernel Hilbert spaces (RKHSs).<br />
<br />
The measure is referred to as the '''Hilbert-Schmidt Independence Criterion (HSIC)'''.<br />
<br />
Let <math>\ z=\{(x_1,y_1),\ldots,(x_n,y_n)\} \subseteq \mathcal{X}\times\mathcal{Y}</math> be a series of <math>\ n</math> independent observations drawn from <math>\ P_{(X,Y)}(x,y)</math>. An estimator of HSIC is given by <br />
<br />
<math>HSIC=\frac{1}{(n-1)^2}Tr(KHBH)</math> <br />
<br />
where <math>H,K,B \in\mathbb{R}^{n \times n}</math><br />
<br />
<math>K_{ij} =k(x_i,x_j),B_{ij}=b(y_i,y_j), H=I-\frac{1}{n}\boldsymbol{e} \boldsymbol{e}^{T} </math>, where <math>\ k</math> and <math>\ b</math> are positive semidefinite kernel functions, and <math>\ \boldsymbol{e} = [1 1 \ldots 1]^T</math>.<br />
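The estimator can be computed directly from this definition (a sketch using a Gaussian kernel for both <math>\ k</math> and <math>\ b</math>; the bandwidth and test data are arbitrary illustrative choices):<br />

```python
import numpy as np

def hsic(X, Y, sigma=1.0):
    """Empirical HSIC = Tr(KHBH) / (n-1)^2 with Gaussian kernels on X and Y."""
    n = len(X)
    def gram(Z):
        sq = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
        return np.exp(-sq / (2 * sigma ** 2))   # Gaussian kernel Gram matrix
    K, B = gram(X), gram(Y)
    H = np.eye(n) - np.ones((n, n)) / n         # centering matrix I - ee^T/n
    return np.trace(K @ H @ B @ H) / (n - 1) ** 2

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 1))
dep = hsic(x, 3 * x)                            # strongly dependent pair
indep = hsic(x, rng.normal(size=(100, 1)))      # independent pair
```

As expected, the dependent pair scores much higher than the independent pair.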
<br />
<math>XH</math> is the centered version of <math>X</math> (the mean of each row is subtracted):<br />
<br />
<math>XH=X(I- \frac{1}{n}\boldsymbol{e} \boldsymbol{e}^T)=X -\frac{1}{n}X\boldsymbol{e} \boldsymbol{e}^T</math>, where each entry in row <math>i</math> of <math>\frac{1}{n}X\boldsymbol{e}\boldsymbol{e}^T</math> is the mean of the <math>i^{th}</math> row of <math>X</math>.<br />
<br />
<math>HBH</math> is the doubly centered version of <math>B</math> (both row and column means are subtracted).<br />
<br />
We introduced a way of measuring the dependence between two distributions. The key idea is that good features should maximize this dependence. Feature selection for various supervised learning problems can be unified under HSIC, and the solutions can be approximated using a backward-elimination algorithm. To motivate this, consider how to tell whether two distributions are the same. If they have different mean values, we can say right away that they are different. If they share the same mean value, however, we need to compare their second moments, from which we can derive the variance, and in general higher and higher moments. Hence we need to look at higher dimensions to tell whether two distributions are equal.<br />
<br />
It can be shown mathematically (although not done in class) that if we define a mapping <math>\ \phi </math> of the random variable X into a higher-dimensional space, then there exists a unique correspondence between <math>\ \mu_x</math>, the mean of <math>\ \phi(x)</math> in that higher-dimensional space, and the distribution of X. This suggests that <math>\ \mu_x</math> can reproduce the distribution of X.<br />
<br />
Hence, to figure out whether two random variables X and Y have the same distribution, we can take the norm of the difference between <math>\ E[\phi(x)] </math> and <math>\ E[\phi(y)]</math>,<br />
i.e.<br />
<math>|| E[\phi (x)] - E[\phi(y)] ||^2</math><br />
If this value is equal to 0, then we know that they have the same distribution.<br />
<br />
Now, to test the independence of <math>\ P_x</math> and <math>\ P_y</math>, we can apply the previous formula to <math>\ P_{xy}</math> and <math>\ P_x P_y</math>: if it equals 0, then the two distributions <math>\ P_x</math> and <math>\ P_y</math> are independent. The larger the difference, the greater the dependence between X and Y.<br />
<br />
Utilizing this, we can find the <math>\ U^TX </math> in <math>\ P(Y|X)=P(Y|U^TX) </math> that maximizes the HSIC between <math>\ U^TX </math> and <math>\ Y</math>, i.e. the maximum dependence between <math>\ U^TX </math> and <math>\ Y</math>.<br />
<br />
<br />
This gives rise to the index called HSIC:<br />
<br />
<math>\ KHBH </math><br />
<br />
where X and Y are random variables, K is the kernel matrix over X, and B is the kernel matrix over Y.<br />
<br />
==='''Kernel Function'''===<br />
A positive definite kernel can always be written as inner products of a feature mapping.<br /><br />
To prove a valid kernel function:<br /><br />
1. define a feature <math> \phi(x) </math> mapping into some vector space.<br /><br />
2. define a dot product in a strictly positive definite form<br /><br />
3. Show that <math>\ k(x, x') = <\phi(x),\phi(x')></math><br /><br />
[http://www.public.asu.edu/~ltang9/presentation/kernel.pdf].<br />
<br />
The kernel function will be used when calculating <math>|| E\phi(x) - E\phi(y) ||^2</math>.<br />
The possible kernel functions we can choose are:<br />
<br />
* Linear kernel: <math>\,k(x,y)=x \cdot y</math><br />
* Polynomial kernel: <math>\,k(x,y)=(x \cdot y)^d</math><br />
* Gaussian kernel: <math>e^{-\frac{|x-y|^2}{2\sigma^2}}</math><br />
* Delta Kernel: <math>\,k(x_i,x_j) =<br />
\begin{cases}<br />
1 & \text{if }x_i=x_j \\ 0 & \text{if }x_i\ne x_j<br />
\end{cases}<br />
</math><br />
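Each of these kernels is a one-liner (a sketch; the function and parameter names are my own, chosen for illustration):<br />

```python
import numpy as np

def linear_kernel(x, y):
    return x @ y                                    # k(x,y) = x . y

def polynomial_kernel(x, y, d=2):
    return (x @ y) ** d                             # k(x,y) = (x . y)^d

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def delta_kernel(x, y):
    return 1.0 if np.array_equal(x, y) else 0.0     # 1 iff x == y
```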
<br />
H is a constant matrix of the form: <math>\ H = I - \frac{1}{n}ee^T </math><br />
<br />
where <math>\ e = \left( \begin{array}{c}1 \\ \vdots \\ 1 \end{array} \right) </math>.<br />
<br />
H centers any matrix that it multiplies, so <math>\ HBH</math> makes <math>\ B</math> doubly centered.<br />
<br />
<br />
We wanted the transformation <math>\ U^TX </math> that has maximum dependence on Y. So we use the HSIC index to measure the dependence between <math>\ U^TX</math> and <math>\ Y</math> and maximize it.<br />
<br />
'''H''' centers X via <math>\ XH</math> (i.e. it subtracts the mean, <math>X-\mu</math>); the larger the value of the index, the more dependent the two variables are on each other.<br />
<br />
So basically we want to maximize <math>\ Tr(KHBH)</math>:<br />
<br />
<math>\ \max_U Tr(KHBH)</math><br />
<br />
<math>\ = \max_U Tr(X^TUU^TXHBH)</math><br />
<br />
<math>\ = \max_U Tr(U^TXHBHX^TU)</math><br />
<br />
We add a constraint to solve this problem: <br />
<br />
<math>\ U^TU=I</math><br />
<br />
Then this is identical to PCA if <math>\ B=I</math><br />
<br />
===SPCA: Supervised Principal Component Analysis===<br />
<br />
We need to find <math>\ U </math> to maximize <math>\ Tr(HKHB) </math> <br />
where K is a Kernel of <math>\ U^T X </math> (eg: <math>\ X^T UU^T X </math>) and <math>\ B </math> is a Kernel of <math>\ Y </math>(eg: <math>\ Y^T Y </math>):<br />
<br />
{| class="wikitable" cellpadding="5"<br />
|- align="center" <br />
! <math>\ X </math><br />
! <math>\ Y </math><br />
|- align="center"<br />
| <math>\ U^T X </math><br />
| <math>\ Y </math><br />
|-<br />
| <math>\ (U^T X)^T (U^T X) = X^T UU^T X </math><br />
| <math>\ B </math><br />
|}<br />
<br />
<math>\max \; Tr(HKHB) </math><br />
<math>\ \; \; = \; \max Tr(HX^T UU^T XHB) </math><br />
<math>\ \; \; = \; \max Tr(U^T XHBHX^T U) </math><br />
<math>\ subject \; to \; U^T U = I </math><br />
<br />
===Supervised Principal Component Analysis and Conventional PCA===<br />
<br />
[[File:012DR-PCA.jpg|300px|thumb|right|Dimensionality Reduction of the 0-1-2 Data, Using PCA]]<br />
[[File:012DR-SPCA.jpg|300px|thumb|right|Dimensionality Reduction of the 0-1-2 Data, Using Supervised PCA]]<br />
<br />
<br />
This is identical to PCA if B = I:<br />
<br />
<math>XHBHX^T = XHX^T = (X-\mu e^T)(X-\mu e^T)^T \propto cov(X)</math><br />
<br />
===SPCA===<br />
Algorithm 1 <br />
- Recover basis: Calculate <math>Q=XHBHX^T</math> and let <math>U</math> = eigenvectors of <math>Q</math> corresponding to the top <math>d</math> eigenvalues.<br />
- Encode training data: <math>Z=U^TXH</math> where <math>Z</math> is the <math>d \times n</math> matrix of encodings of the original data <br />
- Reconstruct training data: <math>\hat{X}=UZ=UU^TXH</math> <br />
- Encode test example: <math>z=U^T(x-\mu)</math> where <math>z</math> is a <math>d</math>-dimensional encoding of <math>x</math>. <br />
- Reconstruct test example: <math>\hat{x}=Uz=UU^T(x-\mu)</math> <br />
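The algorithm can be sketched directly from these steps (illustrative code with synthetic data; B is taken to be the linear label kernel <math>Y^TY</math>, one of the choices mentioned in the notes):<br />

```python
import numpy as np

def spca_fit(X, Y, d):
    """SPCA Algorithm 1 sketch. X is p x n, Y (labels) is l x n.
    Returns the p x d basis U of top eigenvectors of Q = XHBHX^T."""
    n = X.shape[1]
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = Y.T @ Y                              # linear kernel over the labels
    Q = X @ H @ B @ H @ X.T
    vals, vecs = np.linalg.eigh(Q)
    return vecs[:, np.argsort(vals)[::-1][:d]]   # eigenvectors of the top d eigenvalues

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 30))             # p=5 features, n=30 points
Y = (X[:1] > 0).astype(float)            # 1 x n labels driven by feature 0
U = spca_fit(X, Y, d=2)
H = np.eye(30) - np.ones((30, 30)) / 30
Z = U.T @ X @ H                          # encode the training data
```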
<br />
Find U that maximizes <math>Tr(HKHB)</math> where K is a kernel of <math>U^TX</math> (e.g. <math>K=X^TUU^TX</math>) and B is a kernel of Y (e.g. <math>B=Y^TY</math>).<br />
<br />
<math><br />
\max_U Tr(KHBH) <br />
= \max_U Tr(X^TUU^TXHBH) <br />
= \max_U Tr(U^TXHBHX^TU) </math> since we can cyclically permute factors inside a trace<br />
<br />
===Dual Supervised Principal Component Analysis===<br />
<br />
<br />
Let <math>Q = XHBHX^T</math>. Both <math>Q</math> and <math>B</math> are positive semi-definite, so we can write<br />
<br />
<math>Q = \psi\psi^T</math><br />
<math>B = \Delta^T\Delta</math><br />
<math>\psi = XH\Delta^T</math><br />
<br />
The solution for U can be expressed through the singular value decomposition (SVD) of <math>\psi</math>:<br />
<br />
<math>\psi = U \Sigma V^T</math><br />
<math>\rightarrow \psi V = U \Sigma</math><br />
<math>\rightarrow \psi V \Sigma^{-1} = U</math><br />
<math>\rightarrow U^T = \Sigma^{-1} V^T \psi^T </math><br />
<math>\rightarrow U^T XH = \Sigma^{-1} V^T \psi^T XH </math><br />
<br />
This gives a relationship between V and U. You can substitute these into the algorithm above and define everything based on V instead of U. By doing this you do not need to find the eigenvectors of Q, which has high dimensionality.<br />
<br />
<br />
Algorithm 2 <br /><br />
Recover basis: calculate <math>\psi^T \psi</math> and let V=eigenvector of <math>\psi^T \psi</math> corresponding to the top d eigenvalues. Let <math>\Sigma</math>=diagonal matrix of square roots of the top d eigenvalues. <br /><br />
<br />
Reconstruct training data:<br />
<math>\hat{X}=UZ=XH\Delta^T V \Sigma^{-2}V^T\Delta H(X^T X)H </math> <br /><br />
<br />
Encode test examples: <math>z=U^T(x-\mu)=\Sigma^{-1}V^T \Delta H[X^T(x-\mu)] </math> where <math>z</math> is a <math>d</math>-dimensional encoding of <math>x</math>.<br />
<br />
===Towards a Unified Network===<br />
<br />
{| class="wikitable"<br />
|-<br />
! <br />
! B<br />
! Constraint<br />
! Component<br />
|-<br />
| PCA<br />
| I<br />
| <math>\omega^T \omega = I</math><br />
| <br />
|-<br />
| FDA<math>^{(1)}</math><br />
| <math>B_0</math><br />
| <math>\omega^T S_\omega \omega = I</math><br />
| <math>S_\omega = X B_s X^T</math><br />
|-<br />
| CFML I<math>^{(2)}</math><br />
| <math>B_0 - B_s</math><br />
| <math>\omega^T \omega = I</math><br />
| <br />
|-<br />
| CFML II<math>^{(2)}</math><br />
| <math>B_0</math><br />
| <math>\omega^T S_\omega \omega = I</math><br />
| <math>S_\omega = X B_s X^T</math><br />
|}<br />
(1)<math>B_s=F(F^{T}F)^{-1}F^T</math>, (2) <math>B_s=\tfrac{1}{n}FF^{T}</math> ,<math>B_D=H-B_s</math>, <math>n</math> # of data points,<br />
<math>F</math> indicator matrix of cluster, <math>H</math> the centering matrix<br />
<br />
===Dual Supervised PCA===<br />
{| class="wikitable"<br />
|-<br />
! <br />
! B<br />
! Constraint<br />
! Component<br />
|-<br />
| KPCA<br />
| I<br />
| <math>UU^T = I</math><br />
| Arbitrary<br />
|-<br />
| K-means<br />
| I<br />
| <math>UU^T = I, U\ge 0</math><br />
| Linear<br />
|}<br />
<br />
== Boosting (Lecture: Nov. 10, 2011) ==<br />
<br />
Boosting is a meta-algorithm that starts with a simple classifier and improves it by refitting the data, giving higher weight to misclassified samples. <br />
<br />
<br />
Suppose that <math>\mathcal{H}</math> is a collection of classifiers. Assume that <br />
<math>\ y_i \in \{-1, 1\} </math> and that each <math>\ h(x)\in \{-1, 1\} </math>. Start with <math>\ h_1(x) </math>. Based on how well <math>\ h_1 (x) </math> classifies points, adjust the weights of each input and reclassify. Misclassified points are given higher weight to ensure the classifier "pays more attention" to them, so that the next iteration fits them better. The idea behind boosting is to obtain a classification rule from each classifier <math> h_i(x)\in\mathcal{H}</math>, regardless of how well it classifies the data on its own (with the proviso that its performance be better than chance), and combine all of these rules to obtain a final classifier that performs well. <br />
<br />
[[File:boosting1.jpg]]<br />
<br />
<br />
An intuitive way to look at boosting and the concept of weighting is to think about extreme weightings. Suppose you are doing classification on a set and some points are misclassified. Now suppose that any points that have been classified correctly are removed from the data; a weak classifier refit to the remaining points may do a good job on them. This is how early versions of boosting worked, using resampling instead of re-weighting. <br />
<br />
=== AdaBoost ===<br />
'''Adaptive Boosting (AdaBoost)''' was formulated by Yoav Freund and Robert Schapire. AdaBoost is defined as an algorithm for constructing a “strong” classifier as linear combination <math>f(\mathbf{x}) = \sum_{t=1}^T \alpha_t h_t(\mathbf{x}) </math> of simple “weak” classifiers <math>\ h_t(\mathbf{x})</math>. It is very popular and widely known as the first algorithm that could adapt to weak learners <ref>http://www.cs.ubbcluj.ro/~csatol/mach_learn/bemutato/BenkKelemen_Boosting.pdf </ref>. <br />
<br />
It has the following properties:<br />
<br />
* It is a linear classifier with all its desirable properties<br />
* It has good generalization properties<br />
* It is a feature selector with a principled strategy (minimisation of upper bound on empirical error)<br />
* It is close to sequential decision making<br />
<br />
====Algorithm Version 1====<br />
The AdaBoost algorithm presented in the lecture is as follows (for more info see [http://www.site.uottawa.ca/~stan/csi5387/boost-tut-ppr.pdf]):<br />
<br />
1 Set the weights <math>\ w_i=\frac{1}{n}, i = 1,...,n. </math> <br /><br />
<br />
2 For <math>\ j =1,...,J </math>, do the following steps:<br />
<br />
:a) Find the classifier <math>\ h_j: \mathbf{x} \rightarrow \{-1,1\} </math> that minimizes the weighted error <math>\ L_j </math>:<br />
<br />
:<math>\ h_j= \underset{h\in \mathcal{H}}{\mbox{arg min}}\ L_j(h)</math><br />
<br />
:where <math>\ L_j(h) = \frac{\sum_{i=1}^{n}w_iI[y_i\ne h(x_i)]}{\sum_{i=1}^{n} w_i}</math> and <math>\ L_j = L_j(h_j)</math> denotes the minimized weighted error<br />
<br />
:<math>\mathcal{H} </math> is the set of weak classifiers under consideration and <math>\ I</math> is the indicator function<br />
::<math>\, I= \left\{\begin{matrix} <br />
1 & for \quad y_i\neq h_j(\mathbf{x}_i) \\ <br />
0 & for \quad y_i = h_j(\mathbf{x}_i) \end{matrix}\right.</math><br /><br />
<br />
:b) Let <math>\alpha_j= log(\frac{1-L_j}{L_j})</math><br />
<br />
::Note that <math>\ \alpha</math> indicates the "goodness" of the classifier, where a larger <math>\ \alpha</math> value indicates a better classifier. Also, <math>\ \alpha</math> is always 0 or positive as long as the classification accuracy is 0.5 or higher. For example, if working with coin flips, then <math>\ L_j=0.5 </math> and <math>\ \alpha=0</math>.<br />
<br />
:c) Update the weights:<br />
<br />
::<math>\ w_i \leftarrow w_i e^{\alpha_j I[y_i\ne h_j(\mathbf{x}_i)]}</math><br />
::Note that the weights are only increased for points that have been misclassified by a good classifier.<br /> <br />
<br />
3 The final classifier is: <math>\ h(\mathbf{x}) = sign (\sum_{j=1}^{J}\alpha_j h_j(\mathbf{x}))</math>. <br />
<br />
:Note that this is basically an aggregation of all the classifiers found and the classification outcomes of better classifiers are weighted more using <math>\ \alpha</math>.<br />
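The algorithm above can be sketched in Python (a minimal illustration written for this summary, not a library implementation; the function names and the pool of threshold "stumps" in the usage note are our own):<br />

```python
import math

def adaboost(points, labels, weak_classifiers, J):
    """AdaBoost (version 1 above): labels and classifier outputs are in {-1, +1}."""
    n = len(points)
    w = [1.0 / n] * n                       # step 1: uniform initial weights
    ensemble = []
    for _ in range(J):
        # step 2a: pick the classifier minimizing the weighted error L_j
        def weighted_error(h):
            wrong = sum(wi for wi, x, y in zip(w, points, labels) if h(x) != y)
            return wrong / sum(w)
        h = min(weak_classifiers, key=weighted_error)
        L = max(weighted_error(h), 1e-10)   # numerical guard when h is perfect
        # step 2b: alpha measures the "goodness" of the classifier
        alpha = math.log((1 - L) / L)
        ensemble.append((alpha, h))
        # step 2c: increase the weights of misclassified points only
        w = [wi * math.exp(alpha) if h(x) != y else wi
             for wi, x, y in zip(w, points, labels)]
    return ensemble

def predict(ensemble, x):
    """Step 3: weighted vote of all classifiers; better ones count more."""
    return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
```

For instance, with decision stumps such as <code>lambda x, t=t: 1 if x > t else -1</code> over a grid of thresholds t (plus their sign-flipped versions) as the pool <math>\mathcal{H}</math>, a few iterations typically drive the training error down.<br />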
<br />
====Algorithm Version 2 <ref>http://www.cs.ubbcluj.ro/~csatol/mach_learn/bemutato/BenkKelemen_Boosting.pdf</ref>====<br />
One of the main ideas of this algorithm is to maintain a distribution or set of weights over the training set. Initially, all weights are set equally, but on each round, the weights of incorrectly classified examples are increased so that the weak learner is forced to focus on the hard examples in the training set.<br />
<br />
* Given <math>\left(\mathbf{x}_1,y_1\right),\dots,\left(\mathbf{x}_m,y_m\right)</math> where <math>{\mathbf{x}_i \in X}</math>, <math>{y_i \in \{-1,+1\}}</math>.<br />
* Initialize weights <math>D_1(i) = \frac{1}{m}</math><br />
* Iterate <math>t=1,\dots, T</math><br />
** Train weak learner using distribution <math>\ D_t</math><br />
** Get weak classifier: <math>h_t:X\rightarrow R</math><br />
** Choose <math>{\alpha_t \in R}</math><br />
** Update the weights: <math>D_{t+1}(i) = \frac {D_i e^{-\alpha_t y_i h_t(\mathbf{x}_i)}} {Z_t}</math><br />
:: where <math>\ Z_t</math> is a normalization factor (chosen so that <math>\ D_{t+1}</math> will be a distribution)<br />
* The final classifier is:<br />
:: <math>H(\mathbf{x})=\mbox{sign}\left(\sum_{t=1}^T \alpha_t h_t(\mathbf{x})\right)</math><br />
<br />
====Example====<br />
<br />
In R, we can apply boosting to a simple classification problem. Suppose we are working with the built-in R dataset "iris". These data consist of the petal length, sepal length, petal width, and sepal width of three different species of iris. Below, an adaptive boosting algorithm is applied to these data.<br />
<br />
<pre style = "align:left; width:100%; padding: 2% 2%"><br />
> library(rpart) #load the packages; "ada" builds on rpart trees<br />
> library(ada)<br />
> crop1 <- iris[1:100,1] #the function "ada" will only handle two classes<br />
> crop2 <- iris[1:100,2] #and the iris dataset has 3. So crop the third off.<br />
> crop3 <- iris[1:100,3]<br />
> crop4 <- iris[1:100,4]<br />
> crop5 <- iris[1:100,5] #This is the response variable, indicating species of iris<br />
> x <- cbind(crop1, crop2, crop3, crop4, crop5) #combine all the columns<br />
> fr1 <- as.data.frame(x, row.names=NULL) #and coerce into a data frame<br />
> <br />
> a = 2 #number of iterations<br />
> AdaBoostDiscrete <- ada(crop5~., data=fr1, iter=a, loss="e", type = "discrete", control = rpart.control())<br />
> AdaBoostDiscrete <br />
Call:<br />
ada(crop5 ~ ., data = fr1, iter = a, loss = "e", type = "discrete", <br />
control = rpart.control())<br />
<br />
Loss: exponential Method: discrete Iteration: 2 <br />
<br />
Final Confusion Matrix for Data:<br />
Final Prediction<br />
True value 1 2<br />
1 50 0<br />
2 0 50<br />
<br />
Train Error: 0 <br />
<br />
Out-Of-Bag Error: 0 iteration= 1 <br />
<br />
Additional Estimates of number of iterations:<br />
<br />
train.err1 train.kap1 <br />
1 1 <br />
<br />
> #Since this yields "perfect" results, we may not need boosting here after all.<br />
> #This was just an illustration of the ada function in R.<br />
</pre><br />
<br />
====Advantages and Disadvantages====<br />
The advantages and disadvantages of AdaBoost are listed below.<br />
<br />
Advantages :<br />
* Very simple to implement <br />
* Fairly good generalization<br />
* The prior error need not be known ahead of time<br />
<br />
Disadvantages:<br />
* Suboptimal solution<br />
* Can overfit in the presence of noise<br />
<br />
===Other boosters===<br />
There are many other more recent boosters such as LPBoost, TotalBoost, BrownBoost, MadaBoost, LogitBoost, stochastic boosting, etc. The main difference between many of them is the way they weight the points in the training data set at each iteration. Some of these boosters, such as AdaBoost, MadaBoost and LogitBoost, can be interpreted as performing gradient descent to minimize a convex cost function (they fit into the AnyBoost framework). However, a recent research study showed that this class of boosters is highly susceptible to random classification noise, calling into question their applicability to real-world noisy classification problems. <ref>Philip M. Long, Rocco A. Servedio, "Random Classification Noise Defeats All Convex Potential Boosters", ICML 2008</ref><br />
<br />
=== Relation to SVM ===<br />
SVM and boosting are very similar except for the way they measure the margin and the way they optimize their weight vector: SVMs use the <math>l_2</math> norm for both the instance vector and the weight vector, while boosting uses the <math>l_1</math> norm for the weight vector. SVMs need the <math>l_2</math> norm in order to implicitly compute scalar products in feature space with the help of the kernel trick; no other norm can be expressed in terms of scalar products.<br />
<br />
Although SVM and AdaBoost share some similarities, there are several important differences:<br />
* Different norms can result in very different margins: in both boosting and SVM the dimension is usually very high, so the difference between the <math>l_1</math> norm and <math>l_2</math> norm margins can be significant.<br />
<br />
For example, suppose the weak hypotheses all have range {-1,1} and that the label y on all examples can be computed by a majority vote of k of the weak hypotheses. In this case, it can be shown that if the number of relevant weak hypotheses is a small fraction of the total number of weak hypotheses, then the margin associated with AdaBoost will be much larger than the one associated with support vector machines.<br />
<br />
* The computational requirements are different: SVM corresponds to quadratic programming, while AdaBoost corresponds only to linear programming.<br />
<br />
* A different approach is used to search efficiently in high-dimensional space: SVM deals with the overfitting problem through the method of kernels, which allows algorithms to perform low-dimensional calculations that are mathematically equivalent to inner products in a high-dimensional "virtual" space, while boosting typically employs a greedy search method.<ref>http://www.iuma.ulpgc.es/camellia/components/com_docman/dl2.php?archive=0&file=c3ZtX2FuZF9ib29zdGluZ19vbmUucGRm</ref><br />
<br />
== Bagging ==<br />
<br />
[[File: Bagging.jpg|250px|thumb|When bagging, we split up the data, train separate classifiers and then recreate a final classifier]]<br />
<br />
'''Bagging (Bootstrap aggregating)''' was proposed by Leo Breiman in 1994. Bagging is another meta-algorithm for improving classification results by combining the classification of randomly generated training sets. [http://www.wikicoursenote.com/wiki/Stat841f10.htm#Bagging][http://en.wikipedia.org/wiki/Bootstrap_aggregating]<br />
<br />
<br />
<br />
The idea behind bagging is very similar to that behind boosting. However, instead of using multiple classifiers on essentially the same dataset (but with adaptive weights), we sample from the original dataset containing m items B times with replacement, obtaining B samples each with m items. This is called bootstrapping. Then, we train the classifier on each of the bootstrapped samples. Taking a majority vote of a combination of all the classifiers, we arrive at a final classifier for the original dataset. [http://www.cs.princeton.edu/courses/archive/spr07/cos424/assignments/boostbag/index.html]<br />
<br />
Bagging is an effective, computationally intensive procedure that can improve on unstable classifiers. It is most useful for highly nonlinear classifiers, such as trees. <br />
<br />
As discussed above, the idea of boosting is to learn h with unequal weights, giving higher weight to misclassified points. Bagging, in contrast, is a method for reducing the variability of a classifier. The idea is to train classifiers <math>\ h_{1}(x)</math> to <math>\ h_{B}(x)</math> using B bootstrap samples from the data set. The final classification is obtained using an average or 'plurality vote' of the B classifiers as follows:<br />
<br />
<br />
:<math>\, h(x)= \left\{\begin{matrix} <br />
1 & \frac{1}{B} \sum_{i=1}^{B} h_{b}(x) \geq \frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
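A minimal sketch of this procedure in Python (the function names and the 1-nearest-neighbour base learner are illustrative choices, not part of the lecture):<br />

```python
import random
from collections import Counter

def bagging_train(data, labels, fit, B, seed=0):
    """Train B classifiers, each on a bootstrap sample of m items drawn
    with replacement from the m-item training set."""
    rng = random.Random(seed)
    m = len(data)
    classifiers = []
    for _ in range(B):
        idx = [rng.randrange(m) for _ in range(m)]          # bootstrap sample
        classifiers.append(fit([data[i] for i in idx],
                               [labels[i] for i in idx]))
    return classifiers

def bagging_predict(classifiers, x):
    """Plurality vote of the B classifiers."""
    return Counter(h(x) for h in classifiers).most_common(1)[0][0]

def fit_1nn(xs, ys):
    """An example base learner: 1-nearest-neighbour on one feature."""
    def h(x):
        i = min(range(len(xs)), key=lambda j: abs(xs[j] - x))
        return ys[i]
    return h
```

Any unstable base learner can be plugged in for <code>fit_1nn</code>; the vote over bootstrap replicates is what reduces the variance.<br />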
<br />
=== Boosting vs. Bagging ===<br />
<br />
* Bagging does not work well with stable models; boosting might still help there.<br />
<br />
* Bagging is easier to parallelize, since the B classifiers can be trained independently, and in practice it almost always helps.<br />
<br />
* On average, boosting usually helps more than bagging, but it is also more likely to hurt performance. In particular, boosting may hurt performance on noisy datasets (its weights on repeatedly misclassified points grow exponentially); bagging does not have this problem.<br />
<br />
* Many classifiers, such as trees, already have underlying functions that estimate the class probabilities at x. An alternative strategy is to average these class probabilities instead of the final classifications. This approach can produce bagged estimates with lower variance and usually better performance.<br />
<br />
== Decision Trees ==<br />
<br />
<br />
[[File: simple_decision_tree.jpg|right|frame|A basic example of a decision tree, iteratively ask questions to navigate the tree until we reach a decision node.]]<br />
<br />
'''Decision tree learning''' is a method commonly used in statistics, data mining and machine learning. The goal is to create a model that predicts the value of a target variable based on several input variables. It is a very flexible model: it can handle non-linear data and can be used for classification, regression, or both. A tree is often used as a visual and analytical decision support tool, where the expected values of competing alternatives are calculated. <br />
<br />
<br />
It uses the principle of divide and conquer for classification. Trees have traditionally been created manually. Trees map features of a decision problem onto a conclusion, or label. We fit a tree model by minimizing some measure of impurity. For a single covariate <math>\ X_1 </math> we choose a point t on the real line that splits the real line into two sets <math>\ R_1 = (-\infty, t] , R_2 = (t, \infty) </math> in a way that minimizes impurity.<br />
<br />
[[File: p.jpg|right|frame|Node impurity for two-class classification, as a function of the proportion p in class 2. Cross-entropy has been scaled to pass through (0.5,0.5).]]<br />
<br />
Let <math>\hat{p_s}(j) </math> be the proportion of observations in <math>\boldsymbol R_s </math> such that <math>\ Y_i = j</math> <br /><br />
<br />
<math>\hat{p_s}(j) = \frac {\sum_{i=1}^n I(Y_i = j, X_i \in \boldsymbol R_s)}{\sum_{i=1}^n I(X_i \in \boldsymbol R_s)}</math><br /><br />
<br />
<br />
Node impurity measures (see figure to the right):<br />
<br />
:Misclassification error: <math>\ 1 - \underset{j}{\max}\ \hat{p_s}(j) </math><br /><br />
:Gini index: <math>\sum_{j \neq i} \hat{p_s}(j)\hat{p_s}(i) = 1 - \sum_{j} \hat{p_s}(j)^2</math><br />
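These impurity measures can be computed directly; the sketch below (helper names are our own) scores a candidate split on one covariate by the weighted impurity of the two resulting regions:<br />

```python
def proportions(ys):
    """Class proportions within one region (the p-hat_s(j) above)."""
    n = len(ys)
    return {j: sum(1 for y in ys if y == j) / n for j in set(ys)}

def misclassification(ys):
    # 1 - max_j p(j): the error if the node predicts its majority class
    return 1 - max(proportions(ys).values())

def gini(ys):
    # sum over j != i of p(j) p(i), computed via the equivalent 1 - sum_j p(j)^2
    return 1 - sum(v * v for v in proportions(ys).values())

def split_impurity(xs, ys, t, impurity):
    """Weighted impurity of the split R1 = (-inf, t], R2 = (t, inf)."""
    left = [y for x, y in zip(xs, ys) if x <= t]
    right = [y for x, y in zip(xs, ys) if x > t]
    n = len(ys)
    return len(left) / n * impurity(left) + len(right) / n * impurity(right)
```

Fitting a split then amounts to minimizing <code>split_impurity</code> over candidate thresholds t.<br />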
<br />
'''Limitations of Decision Trees'''<br />
<br />
1. Overfitting problem:<br />
Decision trees are extremely flexible models; this flexibility means that they can easily match any training set perfectly, which makes overfitting a prime consideration when training a decision tree. There is no robust way to avoid fitting noise in the data, but two common approaches include:<br />
<br />
* stop growing the tree early, before it fits the training set perfectly<br />
* fully grow the tree and then prune the resulting tree. Pruning algorithms include cost complexity pruning, minimum description length pruning and pessimistic pruning. This results in a tree with less branches, which can generalize better. <ref>J. R. Quinlan, Decision Trees and Decision Making, IEEE Transactions on Systems, Man and Cybernetics, vol 20, no 2, March/April 1990, pg 339-346.</ref><br />
<br />
<br />
2. Time-consuming and complex: <br />
Compared to other decision-making models, decision trees are relatively easy to use; however, if a tree contains a large number of branches, it becomes complex and time-consuming to solve. <br />
Moreover, decision trees only examine a single field at a time, which leads to rectangular classification boxes, and their complexity adds to the cost of training people to carry out the analysis. <ref><br />
http://www.brighthub.com/office/project-management/articles/106005.aspx<br />
</ref><br />
<br />
<br />
Some specific decision-tree algorithms:<br />
* ID3 algorithm [http://en.wikipedia.org/wiki/ID3_algorithm]<br />
* C4.5 algorithm [http://en.wikipedia.org/wiki/C4.5_algorithm]<br />
* C5 algorithm<br />
<br />
A comparison of bagging and boosting methods using the decision trees classifiers: [http://www.doiserbia.nb.rs/img/doi/1820-0214/2006/1820-02140602057M.pdf]<br />
<br />
=== CART (Classification and Regression Tree)===<br />
<br />
The '''Classification and Regression Tree (CART)''' is a non-parametric decision tree learning technique that produces either classification or regression trees, depending on whether the dependent variable is categorical or numeric, respectively. (Wikipedia) CART handles outliers well during the fitting process: it tends to isolate them in separate nodes.<br />
<br />
Advantages<ref>http://www.statsoft.com/textbook/classification-and-regression-trees/</ref>:<br />
* '''Simplicity of results'''. In most cases the results are summarized in a very simple tree. This is important for fast classification and for creating a simple model for explaining the observations.<br />
* '''Tree methods are nonparametric and nonlinear'''. There is no implicit assumption that the underlying relationships between the predictor variables and the dependent variable is linear or monotonic. Thus tree methods are well suited to data mining tasks where there is little a priori knowledge of any related variables.<br />
<br />
===Advantages and Disadvantages===<br />
<br />
Decision Tree Advantages <br />
<br />
1. Easy to understand <br />
<br />
2. Map nicely to a set of business rules <br />
<br />
3. Applied to real problems <br />
<br />
4. Make no prior assumptions about the data <br />
<br />
5. Able to process both numerical and categorical data <br />
<br />
Decision Tree Disadvantages <br />
<br />
1. Output attribute must be categorical <br />
<br />
2. Limited to one output attribute <br />
<br />
3. Decision tree algorithms are unstable <br />
<br />
4. Trees created from numeric datasets can be complex<br />
<br />
Read more: http://wiki.answers.com/Q/List_the_advantages_and_disadvantages_for_both_decision_table_and_decision_tree#ixzz1dNGFaOpi<br />
<br />
===Ranking Features===<br />
In implementing a tree model, it is important how the features are ranked (i.e., in what order the features appear in the tree). The general approach is to choose the feature with the highest dependence on Y as the first feature in the tree, and then to use features with lower dependence further down the tree.<br />
<br />
'''Feature ranking strategies'''<br />
<br />
1. Fisher score (F-score)<br />
* simple in nature<br />
* efficient in measuring the discrimination between a feature and the label.<br />
* independent of the classifier.<br />
<br />
2. Linear SVM Weight<br />
<br />
The following is an algorithm based on linear SVM weights:<br />
<br />
* Input the training set: <math>(x_i, y_i), i = 1, \dots, l</math> <br />
* Obtain the sorted feature ranking list as output:<br />
** Use grid search to find the best parameter C. <br />
** Train an <math>L_2</math>-loss linear SVM model using the best C.<br />
** Sort the features according to the absolute values of their weights.<br />
<br />
3. Change of AUC with/without Removing Each Feature<br />
<br />
4. Change of Accuracy with/without Removing Each Feature<br />
<br />
5. Normalized [http://en.wikipedia.org/wiki/Information_gain Information Gain] (difference in entropy)<br />
<br />
note: for details, please read <ref><br />
http://jmlr.csail.mit.edu/proceedings/papers/v3/chang08a/chang08a.pdf<br />
</ref><br />
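As an illustration of strategy 1, one common two-class form of the Fisher score can be computed as below (a sketch with hypothetical names; see the reference above for the exact definition used in that paper):<br />

```python
def fisher_score(values, labels):
    """One common two-class F-score: (mu+ - mu-)^2 / (var+ + var-)."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == -1]
    def mean(xs):
        return sum(xs) / len(xs)
    def var(xs):
        m = mean(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    return (mean(pos) - mean(neg)) ** 2 / (var(pos) + var(neg))

def rank_features(feature_columns, labels):
    """Sort feature indices by decreasing F-score (classifier-independent)."""
    scores = [fisher_score(col, labels) for col in feature_columns]
    return sorted(range(len(scores)), key=lambda i: -scores[i])
```

A feature whose class means are far apart relative to its within-class spread ranks first; a pure-noise feature scores near zero.<br />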
<br />
===Random Forest=== <br />
Decision trees are unstable. One application of bagging is to combine trees into a random forest. A random forest is a classifier consisting of a collection of tree-structured classifiers <math>\left \lbrace \ h(x, \Theta_k ), k = 1, . . . \right \rbrace</math> where the <math>{\Theta_k } </math> are independent identically distributed random vectors and each tree casts a unit vote for the most popular class at input <math>x</math> <ref>Breiman L., Random Forests, ''Machine Learning'' [http://www.springerlink.com/content/u0p06167n6173512/fulltext.pdf]</ref>.<br />
<br />
In a random forest, the trees are grown quite similarly to the standard classification tree. However, no pruning is done in the random forest technique. <br />
<br />
Compared with other methods, random forests have some positive characteristics:<br />
<br />
* runs faster than bagging or boosting<br />
* has similar accuracy as Adaboost, and sometimes even better than Adaboost<br />
* relatively robust to noise<br />
* delivers useful estimates of error, correlation<br />
<br />
For larger data sets, more accuracy can be obtained by combining random features with boosting.<br />
<br />
'''This is how a single tree is grown:'''<br />
<br />
First, suppose the number of elements in the training set is K. We then sample K elements with replacement. <br />
Second, if there are a total of N inputs to the tree, choose an integer n << N such that at each node of the tree, n variables are randomly selected from the N inputs. The best split on these n variables is used to make the node's decision (hence a "decision tree"). <br />
Third, grow the tree as large as possible. <br />
<br />
Each tree contributes one classification. That is, each tree gets one "vote" to classify an element. The beauty of random forest is that all of these votes are added up, similar to boosting, and the final decision is the result of the vote. This is an extremely robust algorithm. <br />
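A minimal sketch of this grow-and-vote procedure in Python. For brevity each "tree" here is a depth-one stump rather than a tree grown as large as possible, and all names are our own; the bootstrap sampling, the random selection of n of the N features at each node, and the one vote per tree follow the description above:<br />

```python
import random
from collections import Counter

def majority(ys):
    return Counter(ys).most_common(1)[0][0]

def grow_stump(X, y, n_feats, rng):
    """Grow a depth-1 tree: best split among n randomly selected features."""
    N = len(X[0])                                   # total number of inputs
    best = None
    for f in rng.sample(range(N), n_feats):
        for t in sorted(set(row[f] for row in X)):
            left = [yi for row, yi in zip(X, y) if row[f] <= t]
            right = [yi for row, yi in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            # misclassifications if each side predicts its majority class
            err = (len(left) - Counter(left).most_common(1)[0][1] +
                   len(right) - Counter(right).most_common(1)[0][1])
            if best is None or err < best[0]:
                best = (err, f, t, majority(left), majority(right))
    if best is None:                                # degenerate bootstrap sample
        c = majority(y)
        return lambda row: c
    _, f, t, lmaj, rmaj = best
    return lambda row: lmaj if row[f] <= t else rmaj

def random_forest(X, y, n_trees, n_feats, seed=0):
    rng = random.Random(seed)
    K = len(X)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(K) for _ in range(K)]  # K elements with replacement
        forest.append(grow_stump([X[i] for i in idx],
                                 [y[i] for i in idx], n_feats, rng))
    return forest

def forest_predict(forest, row):
    """Each tree casts one unit vote; the most popular class wins."""
    return Counter(tree(row) for tree in forest).most_common(1)[0][0]
```

The parameter <code>n_feats</code> plays the role of the tuned n above: too small and the trees classify poorly, too large and they become correlated.<br />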
<br />
There are two things that can contribute to error in random forest: <br />
<br />
1. correlation between trees <br />
2. the ability of an individual tree to classify well. <br />
<br />
This is seen intuitively, since if many trees are very similar to one another, then it is likely they will all classify the elements in the same way. If a single tree is not a very good classifier, it does not matter in the long run because the other trees will compensate for its error. However, if many trees are bad classifiers, the result will be garbage.<br />
<br />
To avoid both of the above problems, there is an algorithm to optimize n, the number of variables to use in each decision tree. Unfortunately, an optimal value is not found on its own; instead, an optimal range is found. Thus, to properly program a random forest, there is a parameter that must be "tuned". Looking at various types of error rate, this is easily found (we want to minimize error, as characterized by the Gini index, or the misclassification rate, or the entropy). [http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#intro]<br />
<br />
An algorithm for the random forest can be described as follows: let <math>N_{trees}</math> be the number of trees to build. For each of the <math>N_{trees}</math> iterations, select a new bootstrap sample from the training set and grow an un-pruned tree on this bootstrap sample; at each internal node, randomly select m predictors and determine the best split using only these predictors. Finally, do not perform cost-complexity pruning; save each tree as is, alongside those built thus far. <ref><br />
Albert A. Montillo,Guest lecture: Statistical Foundations of Data Analysis "Random Forests", April,2009. <http://www.dabi.temple.edu/~hbling/8590.002/Montillo_RandomForests_4-2-2009.pdf><br />
</ref><br />
<br />
===Further Reading===<br />
<br />
Boosting: <ref>Chunhua Shen; Zhihui Hao. “A direct formulation for totally-corrective multi-class boosting”. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2011.</ref><br />
<br />
Bagging: <ref>Xiaoyuan Su; Khoshgoftarr, T.M.; Xingquan Zhu. “VoB predictors: Voting on bagging classifications”. 19th IEEE International Conference on Pattern Recognition. 2008.</ref><br />
<br />
Decision Tree: <ref> Zhuowen Tu. “Probabilistic boosting-tree: learning discriminative models for classification, recognition, and clustering”. Tenth IEEE International Conference on Computer Vision. 2005.</ref><br />
<br />
== Graphical Models ==<br />
<br />
A graphical model is a probabilistic model for which a graph denotes the conditional independence structure between random variables. They are commonly used in probability theory, statistics (particularly Bayesian statistics) and machine learning (Wikipedia).<br />
<br />
Graphical models provide a compact representation of the joint distribution, where the vertices (nodes) V represent random variables and the edges E represent dependencies between the variables. There are two forms of graphical models (directed and undirected). Directed graphical models consist of arcs and nodes, where an arc indicates that the parent is an explanatory variable for the child. Undirected graphical models are based on the assumption that two nodes, or two sets of nodes, are conditionally independent given their neighbours[http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html 1].<br />
<br />
Similar types of analysis predate the area of probabilistic graphical models and its terminology. Bayesian network and belief network are earlier terms used to describe a directed acyclic graphical model. Similarly, Markov random field (MRF) and Markov network are earlier terms used to describe an undirected graphical model. Probabilistic graphical models have unified some of the theory behind these older methods and allow for more general distributions than were previously possible.<br />
<br />
[[File:directed.png|thumb|right|Fig.1 A directed graph.]]<br />
[[File:undirected.png|thumb|right|Fig.2 An undirected graph.]]<br />
<br />
In the case of directed graphs, the direction of the arrow indicates "causation". This assumption makes these networks useful for cases where we want to model causality, so they are well suited to applications such as computational biology and bioinformatics, where we study the effect of some variables on another variable. For example:<br />
<br /><br />
<math>A \longrightarrow B</math>: <math>A\,\!</math> "causes" <math>B\,\!</math>.<br />
<br />
<br />
<br />
{| class="wikitable"<br />
|-<br />
! Y<br />
! Y<br />
|-<br />
| <math>\downarrow</math><br />
| <math>\uparrow</math><br />
|-<br />
| Generative LDA<br />
| Linear Discrimination<br />
|}<br />
<br />
Probabilistic ''Discriminative'' Models: Model posterior probability P(Y|X) directly (example: LDA).<br />
<br />
Advantages of discriminative models<br />
* Obtain desired posterior probability directly<br />
* Less parameters<br />
<br />
''Generative'' Model: Compute posterior probabilities using Bayes' rule, from class-conditional densities and class priors. <ref>https://liqiangguo.wordpress.com/2011/05/26/discriminative-model-vs-generative-model/</ref><br />
<br />
Advantages of generative models:<br />
*Can generate new points<br />
*Can sample a new point<br />
<br />
For an introduction to graphical models, see [http://www.cs.ubc.ca/~murphyk/Papers/intro_gm.pdf]<br />
<br />
=Boltzmann Machines=<br />
<br />
==Introduction==<br />
<br />
[[Image:GBMRBM.jpg|thumb|200px|right|Reference: [2]]]<br />
<br />
Boltzmann machines are networks of connected nodes which, using a stochastic decision-making process, decide to be on or off. The connections need not be directed; that is, signals can go back and forth between layers. This formulation immediately suggests a Bernoulli distribution, with some probability p of each node being on. In a classification problem, a Boltzmann machine is presented with a set of binary vectors, each entry of a vector called a "unit", with the goal of learning to generate these vectors. [1]<br />
<br />
Similar to the neural networks already discussed in class, a Boltzmann Machine must assign weights to inputs, compute some combination of the weights times contributing node values, and optimize the weights such that a certain cost function (such as the relative entropy, as discussed later) is minimized. The cost function depends on the complexity of the model and the “correctness” of the classification. The main idea is to make small updates in the connection weights iteratively.<br />
<br />
Boltzmann Machines are often used in generative models. That is, we start with some process seen in real life and try to reproduce it, with a goal of predicting future behaviour of the system by generating from the probability distribution created by the Boltzmann Machine.<br />
<br />
==How a Boltzmann Machine Works==<br />
<br />
Suppose we start with a pattern <math>\gamma</math> that represents some real-life dynamical system. The true probability distribution function of this system is <math>f_\gamma</math>. For each element in the vectors associated with this system, we create a visible unit in the Boltzmann machine whose value is directly related to the value of that element. Then, usually, to capture higher-order regularities in the pattern, we create hidden units (similar to feed-forward neural networks). Sometimes researchers choose not to use hidden units, but this leads to an inability to learn higher-order regularities [5]. There are two possible values for each node in the Boltzmann machine: "on" or "off". There is a difference in energy between these states, and each node must compute this difference to see which state would be more favourable. This difference is called the "energy gap". <br />
<br />
Each node of the Boltzmann Machine is presented an opportunity to update its status. When a set of input vectors is shown to the layer, a computation takes place within each node to decide to convert to “on” or to remain “off”. The computation is as follows:<br />
<br />
<math> \Delta E_i = E_{-1} - E_{+1} = \sum_j w_{ij}S_j </math> <br />
<br />
Where <math> w_{ij} </math> represents the weight between nodes i and j, and <math> S_j </math> is the state of the jth component. <br />
<br />
Then the probability that the node will adopt the “on” state is:<br />
<br />
<math> P(+1) = \frac{1}{1 + \exp(-\Delta E_i / T)} </math><br />
<br />
where T is the temperature of the system. At equilibrium, the probability of any state vector v follows the Boltzmann distribution: it is proportional to <math>e^{-E(v)}</math>, normalized over all possible state vectors,<br />
<br />
<math> P(v) = \frac{e^{-E(v)}}{\sum_u e^{-E(u)}} </math><br />
<br />
And the energy of a vector is defined as:<br />
<br />
<math> E({v}) = -\sum_i s^{v}_i b_i -\sum_{i<j} s^{v}_i s^{v}_j w_{ij} </math> [1]<br />
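The energy and the two probability formulas above can be checked numerically on a tiny machine by enumerating all state vectors (a sketch with our own function names; <math>b_i</math> are the biases and <math>w_{ij}</math> the connection weights, stored as an upper-triangular matrix):<br />

```python
import math
from itertools import product

def energy(s, b, w):
    """E(v) = -sum_i s_i b_i - sum_{i<j} s_i s_j w_ij, states s_i in {-1, +1}."""
    n = len(s)
    e = -sum(s[i] * b[i] for i in range(n))
    e -= sum(s[i] * s[j] * w[i][j] for i in range(n) for j in range(i + 1, n))
    return e

def boltzmann_distribution(b, w, T=1.0):
    """P(v) = exp(-E(v)/T) / Z over all 2^n state vectors."""
    states = list(product([-1, 1], repeat=len(b)))
    weights = [math.exp(-energy(list(s), b, w) / T) for s in states]
    Z = sum(weights)                     # the partition function
    return {s: wt / Z for s, wt in zip(states, weights)}

def p_on(i, s, b, w, T=1.0):
    """Probability unit i adopts "on" given the rest: 1 / (1 + exp(-dE_i / T)),
    where dE_i = E(s_i = -1) - E(s_i = +1) is the energy gap."""
    s_on = list(s); s_on[i] = 1
    s_off = list(s); s_off[i] = -1
    dE = energy(s_off, b, w) - energy(s_on, b, w)
    return 1.0 / (1.0 + math.exp(-dE / T))
```

For more than a handful of units this enumeration is intractable, which is exactly why the machine instead samples unit by unit with <code>p_on</code> until it reaches equilibrium.<br />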
<br />
Simulated annealing, a method to improve the search for a global minimum, is used here; it may not succeed in finding the global minimum on its own [3]. This may be a foreign concept to statisticians; for more information, consult [6] and [7]. Each node then adopts the "on" state with the probability given by the logistic function above, so changes that decrease the energy are favoured. <br />
<br />
Eventually, through learning, the Boltzmann Machine will reach an equilibrium state, much like a Markov Chain. This equilibrium state will have a low temperature. Once equilibrium has been reached, we can estimate the probability distribution across the nodes of the Boltzmann Machine. Using this information, we can model how the dynamical system will behave in the long run. <br />
<br />
Since the system is in equilibrium, we can use the mean value of each visible unit to build a probability model. We wouldn’t want to do these calculations before reaching equilibrium, because they would not be representative of the long-term behaviour of the system. Let this measured distribution be denoted <math>f_\delta</math>. Then we are interested in measuring the difference between the true distribution and this measured distribution. <br />
<br />
There are several different methods that can be used to compare distributions. One that is commonly used is the relative entropy:<br />
<br />
<math> G(f_\gamma ||f_\delta) = \sum_v f_\gamma(v) \ln\left(\frac{f_\gamma(v)}{f_\delta(v)}\right) </math> [5]<br />
<br />
We want to minimize this distance, since we want the measured distribution to be as close as possible to the true distribution.<br />
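This relative entropy (the KL divergence) is a one-line computation when the distributions are represented as dictionaries over states (names here are illustrative):<br />

```python
import math

def relative_entropy(f, g):
    """G(f || g) = sum_v f(v) ln(f(v) / g(v)); zero iff the distributions agree."""
    return sum(p * math.log(p / g[v]) for v, p in f.items() if p > 0)
```

It is zero when the measured distribution matches the true one and strictly positive otherwise, which is what makes it a usable training objective.<br />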
<br />
==Learning in Boltzmann Machines==<br />
<br />
The Two-Phase Method<br />
<br />
Boltzmann machines with hidden units are very robust tools. With visible units alone, the machine can only model pairwise couplings between them, which is a problem when trying to capture the effects of higher-dimensional regularities. When hidden units are introduced, the system gains the ability to define and use these regularities.<br />
<br />
One approach to learning Boltzmann Machines is discussed thoroughly in [5]. To summarize, this approach makes use of two phases. <br />
<br />
Phase 1: Fix all visible units. Allow the hidden units to change as necessary to obtain equilibrium. Then, look at pairs of units. If two elements of a pair are both “on”, then increment the weight associated with them. So this phase consists entirely of “learning”. There is no control for spurious data.<br />
<br />
Phase 2: No units are fixed. Allow all units to change as necessary to obtain equilibrium. Then sample the final equilibrium distribution to find reliable averages of the term <math>s_i s_j</math>. Then, as before, look for pairs of units that are both "on", and decrement the weight associated with them. So this is the phase in which spurious data are eliminated.<br />
<br />
Alternate between these two phases. Eventually, the equilibrium distribution will be reached and we see that <math> \frac{\partial {G}}{\partial {w_{ij}}} = \frac{-1}{T} (\langle s_i s_j \rangle^{+} - \langle s_i s_j \rangle^{-}) </math>, where <math> \langle s_i s_j \rangle^{+} </math> and <math> \langle s_i s_j \rangle^{-} </math> are the probabilities of finding units i and j both “on” when the network is clamped and free-running, respectively [5]. <br />
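The resulting weight update can be sketched in a few lines; the learning rate and the co-activation estimates below are illustrative values, not from the source:

```python
# Minimal sketch of the two-phase weight update.
epsilon = 0.1  # learning rate (illustrative)

def update_weight(w_ij, p_plus, p_minus):
    """Gradient step on G: since dG/dw_ij = -(1/T)(<s_i s_j>+ - <s_i s_j>-),
    moving against the gradient increments w_ij when the clamped ("learning")
    phase co-activation exceeds the free-running ("unlearning") one."""
    return w_ij + epsilon * (p_plus - p_minus)

# Phase 1 estimate: units i, j both "on" 70% of the time with data clamped.
# Phase 2 estimate: both "on" 40% of the time free-running.
print(update_weight(0.0, 0.7, 0.4))  # weight increases
```

When the two phase statistics agree, the update vanishes, which is exactly the stationarity condition above.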
Another method, for learning Deep Boltzmann Machines, is presented in [2].<br />
<br />
==Pros and Cons of using Boltzmann Machines==<br />
<br />
Pros<br />
<br />
* More accurate than backpropagation [5]<br />
* Bayesian interpretation of how good a model is [5]<br />
<br />
Cons<br />
<br />
* Very slow, because of nested loops necessary to perform phases [5]<br />
<br />
There are many topics on which this discussion could be expanded. For example, we could get into a more in-depth discussion of simulated annealing, or look at Restricted Boltzmann Machines (RBMs) for deep learning, or different methods of learning and different measures of error. Another interesting topic would be a discussion on mean field approximation of Boltzmann Machines, which supposedly runs faster.<br />
<br />
*The time the machine must be run in order to collect equilibrium statistics grows exponentially with the machine's size, and with the magnitude of the connection strengths<br />
*Connection strengths are more plastic when the units being connected have activation probabilities intermediate between zero and one, leading to a so-called variance trap. The net effect is that noise causes the connection strengths to random walk until the activities saturate.<br />
References: <br/><br />
[1] http://www.scholarpedia.org/article/Boltzmann_machine <br/><br />
[2] http://www.mit.edu/~rsalakhu/papers/dbm.pdf <br/><br />
[3] http://mathworld.wolfram.com/SimulatedAnnealing.html <br/><br />
[4] http://waldron.stanford.edu/~jlm/papers/PDP/Volume%201/Chap7_PDP86.pdf <br/><br />
[5] http://cs.nyu.edu/~roweis/notes/boltz.pdf <br/><br />
[6] http://neuron.eng.wayne.edu/tarek/MITbook/chap8/8_3.html <br/><br />
[7] Bertsimas and Tsitsiklis. Simulated Annealing. Statistical Science. 1993. Vol. 8, No. 1, 10 – 15. <br/><br />
<br />
==References==<br />
<references /></div>

Summary for survey of neural networked-based cancer prediction models from microarray data (2020-11-17, Gtompkin: /* Conclusion */)
<hr />
<div>== Presented by == <br />
Rao Fu, Siqi Li, Yuqin Fang, Zeping Zhou<br />
<br />
== Introduction == <br />
Microarray technology is widely used in analyzing genetic diseases since it helps researchers detect genetic information rapidly. In the study of cancer, researchers use this technology to compare normal and abnormal cancerous tissues in order to better understand the pathology of cancer. However, the high dimensionality of the gene expressions can degrade both the accuracy and the computation time of such a cancer model. To cope with this problem, we need to use feature selection or feature creation methods. <br />
Neural networks are among the most powerful methods in machine learning. In this paper, we review the latest neural network-based cancer prediction models by presenting the methodology of preprocessing, filtering, prediction, and clustering gene expressions. <br />
<br />
== Background == <br />
<br />
'''Neural Network''' <br><br />
Neural networks are often used to solve complex non-linear problems. A neural network is an operational model consisting of a large number of neurons connected to each other by weighted edges. In this network structure, each neuron is associated with an activation function. The difference between the output of the neural network and the desired output is called the error.<br />
The backpropagation mechanism is one of the most commonly used algorithms for training neural networks. This algorithm optimizes the objective function by propagating the generated error back through the network to adjust the weights.<br />
In the next sections, we use this algorithm, but with different network architectures and different numbers of neurons, to review the neural network-based cancer prediction models for learning gene expression features.<br />
<br />
'''Cancer prediction models'''<br><br />
High dimensionality and spatial structure are the two main factors that can affect the accuracy of cancer prediction models, since they add irrelevant noisy features to the selected models. There are three common ways to assess the accuracy of a model.<br />
The first is the ROC curve. It reflects the sensitivity of the response to the same signal stimulus under different criteria. To test its validity, we need to consider it together with its confidence interval. Usually, a model is considered good when the area under its ROC curve is greater than 0.7. Another way to measure the performance of a model is the concordance index (CI), which expresses the concordance probability between the predicted and observed survival. The closer its value is to 0.7, the better the model is. The third measurement method is the Brier score, which measures the average squared difference between the observed and estimated survival rates in a given period of time. It ranges from 0 to 1, and a lower score indicates higher accuracy.<br />
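As a concrete illustration of the third measure, here is a minimal sketch of the Brier score on hypothetical survival predictions (all numbers are made up):

```python
def brier_score(predicted, observed):
    """Mean squared difference between predicted survival probabilities and
    observed outcomes (1 = survived, 0 = did not); lower is more accurate."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(predicted)

# Hypothetical predictions for five patients (illustrative values only):
print(brier_score([0.9, 0.8, 0.3, 0.6, 0.1], [1, 1, 0, 1, 0]))  # ≈ 0.062
```

A model that predicted every outcome with probability 1 or 0 correctly would score 0; uninformative predictions of 0.5 everywhere score 0.25.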
<br />
== Neural network-based cancer prediction models ==<br />
An extensive search relevant to neural network-based cancer prediction was performed using Google Scholar and other electronic databases, namely PubMed and Scopus, with keywords such as “Neural Networks AND Cancer Prediction” and “gene expression clustering”. The chosen papers cover cancer classification, discovery, survivability prediction, and statistical analysis models. Figure 1 shows the number of citations for the chosen papers across the filtering, predictive, and clustering categories. [[File:f1.png]]<br />
<br />
'''Datasets and preprocessing''' <br><br />
Most studies investigating automatic cancer prediction and clustering used datasets such as the TCGA, UCI, NCBI Gene Expression Omnibus, and Kentridge biomedical databases. A few techniques are used in processing these datasets, including removing genes that have zero expression across all samples, normalization, filtering with p-value > 10^-05 to remove unwanted technical variation, and log2 transformations. Statistical methods and neural networks were applied to reduce the dimensionality of the gene expressions by selecting a subset of genes. Principal Component Analysis (PCA) can also be used as an initial preprocessing step to extract the datasets' features. The PCA method linearly transforms the dataset features into a lower-dimensional space without capturing the complex relationships between the features. However, simply removing the genes that were not measured by the other datasets could not overcome the class imbalance problem. In that case, one study used the Synthetic Minority Class Over Sampling method to generate synthetic minority class samples, which may lead to a sparse matrix problem. Clustering was also applied in some studies to label data by grouping the samples into high-risk, low-risk, and other groups. <br />
<br />
The following table presents the dataset used by each considered reference, the applied normalization technique, the cancer type, and the dimensionality of the datasets.<br />
[[File:Datasets and preprocessing.png]]<br />
<br />
'''Neural network architecture''' <br><br />
Most recent studies reveal that filtering, predicting, and clustering methods are used in cancer prediction. For filtering, the resulting features are used with statistical methods or with machine learning classification and clustering tools such as decision trees, K Nearest Neighbor, and Self Organizing Maps (SOM), as figure 2 indicates.[[File:filtering gane.png]]<br />
<br />
All of a neural network’s neurons work as feature detectors that learn the input’s features. Our categorization into filtering, predicting, and clustering methods is based on the overall role that the neural network performs in the cancer prediction method. Filtering methods are trained to remove the input’s noise and to extract the most representative features that best describe the unlabeled gene expressions. Predicting methods are trained to extract the features that are significant for prediction, so their objective functions measure how accurately the network is able to predict the class of an input. Clustering methods are trained to divide unlabeled samples into groups based on their similarities.<br />
<br />
'''Building neural networks-based approaches for gene expression prediction''' <br><br />
According to the survey, representative codes are generated by filtering methods with dimensionality M smaller than or equal to N, where N is the dimensionality of the input. Other machine learning algorithms, such as naïve Bayes or k-means, can be used together with the filtering.<br />
Predictive neural networks are supervised and seek the best classification accuracy; clustering methods are unsupervised and group similar samples or genes together. <br />
The goal of training a prediction method is to enhance the classification capability, and the goal of training a clustering method is to find the optimal group for a new test set with unknown labels.<br />
<br />
'''Neural network filters for cancer prediction''' <br><br />
As a preprocessing step before classification, clustering, and statistical analysis, autoencoders are increasingly used to extract generic genomic features. An autoencoder is composed of an encoder part and a decoder part. The encoder learns the mapping between the high-dimensional unlabeled input I(x) and the low-dimensional representation in the middle layer(s), and the decoder learns the mapping from the middle layer’s representation to the high-dimensional output O(x). The reconstruction of the input can take the Root Mean Squared Error (RMSE) or the Logloss function as its objective function. <br />
<br />
$$ RMSE = \sqrt{ \frac{\sum{(I(x)-O(x))^2}}{n} } $$<br />
<br />
$$ Logloss = -\sum{(I(x)log(O(x)) + (1 - I(x))log(1 - O(x)))} $$<br />
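As a quick illustration, the two reconstruction objectives can be computed directly; the vectors below are made up, and note that the cross-entropy term is conventionally negated so that, like RMSE, it is a quantity to minimize:

```python
import math

def rmse(inputs, outputs):
    # Root mean squared reconstruction error between input I(x) and output O(x).
    return math.sqrt(sum((i - o) ** 2 for i, o in zip(inputs, outputs)) / len(inputs))

def logloss(inputs, outputs):
    # Negated cross-entropy; assumes inputs in {0, 1} and outputs in (0, 1).
    return -sum(i * math.log(o) + (1 - i) * math.log(1 - o)
                for i, o in zip(inputs, outputs))

x = [1.0, 0.0, 1.0]        # illustrative binary input vector
x_hat = [0.9, 0.2, 0.8]    # illustrative reconstruction
print(rmse(x, x_hat))      # ≈ 0.173
print(logloss(x, x_hat))   # ≈ 0.552
```

Both quantities shrink toward 0 as the reconstruction x_hat approaches the input x.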
<br />
There are several types of autoencoders, such as stacked denoising autoencoders, contractive autoencoders, sparse autoencoders, regularized autoencoders, and variational autoencoders. The architecture of these networks varies in many parameters, such as depth and loss function. Each type of autoencoder mentioned above can have a different number of hidden layers, different activation functions (e.g. the sigmoid function, the exponential linear unit), and different optimization algorithms (e.g. stochastic gradient descent, the Adam optimizer).<br />
<br />
The neural network filtering methods were used with different statistical methods and classifiers. The statistical methods include Cox regression model analysis, Support Vector Machine (SVM), K-means clustering, t-SNE, and so on. The classifiers could be SVM, AdaBoost, or others.<br />
<br />
By using neural network filtering methods, the model can be trained to learn low-dimensional representations, remove noises from the input, and gain better generalization performance by re-training the classifier with the newest output layer.<br />
<br />
'''Neural network prediction methods for cancer''' <br><br />
Prediction based on neural networks builds a network that maps the input features to an output with a number of neurons, which could be one or two for binary classification, or more for multi-class classification. One can also build several independent binary neural networks for multi-class classification, where a technique called “one-hot encoding” is applied.<br />
<br />
The codeword is a binary string <math>C_k</math> of length k whose j’th position is set to 1 for the j’th class, while all other positions remain 0. The neural network is trained to map the input to the codeword iteratively, with the objective function minimized in each iteration.<br />
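A minimal sketch of constructing such a codeword (0-indexed positions and the class count are illustrative):

```python
def codeword(j, k):
    """Binary string of length k with the j-th position (0-indexed) set to 1."""
    return [1 if pos == j else 0 for pos in range(k)]

# Illustrative: four cancer classes, class index 2.
print(codeword(2, 4))  # [0, 0, 1, 0]
```

Each class then corresponds to a distinct target vector, and one output neuron (or one binary network) is responsible for each position.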
<br />
Such cancer classifiers have been applied to identifying cancerous/non-cancerous samples, a specific cancer type, or the survivability risk. MLP models were used to predict the survival risk of lung cancer patients with several gene expressions as input. The deep generative model DeepCancer, the RBM-SVM and RBM-logistic regression models, the convolutional feedforward model DeepGene, Extreme Learning Machines (ELM), the one-dimensional convolutional framework SE1DCNN, and the GA-ANN model have all been used for the cancer problems mentioned above. The paper indicates that the performance of neural networks with an MLP architecture as classifier is better than that of SVM, logistic regression, naïve Bayes, classification trees, and KNN.<br />
<br />
'''Neural network clustering methods in cancer prediction''' <br><br />
Neural network clustering is a form of unsupervised learning: the input data are divided into groups according to their feature similarity.<br />
The single-layered neural network SOM, which is unsupervised and has no backpropagation mechanism, is one of the traditional model-based techniques applied to gene expression data. Its accuracy can be measured by the Rand Index (RI), which can be refined to the Adjusted Rand Index (ARI) and the Normalized Mutual Information (NMI).<br />
<br />
$$ RI=\frac{TP+TN}{TP+TN+FP+FN}$$<br />
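The pair-counting quantities in this formula can be computed directly; here is a minimal sketch on hypothetical cluster labels:

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Pair-counting Rand Index: a pair of samples is a 'positive' when the
    two samples share a cluster; TP/TN count pairs on which the two
    clusterings agree, FP/FN the pairs on which they disagree."""
    tp = tn = fp = fn = 0
    for a, b in combinations(range(len(labels_true)), 2):
        same_true = labels_true[a] == labels_true[b]
        same_pred = labels_pred[a] == labels_pred[b]
        if same_true and same_pred:
            tp += 1
        elif not same_true and not same_pred:
            tn += 1
        elif same_pred:
            fp += 1
        else:
            fn += 1
    return (tp + tn) / (tp + tn + fp + fn)

# Illustrative clusterings of six samples (hypothetical labels):
print(rand_index([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1]))  # 10/15 ≈ 0.667
```

Identical clusterings score 1; the ARI additionally corrects this score for chance agreement.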
<br />
In general, gene expression clustering considers the relevance of the sample-to-cluster assignment, of the gene-to-cluster assignment, or of both. To address the high dimensionality problem, there are two methods: clustering ensembles, which run a single clustering algorithm several times, each run with a different initialization or number of parameters; and projective clustering, which considers only a subset of the original features.<br />
<br />
SOM was applied to discriminating future tumor behavior using molecular alterations, results that were not easy to obtain with classic statistical models. The paper then introduces two ensemble clustering frameworks: Random Double Clustering-based Cluster Ensembles (RDCCE) and Random Double Clustering-based Fuzzy Cluster Ensembles (RDCFCE). Their accuracies are high, but they do not take gene-to-cluster assignment into consideration.<br />
<br />
The paper also covers the double SOM-based Clustering Ensemble Approach (SOM2CE) and the double NG-based Clustering Ensemble Approach (NG2CE), which are robust to noisy genes. Moreover, Projective Clustering Ensemble (PCE) combines the advantages of both projective clustering and ensemble clustering, and is better than SOM and RDCFCE when there are irrelevant genes.<br />
<br />
== Summary ==<br />
<br />
Cancer is a disease with a very high fatality rate that spreads worldwide, and analyzing gene expression is essential for discovering gene abnormalities and, as a consequence, increasing survivability. The preceding analysis reveals that neural networks are essentially used for filtering the gene expressions, predicting their class, or clustering them.<br />
<br />
Neural network filtering methods are used to reduce the dimensionality of the gene expressions and remove their noise. The authors recommend deep architectures over shallow architectures as best practice, since they combine many nonlinearities. <br />
<br />
Neural network prediction methods can be used for both binary and multi-class problems. In binary cases, the network architecture has only one or two output neurons that diagnose a given sample as cancerous or non-cancerous, while in multi-class problems the number of output neurons is equal to the number of classes. The authors suggest that a deep architecture with convolutional layers, the most recently used model, has proved efficient in predicting cancer subtypes, as it captures the spatial correlations between gene expressions.<br />
Clustering is another analysis tool used to divide the gene expressions into groups. The authors indicate that a hybrid approach combining both ensemble clustering and projective clustering is more accurate than a single-point clustering algorithm such as SOM, since such single-point methods cannot distinguish the noisy genes.<br />
<br />
==Discussion==<br />
There are some technical problems that can be considered and improved for building new models. <br><br />
<br />
1. Overfitting: Since gene expression datasets are high-dimensional and have relatively few samples, a model is likely to fit the training data well but perform poorly on test samples due to a lack of generalization capability. Ways to avoid overfitting include: (1) adding weight penalties using regularization; (2) averaging the predictions of many models trained on different datasets; and (3) dropout. <br><br />
<br />
2. Model configuration and training: To reduce computational and memory expense while retaining high prediction accuracy, it is crucial to set the network parameters properly. Possible approaches include: (1) proper initialization; (2) pruning unimportant connections by removing zero-valued neurons; (3) using an ensemble learning framework that trains different models with different parameter settings, or trains each base model on a different part of the dataset; and (4) using SMOTE to deal with class imbalance at the high-dimensional level. <br><br />
<br />
3. Model evaluation: Braga-Neto and Dougherty revealed in their research that cross-validation displays excessive variance and is therefore unreliable for small datasets. The bootstrap method proved to give more accurate predictions.<br><br />
<br />
4. Study reproducibility: A study needs to be reproducible to enhance research reliability, so that others can replicate the results using the same algorithms, data, and methodology. <br />
<br />
==Conclusion==<br />
This paper reviewed the most recent neural network-based cancer prediction models and gene expression analysis tools. The analysis indicates that the neural network methods are able to serve as filters, predictors, and clustering methods, and also showed that the role of the neural network determines its general architecture. To give suggestions for future neural network-based approaches, the authors highlighted some critical points that have to be considered such as overfitting and class imbalance, and suggest choosing different network parameters or combining two or more of the presented approaches. One of the biggest challenges for cancer prediction modelers is deciding on the network architecture (i.e. the number of hidden layers and neurons), as there are currently no guidelines to follow to obtain high prediction accuracy.<br />
<br />
==References==<br />
Daoud, M., & Mayo, M. (2019). A survey of neural network-based cancer prediction models from microarray data. Artificial Intelligence in Medicine, 97, 204–214.</div>

IPBoost (2020-11-17, Gtompkin: /* Conclusion */)
<hr />
<div>== Presented by == <br />
Casey De Vera, Solaiman Jawad<br />
<br />
== Introduction == <br />
Boosting is an important and by now standard technique in classification: it combines several “low accuracy” learners, so-called base learners, into a “high accuracy” learner, a so-called boosted learner. Pioneered by the AdaBoost approach of Freund & Schapire, boosting procedures and their limitations have been studied extensively in recent decades.<br />
<br />
In a nutshell, boosting procedures are (typically) iterative schemes that roughly work as follows:<br />
<br />
for <math> t= 1, \cdots, T </math> do the following:<br />
<br />
# Train a learner <math> \mu_t</math> from a given class of base learners on the data distribution <math> \mathcal D_t</math>.<br />
<br />
# Evaluate the performance of <math> \mu_t</math> by computing its loss.<br />
<br />
# Push the weight of the data distribution <math> \mathcal D_t</math> towards misclassified examples, leading to <math> \mathcal D_{t+1}</math>.<br />
<br />
Finally, the learners are combined by some form of voting (e.g., soft or hard voting, averaging, thresholding).<br />
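The template above can be made concrete with a minimal AdaBoost-style sketch; the 1-D data and the threshold-stump base learners below are illustrative, not from the paper:

```python
import math

# Minimal AdaBoost-style instance of the boosting template above.
X = [0.1, 0.2, 0.35, 0.6, 0.8, 0.9]   # illustrative 1-D data
y = [+1, +1, -1, +1, -1, -1]

def stump(theta, sign):
    # Threshold base learner: predicts sign if x > theta, else -sign.
    return lambda x: sign * (1 if x > theta else -1)

base_learners = [stump(t, s) for t in (0.15, 0.3, 0.5, 0.7) for s in (+1, -1)]

D = [1 / len(X)] * len(X)   # data distribution D_t
ensemble = []               # (weight, learner) pairs
for t in range(5):
    # 1. Train: pick the base learner with the lowest weighted error on D_t.
    err, h = min(((sum(d for d, xi, yi in zip(D, X, y) if h(xi) != yi), h)
                  for h in base_learners), key=lambda p: p[0])
    # 2. Evaluate: stop if the best learner is no better than chance.
    if err >= 0.5:
        break
    alpha = 0.5 * math.log((1 - err) / max(err, 1e-12))
    ensemble.append((alpha, h))
    # 3. Reweight: push distribution mass towards misclassified examples.
    D = [d * math.exp(-alpha * yi * h(xi)) for d, xi, yi in zip(D, X, y)]
    Z = sum(D)
    D = [d / Z for d in D]

def predict(x):
    # Combine the base learners by weighted (soft) voting.
    return 1 if sum(a * h(x) for a, h in ensemble) > 0 else -1

print([predict(x) for x in X])  # [1, 1, -1, 1, -1, -1]
```

No single stump fits this label pattern, but the weighted vote over three distinct stumps reproduces it exactly, which is the point of the procedure.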
<br />
<br />
[[File:boosting.gif|200px|thumb|right]] A close inspection of most boosting procedures reveals that they solve an underlying convex optimization problem over a convex loss function by means of coordinate gradient descent. Boosting schemes of this type are often referred to as '''convex potential boosters'''. These procedures can achieve exceptional performance on many data sets if the data is correctly labeled. In fact, in theory, provided the class of base learners is rich enough, a perfect strong learner with accuracy 1 can be constructed, though such a learner might not generalize well. Boosted learners can generate quite complicated decision boundaries, much more complicated than those of the base learners. Here is an example from Paul van der Laken’s blog (extreme gradient boosting gif by Ryan Holbrook): data is generated online according to some process whose optimal decision boundary is represented by the dotted line, and XGBoost was used to learn a classifier:<br />
<br />
<br />
Recently, non-convex optimization approaches to machine learning problems have gained significant attention. This paper explores non-convex boosting in classification by means of integer programming and demonstrates the real-world practicability of the approach while circumventing the shortcomings of convex boosting approaches. The paper reports results that are comparable to or better than the current state of the art.<br />
<br />
== Motivation ==<br />
<br />
In reality, we usually face unclean data and so-called label noise, where some percentage of the classification labels may be corrupted. We would like to construct strong learners for such data as well. However, if we revisit the general boosting template above, we might suspect trouble as soon as a certain fraction of training examples is misclassified: these examples cannot be correctly classified, and the procedure shifts more and more weight towards them. This eventually leads to a strong learner that perfectly predicts the (flawed) training data but no longer generalizes well. This intuition has been formalized by [LS], who construct a “hard” training data distribution in which a small percentage of labels is randomly flipped. This label noise leads to a significant reduction in the performance of these boosted learners; see the tables below. The more technical reason for this problem is the convexity of the loss function minimized by the boosting procedure. One can of course resort to all kinds of “tricks”, such as early stopping, but in the end these do not solve the fundamental problem.<br />
<br />
== IPBoost: Boosting via Integer Programming ==<br />
<br />
<br />
===Integer Program Formulation===<br />
Let <math>(x_1,y_1),\cdots, (x_N,y_N) </math> be the training set with points <math>x_i \in \mathbb{R}^d</math> and two-class labels <math>y_i \in \{\pm 1\}</math>. Let:<br />
* <math> \Omega :=\{h_1, \cdots, h_L: \mathbb{R}^d \rightarrow \{\pm 1\}\} </math> be the class of base learners and <math>\rho \ge 0</math> a given margin; <br />
* <math> \eta </math> be the error function, with entries <math> \eta_{ij} </math> encoding whether base learner <math> h_j </math> correctly classifies example <math> x_i </math>.<br />
Our boosting model is captured by the following integer programming problem, which we call the primal problem: <br />
<br />
$$ \begin{align*} \min &\sum_{i=1}^N z_i \\ s.t. &\sum_{j=1}^L \eta_{ij}\lambda_j+(1+\rho)z_i \ge \rho \ \ \ <br />
\forall i=1,\cdots, N \\ <br />
&\sum_{j=1}^L \lambda_j=1, \lambda \ge 0,\\ &z\in \{0,1\}^N. \end{align*}$$<br />
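On a tiny illustrative instance, this primal can even be solved by brute force; the η values, margin, and two-learner setup below are made up, and the λ-feasibility check exploits the fact that with two base learners, λ = (t, 1 − t) reduces each margin constraint to an interval for t:

```python
import itertools

# Brute-force illustration of the primal IP on a made-up instance:
# eta[i][j] = +1 if base learner j classifies point i correctly, else -1.
# Two base learners, four points, margin rho = 0.05 (all values hypothetical).
eta = [(+1, +1), (+1, +1), (+1, -1), (-1, +1)]
rho = 0.05

def feasible(z):
    # With lambda = (t, 1 - t) for t in [0, 1], each constraint
    # eta_i1*t + eta_i2*(1 - t) + (1 + rho)*z_i >= rho is linear in t;
    # intersect the resulting feasible intervals.
    lo, hi = 0.0, 1.0
    for (e1, e2), zi in zip(eta, z):
        a = e1 - e2                      # coefficient of t
        b = e2 + (1 + rho) * zi - rho    # need a*t + b >= 0
        if a > 0:
            lo = max(lo, -b / a)
        elif a < 0:
            hi = min(hi, -b / a)
        elif b < 0:
            return False
    return lo <= hi

# Minimize the number of margin violations sum(z_i) over all binary z.
best = min((z for z in itertools.product((0, 1), repeat=len(eta)) if feasible(z)),
           key=sum)
print(sum(best))  # 1: points 3 and 4 make contradictory demands on lambda
```

In the real algorithm this enumeration is of course replaced by branch-and-price, but the tiny case shows why some z_i must be 1: no single λ can satisfy all margin constraints at once.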
<br />
===Solution of the IP using Column Generation===<br />
<br />
The goal of column generation is to provide an efficient way to solve the linear programming relaxation of the primal, obtained by allowing the <math>z_i </math> variables to take fractional values. Moreover, columns (i.e., base learners) outside a working set <math> \mathcal L \subseteq [L] </math> are left out, because there are too many to handle efficiently and most of them will have their associated weight equal to zero in the optimal solution anyway. To check the optimality of an LP solution, a subproblem called the pricing problem is solved to try to identify columns with a profitable reduced cost. If such columns are found, the LP is reoptimized. Branching occurs when no profitable columns are found but the LP solution does not satisfy the integrality conditions. Branch-and-price applies column generation at every node of the branch-and-bound tree.<br />
<br />
The restricted master primal problem is <br />
<br />
$$ \begin{align*} \min &\sum_{i=1}^N z_i \\ s.t. &\sum_{j\in \mathcal L} \eta_{ij}\lambda_j+(1+\rho)z_i \ge \rho \ \ \ <br />
\forall i \in [N]\\ <br />
&\sum_{j\in \mathcal L}\lambda_j=1, \lambda \ge 0,\\ &z\in \{0,1\}^N. \end{align*}$$<br />
<br />
<br />
Its restricted dual problem is:<br />
<br />
$$ \begin{align*}\max \ \rho &\sum^{N}_{i=1}w_i + v - \sum^{N}_{i=1}u_i<br />
\\ s.t. &\sum_{i=1}^N \eta_{ij}w_i+ v \le 0 \ \ \ \forall j \in [L] \\ <br />
&(1+\rho)w_i - u_i \le 1 \ \ \ \forall i \in [N] \\ &w \ge 0, u \ge 0, v\ free\end{align*}$$<br />
<br />
Furthermore, the pricing problem is used to determine, for a supposed optimal solution of the dual, whether the solution is actually optimal or whether further columns (base learners) need to be added to the primal problem. A learner <math>j</math> should be added whenever<br />
<br />
$$ \sum_{i=1}^N \eta_{ij}w_i^* + v^* > 0 $$<br />
<br />
The optimal misclassification values are determined by a branch-and-price process that branches on the variables <math> z_i </math> and solves the intermediate LPs using column generation.<br />
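The pricing test itself is a one-line check once a dual solution is at hand; the dual values below are illustrative:

```python
# Sketch of the pricing check: given an optimal dual solution (w*, v*) of the
# restricted master LP, a left-out base learner j prices out (i.e., should be
# added as a new column) when sum_i eta_ij * w_i* + v* > 0.
def prices_out(eta_col, w, v):
    return sum(e * wi for e, wi in zip(eta_col, w)) + v > 0

w_star = [0.3, 0.0, 0.2, 0.1]   # dual weights on the margin constraints (made up)
v_star = -0.25                  # dual variable of the convexity constraint (made up)
print(prices_out([+1, -1, +1, +1], w_star, v_star))  # True: add this learner
```

When no learner prices out, the current restricted LP solution is optimal for the full LP, which is exactly the termination condition of the inner column-generation loop.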
<br />
===Algorithm===<br />
<div style="margin-left: 3em;"><br />
'''Input:''' <math> D = \{(x_i, y_i) | i ∈ I\} ⊆ R^d × \{±1\} </math>, class of base learners <math>Ω </math>, margin <math> \rho </math> <br><br />
'''Output:''' Boosted learner <math> \sum_{j∈L^∗}h_jλ_j^* </math> with base learners <math> h_j </math> and weights <math> λ_j^* </math> <br><br />
<br />
<ol><br />
<br />
<li> <math> T ← \{([0, 1]^N, \emptyset)\} </math> &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; // set of local bounds and learners for open subproblems </li><br />
<li> <math> U ← \infty, L^∗ ← \emptyset </math> &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; // Upper bound on optimal objective </li><br />
<li> '''while''' <math>\ T \neq \emptyset </math> '''do''' </li><br />
<li> &emsp; Choose and remove <math>(B,L) </math> from <math>T </math> </li><br />
<li> &emsp; '''repeat''' </li><br />
<li> &emsp; &emsp; Solve the primal IP using the local bounds on <math> z </math> in <math>B</math> with optimal dual solution <math> (w^∗, v^∗, u^∗) </math> </li><br />
<li> &emsp; &emsp; Find learner <math> h_j ∈ Ω </math> satisfying the pricing problem. &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; // Solve pricing problem. </li><br />
<li> &emsp; '''until''' <math> h_j </math> is not found </li> <br />
<li> &emsp; Let <math> (\widetilde{λ} , \widetilde{z}) </math> be the final solution of the primal IP with base learners <math> \widetilde{L} = \{j | \widetilde{λ}_j > 0\} </math> </li><br />
<li> &emsp; '''if''' <math> \widetilde{z} ∈ \mathbb{Z}^N </math> and <math> \sum^{N}_{i=1}\widetilde{z}_i < U </math> '''then''' </li><br />
<li> &emsp; &emsp; <math> U ← \sum^{N}_{i=1}\widetilde{z}_i, L^∗ ← \widetilde{L}, λ^∗ ← \widetilde{\lambda} </math> &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; // Update best solution. </li><br />
<li> &emsp; '''else''' </li><br />
<li> &emsp; &emsp; Choose <math> i ∈ [N] </math> with <math> \widetilde{z}_i \notin Z </math> </li><br />
<li> &emsp; &emsp; Set <math> B_0 ← B ∩ \{z_i ≤ 0\}, B_1 ← B ∩ \{z_i ≥ 1\} </math> </li><br />
<li> &emsp; &emsp; Add <math> (B_0,\widetilde{L}), (B_1,\widetilde{L}) </math> to <math> T </math>. &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; // Create new branching nodes. </li><br />
<li> &emsp; '''end''' if </li><br />
<li> '''end''' while </li><br />
<li> ''Optionally sparsify final solution <math>L^*</math>'' </li><br />
<br />
</ol><br />
</div><br />
<br />
== Results and Performance ==<br />
<br />
''All tests were run on identical Linux clusters with Intel Xeon Quad Core CPUs, with 3.50GHz, 10 MB cache, and 32 GB of main memory.''<br />
<br />
<br />
The following results reflect IPBoost's performance on hard instances; by hard instances we mean binary classification problems with predefined labels, tailored to the ±1 classification from learners. On every hard instance sample, IPBoost significantly outperforms both LPBoost and AdaBoost (although results may differ slightly depending on the libraries used in the implementation). For the considered instances, the best value of the margin ρ was 0.05 for LPBoost and IPBoost; AdaBoost has no margin parameter. The reported accuracy is the test accuracy averaged across several runs of the algorithm, <math>L </math> denotes the number of base learners required to find the optimal learner, N is the number of points, and <math> \gamma </math> is the noise level.<br />
<br />
[[File:ipboostres.png|center]]<br />
<br />
<br />
<br />
For the next table, we used classification instances from the LIBSVM data sets available at [https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/]. We report accuracies on the test set and training set, respectively; in each case, we report the average accuracy over 10 runs with different random seeds, together with its standard deviation. IPBoost again outperforms LPBoost and AdaBoost significantly. Solving integer programming problems is, unsurprisingly, more computationally expensive than traditional boosting methods like AdaBoost: the average run time of IPBoost (for ρ = 0.05) was 1367.78 seconds, compared to LPBoost's 164.35 seconds and AdaBoost's 3.59 seconds. On the flip side, we gain much better stability in the results, as well as higher scores across the board for both training and test sets.<br />
<br />
[[file:svmlibres.png|center]]<br />
<br />
<br />
== Conclusion ==<br />
<br />
IP-boosting avoids the bad performance on well-known hard classes and improves upon LP-boosting and AdaBoost on the LIBSVM instances, where even a few percent of improvement is valuable. The major drawback is that the running time of the current implementation is much longer. Nevertheless, the algorithm could be improved in the future by solving the intermediate LPs only approximately and by deriving tailored heuristics that generate good primal solutions to save time.<br />
<br />
Suffice it to say, the approach is very well suited to an offline setting in which training may take time and where even a small improvement is beneficial, or where convex boosters exhibit egregious behaviour.<br />
<br />
== References ==<br />
<br />
* Pfetsch, M. E., & Pokutta, S. (2020). IPBoost--Non-Convex Boosting via Integer Programming. arXiv preprint arXiv:2002.04679.<br />
<br />
* Freund, Y., & Schapire, R. E. (1995, March). A decision-theoretic generalization of on-line learning and an application to boosting. In European Conference on Computational Learning Theory (pp. 23–37). Springer, Berlin, Heidelberg.</div>

Streaming Bayesian Inference for Crowdsourced Classification (2020-11-17, Gtompkin: /* Empirical Analysis */)
<hr />
<div>Group 4 Paper Presentation Summary<br />
<br />
By Jonathan Chow, Nyle Dharani, Ildar Nasirov<br />
<br />
== Motivation ==<br />
Crowdsourcing can be a useful tool for data generation in classification projects. Often this takes the form of online questions which many respondents will manually answer for payment. One example of this is Amazon’s Mechanical Turk.<br />
<br />
The primary limitation with this form of acquiring data is that respondents are liable to submit incorrect responses. This results in datasets which are noisy and unreliable.<br />
<br />
The integrity of the data is then limited by how well ground-truth can be determined. The primary method for doing so is probabilistic inference. However, current methods are computationally expensive, lack theoretical guarantees, or are limited to specific settings.<br />
<br />
== Dawid-Skene Model for Crowdsourcing ==<br />
The one-coin Dawid-Skene model is popular for contextualizing crowdsourcing problems. For task <math>i</math> in set <math>M</math>, let the ground truth be the binary label <math>y_i \in \{\pm 1\}</math>. We get labels <math>X = \{x_{ij}\}</math>, where <math>j \in N</math> is the index of the worker.<br />
<br />
At each time step <math>t</math>, a worker <math>j = a(t) </math> provides their label for an assigned task <math>i</math>. We denote responses up to time <math>t</math> via superscript.<br />
<br />
We let <math>x_{ij} = 0</math> if worker <math>j</math> has not completed task <math>i</math>. We assume that <math>P(x_{ij} = y_i) = p_j</math>. This implies that each worker is independent and has equal probability of correct labelling regardless of task. In crowdsourcing the data, we must determine how workers are assigned to tasks. We introduce two methods.<br />
<br />
Under uniform sampling, workers are allocated to tasks such that each task is completed by the same number of workers, rounded to the nearest integer, and no worker completes a task more than once. This policy is given by <center><math>\pi_{uni}(t) = argmin_{i \notin M_{a(t)}^t}\{ | N_i^t | \}.</math></center><br />
<br />
Under uncertainty sampling, we assign more workers to tasks that are less certain. Assuming, we are able to estimate the posterior probability of ground-truth, we can allocate workers to the task with the lowest probability of falling into the predicted class. This policy is given by <center><math>\pi_{us}(t) = argmin_{i \notin M_{a(t)}^t}\{ (max_{k \in \{\pm 1\}} ( P(y_i = k | X^t) ) \}.</math></center><br />
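As an illustration, the two assignment policies can be sketched as follows (a minimal Python sketch with hypothetical helper names; the posterior estimates are assumed given):<br />

```python
def uniform_sampling(n_labels_per_task, done_by_worker):
    """pi_uni: among tasks this worker has not yet completed,
    pick the one with the fewest labels so far."""
    candidates = [i for i in range(len(n_labels_per_task))
                  if i not in done_by_worker]
    return min(candidates, key=lambda i: n_labels_per_task[i])

def uncertainty_sampling(posterior_pos, done_by_worker):
    """pi_us: among tasks this worker has not yet completed, pick the
    one whose predicted class has the lowest posterior probability."""
    candidates = [i for i in range(len(posterior_pos))
                  if i not in done_by_worker]
    # certainty of task i = max(P(y_i = +1 | X^t), P(y_i = -1 | X^t))
    return min(candidates,
               key=lambda i: max(posterior_pos[i], 1 - posterior_pos[i]))
```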
<br />
We then need to aggregate the data. The simple method of majority voting makes predictions for a given task based on the class the most workers have assigned it, <math>\hat{y}_i = sign\{\sum_{j \in N_i} x_{ij}\}</math>.<br />
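A one-line sketch of this aggregation rule (ties are broken toward +1, a convention this summary does not specify):<br />

```python
def majority_vote(labels):
    """labels: one task's worker responses x_ij in {+1, -1}.
    Predict the class given by the sign of their sum."""
    s = sum(labels)
    return 1 if s >= 0 else -1
```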
<br />
== Streaming Bayesian Inference for Crowdsourced Classification (SBIC) ==<br />
The aim of the SBIC algorithm is to estimate the posterior probability, <math>P(y, p | X^t, \theta)</math> where <math>X^t</math> are the observed responses at time <math>t</math> and <math>\theta</math> is our prior. We can then generate predictions <math>\hat{y}^t</math> as the marginal probability over each <math>y_i</math> given <math>X^t</math>, and <math>\theta</math>.<br />
<br />
We factor <math>P(y, p | X^t, \theta) \approx \prod_{i \in M} \mu_i^t (y_i) \prod_{j \in N} \nu_j^t (p_j) </math> where <math>\mu_i^t</math> corresponds to each task and <math>\nu_j^t</math> to each worker.<br />
<br />
We then sequentially optimize the factors <math>\mu^t</math> and <math>\nu^t</math>. We begin by assuming that the worker accuracy follows a beta distribution with parameters <math>\alpha</math> and <math>\beta</math>. Initialize the task factors <math>\mu_i^0(+1) = q</math> and <math>\mu_i^0(-1) = 1 - q</math> for all <math>i</math>.<br />
<br />
When a new label is observed at time <math>t</math>, we update the <math>\nu_j^t</math> of worker <math>j</math>. We then update <math>\mu_i</math>. These updates are given by<br />
<br />
<center><math>\nu_j^t(p_j) \sim Beta(\sum_{i \in M_j^{t - 1}} \mu_i^{t - 1}(x_{ij}) + \alpha, \sum_{i \in M_j^{t - 1}} \mu_i^{t - 1}(-x_{ij}) + \beta) </math></center><br />
<br />
<center><math>\mu_i^t(y_i) \propto \begin{cases} \mu_i^{t - 1}(y_i)\overline{p}_j^t & x_{ij} = y_i \\ \mu_i^{t - 1}(y_i)(1 - \overline{p}_j^t) & x_{ij} \ne y_i \end{cases}</math></center><br />
where <math>\overline{p}_j^t = \frac{\sum_{i \in M_j^{t - 1}} \mu_i^{t - 1}(x_{ij}) + \alpha}{|M_j^{t - 1}| + \alpha + \beta }</math>.<br />
<br />
We choose our predictions to be the maximum <math>\mu_i^t(k) </math> for <math>k=-1,1</math>.<br />
<br />
Depending on how we order the labels <math>X</math>, we can tailor the algorithm to different applications.<br />
<br />
== Fast SBIC ==<br />
The pseudocode for Fast SBIC is shown below.<br />
<br />
<center>[[Image:FastSBIC.png|800px|]]</center><br />
<br />
As the name implies, the goal with this algorithm is speed. To facilitate this, we leave the order of <math>X</math> unchanged.<br />
<br />
We express <math>\mu_i^t</math> in terms of its log-odds<br />
<center><math>z_i^t = log(\frac{\mu_i^t(+1)}{ \mu_i^t(-1)}) = z_i^{t - 1} + x_{ij} log(\frac{\overline{p}_j^t}{1 - \overline{p}_j^t })</math></center><br />
where <math>z_i^0 = log(\frac{q}{1 - q})</math>.<br />
<br />
The product chain then becomes a summation and removes the need to normalize each <math>\mu_i^t</math>. We use these log-odds to compute worker accuracy,<br />
<br />
<center><math>\overline{p}_j^t = \frac{\sum_{i \in M_j^{t - 1}} sig(x_{ij} z_i^{t-1}) + \alpha}{|M_j^{t - 1}| + \alpha + \beta}</math></center><br />
where <math>sig(z_i^{t-1}) := \frac{1}{1 + exp(-z_i^{t - 1})} = \mu_i^{t - 1}(+1) </math><br />
<br />
The final predictions are made by choosing class <math>\hat{y}_i^T = sign(z_i^T) </math>. We see later that Fast SBIC has similar computational speed to majority voting.<br />
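As a concrete illustration, the streaming updates above can be sketched in a few lines (hypothetical function and variable names; the prior parameters <math>\alpha=2</math>, <math>\beta=1</math> and <math>q=0.5</math> are illustrative defaults, not values from the paper):<br />

```python
import math

def fast_sbic(labels, n_tasks, n_workers, q=0.5, alpha=2.0, beta=1.0):
    """Fast SBIC sketch. labels: sequence of (task i, worker j,
    x_ij in {+1, -1}) in arrival order; returns sign(z_i) per task."""
    z = [math.log(q / (1 - q))] * n_tasks          # log-odds z_i^0
    seen = [[] for _ in range(n_workers)]          # (task, label) per worker
    sig = lambda t: 1.0 / (1.0 + math.exp(-t))
    for i, j, x in labels:
        # posterior-mean accuracy of worker j from current task beliefs
        num = sum(sig(xij * z[ii]) for ii, xij in seen[j]) + alpha
        p_bar = num / (len(seen[j]) + alpha + beta)
        z[i] += x * math.log(p_bar / (1 - p_bar))  # log-odds update
        seen[j].append((i, x))
    return [1 if zi >= 0 else -1 for zi in z]      # y_hat = sign(z_i)
```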
<br />
== Sorted SBIC ==<br />
To increase the accuracy of the SBIC algorithm in exchange for computational efficiency, we run the algorithm in parallel giving labels in different orders. The pseudocode for this algorithm is given below.<br />
<br />
<center>[[Image:SortedSBIC.png|800px|]]</center><br />
<br />
From the general discussion of SBIC, we know that predictions on task <math>i</math> are more accurate toward the end of the collection process. This is a result of observing more data points and having run more updates on <math>\mu_i^t</math> and <math>\nu_j^t</math> to move them further from their prior. This means that task <math>i</math> is predicted more accurately when its corresponding labels are seen closer to the end of the process.<br />
<br />
We take advantage of this property by maintaining a distinct “view” of the log-odds for each task. When a label is observed, we update views for all tasks except the one for which the label was observed. At the end of the collection process, we process skipped labels. When run online, this process must be repeated at every timestep.<br />
<br />
We see that Sorted SBIC is slower than Fast SBIC by a factor of <math>M</math>, the number of tasks.<br />
<br />
== Theoretical Analysis ==<br />
The authors prove an exponential relationship between the error probability and number of labels per task. The two theorems, for the different sampling regimes, are presented below.<br />
<br />
<center>[[Image:Theorem1.png|800px|]]</center><br />
<br />
<center>[[Image:Theorem2.png|800px|]]</center><br />
<br />
== Empirical Analysis ==<br />
The purpose of the empirical analysis is to compare SBIC to existing state-of-the-art algorithms. The SBIC algorithm is run on five real-world binary classification datasets. The results can be found in the table below. The other algorithms in the comparison are, from left to right, majority voting, expectation-maximization, mean-field, belief propagation, Monte Carlo sampling, and triangular estimation. <br />
<br />
First, the algorithms are run on synthetic data that meets the assumptions of the underlying one-coin Dawid-Skene model, which allows the authors to compare SBIC's empirical performance with the theoretical results previously shown. <br />
<br />
<center>[[Image:RealWorldResults.png|800px|]]</center><br />
<br />
In bold are the best performing algorithms for each dataset. We see that both versions of the SBIC algorithm are competitive, having similar prediction error to EM, AMF, and MC. All are considered state-of-the-art Bayesian algorithms.<br />
<br />
The figure below shows the average time required to simulate predictions on synthetic data under an uncertainty sampling policy. We see that Fast SBIC is comparable to majority voting and significantly faster than the other algorithms. This speed improvement, coupled with comparable accuracy, makes the Fast SBIC algorithm powerful.<br />
<br />
<center>[[Image:TimeRequirement.png|800px|]]</center><br />
<br />
== Conclusion and Future Research ==<br />
In conclusion, we have seen that SBIC is computationally efficient, accurate in practice, and has theoretical guarantees. The authors intend to extend the algorithm to the multi-class case in the future.<br />
<br />
== Critique ==<br />
In crowdsourcing data, the cost associated with collecting additional labels is not usually prohibitively expensive. As a result, if there is concern over ground-truth, paying for additional data to ensure <math>X</math> is sufficiently dense may be the desired response as opposed to sacrificing ground-truth accuracy. This may result in the SBIC algorithm being less practically useful than intended.<br />
<br />
== References ==<br />
[1] Manino, Tran-Thanh, and Jennings. Streaming Bayesian Inference for Crowdsourced Classification. 33rd Conference on Neural Information Processing Systems, 2019</div>Gtompkinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441F21&diff=44919stat441F212020-11-16T16:45:27Z<p>Gtompkin: /* Paper presentation */</p>
<hr />
<div><br />
<br />
== [[F20-STAT 441/841 CM 763-Proposal| Project Proposal ]] ==<br />
<br />
<!--[https://goo.gl/forms/apurag4dr9kSR76X2 Your feedback on presentations]--><br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/10CHiJpAylR6kB9QLqN7lZHN79D9YEEW6CDTH27eAhbQ/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="250pt"|Name <br />
|width="15pt"|Paper number <br />
|width="700pt"|Title<br />
|width="15pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|width="30pt"|Link to the video<br />
|-<br />
|Sep 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary] || [https://youtu.be/JWozRg_X-Vg?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG&t=539]<br />
|-<br />
|Week of Nov 16 ||Sharman Bharat, Li Dylan,Lu Leonie, Li Mingdao || 1|| Risk prediction in life insurance industry using supervised learning algorithms || [https://rdcu.be/b780J Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Bsharman Summary] ||<br />
[https://www.youtube.com/watch?v=TVLpSFYgF0c&feature=youtu.be]<br />
|-<br />
|Week of Nov 16 || Delaney Smith, Mohammad Assem Mahmoud || 2|| Influenza Forecasting Framework based on Gaussian Processes || [https://proceedings.icml.cc/static/paper_files/icml/2020/1239-Paper.pdf Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Influenza_Forecasting_Framework_based_on_Gaussian_Processes Summary]|| [https://www.youtube.com/watch?v=HZG9RAHhpXc&feature=youtu.be]<br />
|-<br />
|Week of Nov 16 || Tatianna Krikella, Swaleh Hussain, Grace Tompkins || 3|| Processing of Missing Data by Neural Networks || [http://papers.nips.cc/paper/7537-processing-of-missing-data-by-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Gtompkin Summary] || [https://learn.uwaterloo.ca/d2l/ext/rp/577051/lti/framedlaunch/6ec1ebea-5547-46a2-9e4f-e3dc9d79fd54] ||<br />
|-<br />
|Week of Nov 16 ||Jonathan Chow, Nyle Dharani, Ildar Nasirov ||4 ||Streaming Bayesian Inference for Crowdsourced Classification ||[https://papers.nips.cc/paper/9439-streaming-bayesian-inference-for-crowdsourced-classification.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Matthew Hall, Johnathan Chalaturnyk || 5|| Neural Ordinary Differential Equations || [https://papers.nips.cc/paper/7892-neural-ordinary-differential-equations.pdf] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Neural_ODEs Summary]||<br />
|-<br />
|Week of Nov 16 || Luwen Chang, Qingyang Yu, Tao Kong, Tianrong Sun || 6|| Adversarial Attacks on Copyright Detection Systems || Paper [https://proceedings.icml.cc/static/paper_files/icml/2020/1894-Paper.pdf] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adversarial_Attacks_on_Copyright_Detection_Systems Summary] ||<br />
|-<br />
|Week of Nov 16 || Casey De Vera, Solaiman Jawad, Jihoon Han || 7|| IPBoost – Non-Convex Boosting via Integer Programming || [https://arxiv.org/pdf/2002.04679.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Yuxin Wang, Evan Peters, Yifan Mou, Sangeeth Kalaichanthiran || 8|| What Game Are We Playing? End-to-end Learning in Normal and Extensive Form Games || [https://arxiv.org/pdf/1805.02777.pdf] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=what_game_are_we_playing Summary] || [https://www.youtube.com/watch?v=9qJoVxo3hnI&feature=youtu.be]<br />
|-<br />
|Week of Nov 16 || Yuchuan Wu || 9|| || || ||<br />
|-<br />
|Week of Nov 16 || Zhou Zeping, Siqi Li, Yuqin Fang, Fu Rao || 10|| A survey of neural network-based cancer prediction models from microarray data || [https://www.sciencedirect.com/science/article/pii/S0933365717305067] || ||<br />
|-<br />
|Week of Nov 23 ||Jinjiang Lian, Jiawen Hou, Yisheng Zhu, Mingzhe Huang || 11|| DROCC: Deep Robust One-Class Classification || [https://proceedings.icml.cc/static/paper_files/icml/2020/6556-Paper.pdf paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:J46hou Summary] ||<br />
|-<br />
|Week of Nov 23 || Bushra Haque, Hayden Jones, Michael Leung, Cristian Mustatea || 12|| Combine Convolution with Recurrent Networks for Text Classification || [https://arxiv.org/pdf/2006.15795.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Taohao Wang, Zeren Shen, Zihao Guo, Rui Chen || 13|| Deep multiple instance learning for image classification and auto-annotation || [https://jiajunwu.com/papers/dmil_cvpr.pdf paper] || ||<br />
|-<br />
|Week of Nov 23 || Qianlin Song, William Loh, Junyue Bai, Phoebe Choi || 14|| Task Understanding from Confusing Multi-task Data || [https://proceedings.icml.cc/static/paper_files/icml/2020/578-Paper.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Task_Understanding_from_Confusing_Multi-task_Data Summary] ||<br />
|-<br />
|Week of Nov 23 || Rui Gong, Xuetong Wang, Xinqi Ling, Di Ma || 15|| Semantic Relation Classification via Convolution Neural Network|| [https://www.aclweb.org/anthology/S18-1127.pdf paper] || ||<br />
|-<br />
|Week of Nov 23 || Xiaolan Xu, Robin Wen, Yue Weng, Beizhen Chang || 16|| Graph Structure of Neural Networks || [https://proceedings.icml.cc/paper/2020/file/757b505cfd34c64c85ca5b5690ee5293-Paper.pdf Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title= Summary] ||<br />
|-<br />
|Week of Nov 23 ||Hansa Halim, Sanjana Rajendra Naik, Samka Marfua, Shawrupa Proshasty || 17|| Superhuman AI for multiplayer poker || [https://www.cs.cmu.edu/~noamb/papers/19-Science-Superhuman.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 ||Guanting Pan, Haocheng Chang, Zaiwei Zhang || 18|| Point-of-Interest Recommendation: Exploiting Self-Attentive Autoencoders with Neighbor-Aware Influence || [https://arxiv.org/pdf/1809.10770.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Jerry Huang, Daniel Jiang, Minyan Dai || 19|| Neural Speed Reading Via Skim-RNN ||[https://arxiv.org/pdf/1711.02085.pdf?fbclid=IwAR3EeFsKM_b5p9Ox7X9mH-1oI3U3oOKPBy3xUOBN0XvJa7QW2ZeJJ9ypQVo Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Neural_Speed_Reading_via_Skim-RNN Summary]||<br />
|-<br />
|Week of Nov 23 ||Ruixian Chin, Yan Kai Tan, Jason Ong, Wen Cheen Chiew || 20|| DivideMix: Learning with Noisy Labels as Semi-supervised Learning || [https://openreview.net/pdf?id=HJgExaVtwr] || ||<br />
|-<br />
|Week of Nov 30 || Banno Dion, Battista Joseph, Kahn Solomon || 21|| Music Recommender System Based on Genre using Convolutional Recurrent Neural Networks || [https://www.sciencedirect.com/science/article/pii/S1877050919310646] || ||<br />
|-<br />
|Week of Nov 30 || Sai Arvind Budaraju, Isaac Ellmen, Dorsa Mohammadrezaei, Emilee Carson || 22|| A universal SNP and small-indel variant caller using deep neural networks||[https://www.nature.com/articles/nbt.4235.epdf?author_access_token=q4ZmzqvvcGBqTuKyKgYrQ9RgN0jAjWel9jnR3ZoTv0NuM3saQzpZk8yexjfPUhdFj4zyaA4Yvq0LWBoCYQ4B9vqPuv8e2HHy4vShDgEs8YxI_hLs9ov6Y1f_4fyS7kGZ Paper] || ||<br />
|-<br />
|Week of Nov 30 || Daniel Fagan, Cooper Brooke, Maya Perelman || 23|| Efficient kNN Classification With Different Number of Nearest Neighbors || [https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7898482 Paper] || ||<br />
|-<br />
|Week of Nov 30 || Karam Abuaisha, Evan Li, Jason Pu, Nicholas Vadivelu || 24|| Being Bayesian about Categorical Probability || [https://proceedings.icml.cc/static/paper_files/icml/2020/3560-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 || Anas Mahdi Will Thibault Jan Lau Jiwon Yang || 25|| Loss Function Search for Face Recognition<br />
|| [https://proceedings.icml.cc/static/paper_files/icml/2020/245-Paper.pdf] paper || ||<br />
|-<br />
|Week of Nov 30 ||Zihui (Betty) Qin, Wenqi (Maggie) Zhao, Muyuan Yang, Amartya (Marty) Mukherjee || 26|| Deep Learning for Cardiologist-level Myocardial Infarction Detection in Electrocardiograms || [https://arxiv.org/pdf/1912.07618.pdf?fbclid=IwAR0RwATSn4CiT3qD9LuywYAbJVw8YB3nbex8Kl19OCExIa4jzWaUut3oVB0 Paper] || Summary [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Learning_for_Cardiologist-level_Myocardial_Infarction_Detection_in_Electrocardiograms&fbclid=IwAR1Tad2DAM7LT6NXXuSYDZtHHBvN0mjZtDdCOiUFFq_XwVcQxG3hU-3XcaE] ||<br />
|-<br />
|Week of Nov 30 || Stan Lee, Seokho Lim, Kyle Jung, Daehyun Kim || 27|| Bag of Tricks for Efficient Text Classification || [https://arxiv.org/pdf/1607.01759.pdf paper] || ||<br />
|-<br />
|Week of Nov 30 || Yawen Wang, Danmeng Cui, ZiJie Jiang, Mingkang Jiang, Haotian Ren, Haris Bin Zahid || 28|| A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques || [https://arxiv.org/pdf/1707.02919.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Describtion_of_Text_Mining Summary] ||<br />
|-<br />
|Week of Nov 30 || Qing Guo, XueGuang Ma, James Ni, Yuanxin Wang || 29|| Mask R-CNN || [https://arxiv.org/pdf/1703.06870.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 || Bertrand Sodjahin, Junyi Yang, Jill Yu Chieh Wang, Yu Min Wu, Calvin Li || 30|| Research paper classification systems based on TF‑IDF and LDA schemes || [https://hcis-journal.springeropen.com/articles/10.1186/s13673-019-0192-7?fbclid=IwAR3swO-eFrEbj1BUQfmomJazxxeFR6SPgr6gKayhs38Y7aBG-zX1G3XWYRM Paper] || ||<br />
|-<br />
|Week of Nov 30 || Daniel Zhang, Jacky Yao, Scholar Sun, Russell Parco, Ian Cheung || 31 || Speech2Face: Learning the Face Behind a Voice || [https://arxiv.org/pdf/1905.09773.pdf?utm_source=thenewstack&utm_medium=website&utm_campaign=platform Paper] || ||<br />
|-<br />
|Week of Nov 30 || Siyuan Xia, Jiaxiang Liu, Jiabao Dong, Yipeng Du || 32 || Evaluating Machine Accuracy on ImageNet || [https://proceedings.icml.cc/static/paper_files/icml/2020/6173-Paper.pdf] || ||<br />
|-<br />
|Week of Nov 30 || Msuhi Wang, Siyuan Qiu, Yan Yu || 33 || Surround Vehicle Motion Prediction Using LSTM-RNN for Motion Planning of Autonomous Vehicles at Multi-Lane Turn Intersections || [https://ieeexplore.ieee.org/abstract/document/8957421 paper] || ||</div>Gtompkinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adversarial_Attacks_on_Copyright_Detection_Systems&diff=44913Adversarial Attacks on Copyright Detection Systems2020-11-16T15:42:08Z<p>Gtompkin: /* Introduction */</p>
<hr />
<div>== Presented by == <br />
Luwen Chang, Qingyang Yu, Tao Kong, Tianrong Sun<br />
<br />
==Introduction ==<br />
Copyright detection systems are among the most commonly used machine learning systems. However, the robustness of copyright detection and content control systems to adversarial attacks (inputs intentionally designed to cause a model to make a mistake) has not been widely addressed by the public and remains largely unstudied. Copyright detection systems are vulnerable to attacks for three reasons:<br />
<br />
1. Unlike physical-world attacks, where adversarial samples need to survive under different conditions such as resolutions and viewing angles, digital files can be uploaded directly to the web without going through a camera or microphone.<br />
<br />
2. The detection system is open-set, meaning an uploaded file may not correspond to any existing class. The system must therefore avoid flagging unprotected audio/video, since most files uploaded nowadays are not protected.<br />
<br />
3. The detection system needs to handle a vast amount of content with different labels but similar features. For example, in the ImageNet classification task, the system is easily attacked when two cats/dogs/birds are highly similar but belong to different classes.<br />
<br />
<br />
In this paper, different types of copyright detection systems are introduced. A widely used detection model from Shazam, a popular app for recognizing music, is discussed. Next, the paper describes how to generate audio fingerprints using a convolutional neural network and formulates the adversarial loss function for standard gradient methods. An example of remixing music shows how adversarial examples can be created. The adversarial attacks are then applied to industrial systems such as AudioTag and YouTube Content ID to evaluate their effectiveness, and a conclusion is drawn at the end.<br />
<br />
== Types of copyright detection systems ==<br />
A fingerprinting algorithm extracts features of the source file as a hash and then compares it to the copyright-protected material in the database. If enough matches are found between the source and the existing data, the copyright detection system can reject the copyright declaration of the source. Most audio, image, and video fingerprinting algorithms work either by training a neural network to output features or by extracting hand-crafted features.<br />
<br />
For video fingerprinting, one useful algorithm detects the entering/leaving times of the objects in the video (Saviaga & Toxtli, 2018). The final hash consists of the entering/leaving of different objects and the unique relationships among them. However, most of these video fingerprinting algorithms train their neural networks only on simple distortions, such as adding noise or flipping the video, rather than on adversarial perturbations. As a result, these algorithms are robust to pre-defined distortions but not to adversarial attacks.<br />
<br />
Moreover, some plagiarism detection systems also depend on neural networks to generate a fingerprint of the input document. Though using deep feature representations as fingerprints is effective in detecting plagiarism, it may still be weak against adversarial attacks.<br />
<br />
Audio fingerprinting may fare better than the algorithms above since, most of the time, the hash is generated by extracting hand-crafted features rather than by training a neural network. But it is still easy to attack.<br />
<br />
== Case study: evading audio fingerprinting ==<br />
<br />
=== Audio Fingerprinting Model===<br />
The audio fingerprinting model plays an important role in copyright detection. It is useful for quickly locating or finding similar samples inside an audio database. Shazam, a popular music recognition application, uses one of the most well-known fingerprinting models. Satisfying three principles (temporal locality, translation invariance, and robustness), the Shazam algorithm is considered a good fingerprinting algorithm. It shows strong robustness even in the presence of noise by using local maxima of the spectrogram to form hashes.<br />
<br />
=== Interpreting the fingerprint extractor as a CNN ===<br />
The intention of this section is to build a differentiable neural network whose function resembles that of an audio fingerprinting algorithm, which is well known for its ability to identify meta-data (i.e., song names, artists, and albums) independently of the audio format (Group et al., 2005). The generic neural network is then used to mount black-box attacks on popular real-world systems, in this case YouTube and AudioTag. <br />
<br />
The generic neural network model consists of two convolutional layers and a max-pooling layer, which is used for dimension reduction, as depicted in the figure below. As mentioned above, the convolutional neural network is well known for its temporal locality and translation invariance. The purpose of this network is to generate audio fingerprinting signals that extract features uniquely identifying a signal, regardless of the starting and ending times of the inputs.<br />
<br />
[[File:cov network.png | thumb | center | 500px ]]<br />
<br />
When an audio sample enters the neural network, it is first transformed by the initial network layer, which can be described as a normalized Hann function. The form of the function is shown below, with <math>N</math> being the width of the kernel. <br />
<br />
$$ f_{1}(n)=\frac{\sin^2(\frac{\pi n}{N})}{\sum_{m=0}^{N-1}\sin^2(\frac{\pi m}{N})} $$ <br />
<br />
The intention of the normalized Hann function is to smooth the adversarial perturbation of the input audio signal, which removes the discontinuity as well as the bad spectral properties. This transformation enhances the efficiency of black-box attacks that is later implemented.<br />
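As an illustration, the normalized Hann kernel can be computed as follows (a sketch with a hypothetical function name; <math>N</math> is the kernel width):<br />

```python
import numpy as np

def normalized_hann(N):
    """First-layer kernel f1: a Hann window normalized to sum to 1,
    used to smooth the input waveform."""
    n = np.arange(N)
    w = np.sin(np.pi * n / N) ** 2
    return w / w.sum()
```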
<br />
The next convolutional layer applies a Short-Time Fourier Transform to the input signal, computing the spectrogram of the waveform and converting the input into a feature representation. Once the input signal enters this network layer, it is transformed by the convolutional function below. <br />
<br />
$$f_{2}(k,n)=e^{-i 2 \pi k n / N} $$<br />
where <math>k \in \{0,1,...,N-1\}</math> (output channel index) and <math>n \in \{0,1,...,N-1\}</math> (index of filter coefficient)<br />
<br />
The output of this layer is described as φ(x) (x being the input signal), a feature representation of the audio signal sample. <br />
However, this representation is flawed: it is vulnerable to noise and perturbation, and it is difficult to store and inspect. Therefore, a max-pooling layer is applied to φ(x), in which the network computes a local maximum using a max-pooling function. This network layer outputs a binary fingerprint ψ(x) (x being the input signal) that will be used later to search for a signal against a database of previously processed signals.<br />
<br />
=== Formulating the adversarial loss function ===<br />
<br />
In the previous section, local maxima of the spectrogram are used by the CNN to generate fingerprints, but a loss quantifying how similar two fingerprints are has not yet been defined. Once the loss is found, standard gradient methods can be used to find a perturbation <math>{\delta}</math> that can be added to a signal to trick the copyright detection system. A bound is also set to ensure the generated adversarial example stays close to the original audio signal. <br />
$$\text{bound:}\ ||\delta||_p\le\epsilon$$<br />
<br />
where <math>{||\delta||_p}</math> is the <math>{l_p}</math>-norm of the perturbation and <math>{\epsilon}</math> bounds the difference between the original file and the adversarial example. <br />
<br />
<br />
To compare how similar two binary fingerprints are, the Hamming distance is employed. The Hamming distance between two strings is the number of positions at which they differ (Hamming distance, 2020). For example, the Hamming distance between 101100 and 100110 is 2. <br />
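For instance, the Hamming distance in the example above can be checked directly:<br />

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    assert len(a) == len(b)
    return sum(ca != cb for ca, cb in zip(a, b))
```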
<br />
Let <math>{\psi(x)}</math> and <math>{\psi(y)}</math> be two binary fingerprints outputted from the model, the number of peaks shared by <math>{x}</math> and <math>{y}</math> can be found through <math>{|\psi(x)\cdot\psi(y)|}</math>. Now, to get a differentiable loss function, the equation is found to be <br />
<br />
$$J(x,y)=|\phi(x)\cdot\psi(x)\cdot\psi(y)|$$<br />
<br />
<br />
This is effective for white-box attacks, where the fingerprinting system is known. However, the loss can be minimized trivially by shifting the locations of the peaks by one pixel, which would not transfer reliably to black-box industrial systems. To make the attack more transferable, a new loss function is proposed that forces larger movements of the local maxima of the spectrogram. The idea is to move the locations of the peaks of <math>{\psi(x)}</math> outside the neighborhoods of the peaks of <math>{\psi(y)}</math>. To implement this efficiently, two max-pooling layers are used: one with a larger width <math>{w_1}</math> and one with a smaller width <math>{w_2}</math>. For any location, if the output of the <math>{w_1}</math> pooling is strictly greater than the output of the <math>{w_2}</math> pooling, then no peak lies within radius <math>{w_2}</math> of that location. <br />
<br />
The loss function is as the following:<br />
<br />
$$J(x,y) = \sum_i\bigg(ReLU\bigg(c-\bigg(\underset{|j| \leq w_1}{\max}\phi(i+j;x)-\underset{|j| \leq w_2}{\max}\phi(i+j;x)\bigg)\bigg)\cdot\psi(i;y)\bigg)$$<br />
The equation above penalizes the peaks of <math>{x}</math> that lie within radius <math>{w_2}</math> of the peaks of <math>{y}</math>. <math>{ReLU}</math> is used as the activation function, and <math>{c}</math> is a margin imposed on the difference between the outputs of the two max-pooling layers. <br />
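A one-dimensional numpy sketch of this two-width max-pooling test (hypothetical helper names; windows are clipped at the edges):<br />

```python
import numpy as np

def local_max_pool(x, w):
    """Max over a window of radius w around each index (edge-clipped)."""
    n = len(x)
    return np.array([x[max(0, i - w):min(n, i + w + 1)].max()
                     for i in range(n)])

def no_peak_nearby(x, w1, w2):
    """True at index i when the wide-window max strictly exceeds the
    narrow-window max, i.e. no local peak lies within radius w2 of i."""
    return local_max_pool(x, w1) > local_max_pool(x, w2)
```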
<br />
<br />
Lastly, instead of the maximum operator, smoothed max function is used here:<br />
$$S_\alpha(x_1,x_2,...,x_n) = \frac{\sum_{i=1}^{n}x_ie^{\alpha x_i}}{\sum_{i=1}^{n}e^{\alpha x_i}}$$<br />
where <math>{\alpha}</math> is a smoothing hyperparameter. As <math>{\alpha}</math> approaches positive infinity, <math>{S_\alpha}</math> approaches the actual max function. <br />
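A numerically stable sketch of this smoothed maximum:<br />

```python
import numpy as np

def smooth_max(x, alpha):
    """Differentiable softmax-weighted maximum S_alpha(x);
    approaches max(x) as alpha -> +infinity, the mean at alpha = 0."""
    x = np.asarray(x, dtype=float)
    # shifting by x.max() leaves the ratio unchanged but avoids overflow
    w = np.exp(alpha * (x - x.max()))
    return float((x * w).sum() / w.sum())
```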
<br />
To summarize, the optimization problem can be formulated as the following:<br />
<br />
$$<br />
\underset{\delta}{\min}J(x+\delta,x)\\<br />
s.t.||\delta||_{\infty}\le\epsilon<br />
$$<br />
where <math>{x}</math> is the input signal, <math>{J}</math> is the loss function with the smoothed max function.<br />
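This constrained minimization can be attacked with projected gradient descent. Below is a generic sketch, not the paper's exact optimizer: <code>grad_J</code> is an assumed callback returning the gradient of the loss with respect to the perturbed input, and the projection enforces the <math>{l_\infty}</math> bound.<br />

```python
import numpy as np

def pgd_linf(x, grad_J, eps, step, n_iter=100):
    """Projected gradient descent for min_delta J(x + delta)
    subject to ||delta||_inf <= eps."""
    delta = np.zeros_like(x)
    for _ in range(n_iter):
        delta -= step * grad_J(x + delta)   # gradient step on the loss
        delta = np.clip(delta, -eps, eps)   # project onto the l_inf ball
    return delta
```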
<br />
=== Remix adversarial examples===<br />
While solving the optimization problem yields an example that fools the copyright detection system, the result could sound unnatural because of the perturbations.<br />
<br />
Instead, the fingerprinting could be made in a more natural way (i.e., a different audio signal). <br />
<br />
Modifying the loss function by switching the order of the max-pooling widths in its smoothed-maximum components yields a remix loss that makes the two signals <math>{x}</math> and <math>{y}</math> look as similar as possible.<br />
<br />
$$J_{remix}(x,y) = \sum_i\bigg(ReLU\bigg(c-\bigg(\underset{|j| \leq w_2}{\max}\phi(i+j;x)-\underset{|j| \leq w_1}{\max}\phi(i+j;x)\bigg)\bigg)\cdot\psi(i;y)\bigg)$$<br />
<br />
Adding this new loss term yields a new optimization problem. <br />
<br />
$$<br />
\underset{\delta}{\min} \; J(x+\delta,x) + \lambda J_{remix}(x+\delta,y)\\<br />
\text{s.t. } ||\delta||_{p}\le\epsilon<br />
$$<br />
<br />
where <math>{\lambda}</math> is a scalar parameter that controls the similarity of <math>{x+\delta}</math> and <math>{y}</math>.<br />
<br />
This optimization problem generates an adversarial example from the selected source while also enforcing similarity to another signal. The result is called a remix adversarial example because it references both its source signal and another signal.<br />
<br />
== Evaluating transfer attacks on industrial systems==<br />
The effectiveness of default and remix adversarial examples is tested through white-box attacks on the proposed model and black-box attacks on two real-world audio copyright detection systems, AudioTag and the YouTube "Content ID" system. The <math>{l_{\infty}}</math> norm and <math>{l_{2}}</math> norm of the perturbations are the two measures of modification. Both are calculated after normalizing the signals so that sample values lie between 0 and 1.<br />
<br />
Before evaluating black-box attacks against real-world systems, white-box attacks against the proposed model are used to establish a baseline for the adversarial examples' effectiveness. The loss function <math>{J(x,y)=|\phi(x)\cdot\psi(x)\cdot\psi(y)|}</math> is used to generate white-box attacks. By optimizing this loss, the imperceptible fingerprints of the audio can be changed or removed by the added noise.<br />
<br />
[[File:Table_1_White-box.jpg |center ]]<br />
<br />
<div align="center">Table 1: Norms of the perturbations for white-box attacks</div><br />
<br />
In black-box attacks, the AudioTag system is found to be relatively sensitive to the attacks: it detects the songs from the benign signals but fails to detect both the default and remix adversarial examples. Based on these experimental observations, the architectures of the AudioTag fingerprint model and the surrogate CNN model are conjectured to be similar. <br />
<br />
Similar to AudioTag, the YouTube "Content ID" system successfully identified benign songs but failed to detect the adversarial examples. However, fooling the YouTube Content ID system required a larger value of the parameter <math>{\epsilon}</math>, suggesting that it has a more robust fingerprint model.<br />
<br />
<br />
[[File:Table_2_Black-box.jpg |center]]<br />
<br />
<div align="center">Table 2: Norms of the perturbations for black-box attacks</div><br />
<br />
[[File:YouTube_Figure.jpg |center]]<br />
<br />
<div align="center">Figure 2: YouTube’s copyright detection recall against the magnitude of noise</div><br />
<br />
== Conclusion ==<br />
In conclusion, many industrial copyright detection systems used by popular video and music websites, such as YouTube and AudioTag, are significantly vulnerable to adversarial attacks established in the existing literature. By building a simple music identification system resembling that of Shazam using a neural network and attacking it with well-known gradient methods, this paper demonstrated the lack of robustness of current online detectors. The intention of this paper is to raise awareness of the vulnerability of current online systems to adversarial attacks and to emphasize the importance of hardening copyright detection systems. Further approaches, such as adversarial training, need to be developed and examined in order to protect against the threat of adversarial copyright attacks.<br />
<br />
== Critiques ==<br />
- The experiments in this paper appear to be a proof-of-concept rather than a thorough evaluation of a model. One problem is that the norm is used to evaluate the perturbation. Unlike norms in the image domain, which can be visualized and easily understood, perturbations in the audio domain are more difficult to comprehend. A cognitive study or user study might need to be conducted to understand this. A related question: if the random noise is 2x or 3x bigger in terms of norm, does this make a large difference when listening to it? Are both perturbations very obvious, or both unnoticeable? In addition, it seems that a dataset was built, but its statistics are missing. Finally, no baseline methods are compared against in this paper, not even an ablation study; the two proposed methods (default and remix) seem to perform similarly.<br />
<br />
== References ==<br />
<br />
Cano, P., Batlle, E., Kalker, T., &amp; Haitsma, J. (2005, November 01). A Review of Audio Fingerprinting. ''Journal of VLSI Signal Processing'', 41, 271–284. Retrieved November 13, 2020, from https://dl.acm.org/doi/10.1007/s11265-005-4151-3<br />
<br />
Hamming distance. (2020, November 1). In ''Wikipedia''. https://en.wikipedia.org/wiki/Hamming_distance<br />
<br />
Jovanovic. (2015, February 2). ''How does Shazam work? Music Recognition Algorithms, Fingerprinting, and Processing''. Toptal Engineering Blog. https://www.toptal.com/algorithms/shazam-it-music-processing-fingerprinting-and-recognition<br />
<br />
Saadatpanah, P., Shafahi, A., &amp; Goldstein, T. (2019, June 17). ''Adversarial attacks on copyright detection systems''. Retrieved November 13, 2020, from https://arxiv.org/abs/1906.07153.<br />
<br />
Saviaga, C. and Toxtli, C. ''Deepiracy: Video piracy detection system by using longest common subsequence and deep learning'', 2018. https://medium.com/hciwvu/piracy-detection-using-longestcommon-subsequence-and-neuralnetworks-a6f689a541a6<br />
<br />
Wang, A. et al. ''An industrial strength audio search algorithm''. In Ismir, volume 2003, pp. 7–13. Washington, DC, 2003.</div>

Influenza Forecasting Framework based on Gaussian Processes (2020-11-16, Gtompkin)
<hr />
<div><br />
== Abstract ==<br />
<br />
This paper presents a novel framework for seasonal epidemic forecasting using Gaussian process regression. Resulting retrospective forecasts, trained on a subset of the publicly available CDC influenza-like-illness (ILI) data-set, outperformed four state-of-the-art models when compared using the official CDC scoring rule (log-score).<br />
<br />
== Background ==<br />
<br />
Each year, the seasonal influenza epidemic affects public health at a massive scale, resulting in 38 million cases, 400 000 hospitalizations, and 22 000 deaths in the United States in 2019/20 alone [1]. Given this, reliable forecasts of future influenza development are invaluable because they allow for improved public health policies and informed resource development and allocation. Many statistical methods have been developed that use data from the CDC and other real-time data sources, such as Google Trends, to forecast influenza activity.<br />
<br />
== Related Work ==<br />
<br />
Given the value of epidemic forecasts, the CDC regularly publishes ILI data and has funded a seasonal ILI forecasting challenge. This challenge has led to several state-of-the-art models in the field, including MSS, a physical susceptible-infected-recovered model with assumed linear noise [4]; SARIMA, a framework based on seasonal auto-regressive moving average models [2]; and LinEns, an ensemble of three linear regression models.<br />
<br />
== Motivation ==<br />
<br />
It has been shown that LinEns forecasts outperform the other frameworks on the ILI data-set. However, this framework assumes a deterministic relationship between the epidemic week and its case count, which does not reflect the stochastic nature of the trend. Therefore, it is natural to ask whether a similar framework that assumes a stochastic relationship between these variables would provide better performance. This motivated the development of the proposed Gaussian process regression framework, and the subsequent performance comparison to the benchmark models.<br />
<br />
== Gaussian Process Regression ==<br />
<br />
Consider the following set up: let <math>X = [\mathbf{x}_1,\ldots,\mathbf{x}_n]</math> <math>(d\times n)</math> be your training data, <math>\mathbf{y} = [y_1,y_2,\ldots,y_n]^T</math> be your noisy observations where <math>y_i = f(\mathbf{x}_i) + \epsilon_i</math>, <math>(\epsilon_i:i = 1,\ldots,n)</math> i.i.d. <math>\sim \mathcal{N}(0,{\sigma}^2)</math>, and <math>f</math> is the trend we are trying to model (by <math>\hat{f}</math>). Let <math>\mathbf{x}^*</math> <math>(d\times 1)</math> be your test data point, and <math>\hat{y} = \hat{f}(\mathbf{x}^*)</math> be your predicted outcome.<br />
<br />
<br />
Instead of assuming a deterministic form of <math>f</math>, and thus of <math>\mathbf{y}</math> and <math>\hat{y}</math> (as classical linear regression would, for example), Gaussian process regression assumes <math>f</math> is stochastic. More precisely, <math>\mathbf{y}</math> and <math>\hat{y}</math> are assumed to have a joint prior distribution. Indeed, we have <br />
<br />
$$<br />
(\mathbf{y},\hat{y}) \sim \mathcal{N}(0,\Sigma(X,\mathbf{x}^*))<br />
$$<br />
<br />
where <math>\Sigma(X,\mathbf{x}^*)</math> is a matrix of covariances dependent on some kernel function <math>k</math>. In this paper, the kernel function is assumed to be Gaussian and takes the form <br />
<br />
$$<br />
k(\mathbf{x}_i,\mathbf{x}_j) = \sigma^2\exp(-\frac{1}{2}(\mathbf{x}_i-\mathbf{x}_j)^T\Sigma(\mathbf{x}_i-\mathbf{x}_j)).<br />
$$<br />
<br />
It is important to note that this gives a joint prior distribution of '''functions''' ('''Fig. 1''' left, grey curves). <br />
<br />
By restricting this distribution to contain only those functions ('''Fig. 1''' right, grey curves) that agree with the observed data points <math>\mathbf{x}</math> ('''Fig. 1''' right, solid black) we obtain the posterior distribution for <math>\hat{y}</math> which has the form<br />
<br />
$$<br />
p(\hat{y} | \mathbf{x}^*, X, \mathbf{y}) \sim \mathcal{N}(\mu(\mathbf{x}^*,X,\mathbf{y}),\sigma(\mathbf{x}^*,X))<br />
$$<br />
<br />
<br />
<div style="text-align:center;"> [[File:GPRegression.png|500px]] </div><br />
<br />
<div align="center">'''Figure 1. Gaussian process regression''': Select the functions from your joint prior distribution (left, grey curves) with mean <math>0</math> (left, bold line) that agree with the observed data points (right, black bullets). These form your posterior distribution (right, grey curves) with mean <math>\mu(\mathbf{x})</math> (right, bold line). Red triangle helps compare the two images (location marker) [3]. </div><br />
<br />
== Data-set ==<br />
<br />
Let <math>d_j^i</math> denote the number of epidemic cases recorded in week <math>j</math> of season <math>i</math>, and let <math>j^*</math> and <math>i^*</math> denote the current week and season, respectively. The ILI data-set contains <math>d_j^i</math> for all previous weeks and seasons, up to the current season, with a 1-3 week publishing delay. Note that a season refers to the time of year when the epidemic is prevalent (e.g. an influenza season lasts 30 weeks, spanning the last 10 weeks of year <math>k</math> and the first 20 weeks of year <math>k+1</math>). The goal is to predict <math>\hat{y}_T = \hat{f}_T(x^*) = d^{i^*}_{j^* + T}</math> where <math>T \;(T = 1,\ldots,K)</math> is the target week (how many weeks into the future to predict).<br />
<br />
To do this, a design matrix <math>X</math> is constructed where each element <math>X_{ji} = d_j^i</math> corresponds to the number of cases in week (row) <math>j</math> of season (column) <math>i</math>. The training outcomes <math>y_{i,T}</math> correspond to the number of cases observed in target week <math>T \;(T = 1,\ldots,K)</math> of season <math>i \;(i = 1,\ldots,n)</math>.<br />
<br />
== Proposed Framework ==<br />
<br />
To compute <math>\hat{y}</math>, the following algorithm is executed. <br />
<br />
<ol><br />
<br />
<li> Let <math>J \subseteq \{j : j^*-4 \leq j \leq j^*\}</math> (a subset of the possible weeks).<br />
<br />
<li> Assemble the Training Set <math>\{X_J, \mathbf{y}_{T,J}\}</math> <br />
<br />
<li> Train the Gaussian process<br />
<br />
<li> Calculate the '''distribution''' of <math>\hat{y}_{T,J}</math> using <math>p(\hat{y}_{T,J} | \mathbf{x}^*, X_J, \mathbf{y}_{T,J}) \sim \mathcal{N}(\mu(\mathbf{x}^*,X,\mathbf{y}_{T,J}),\sigma(\mathbf{x}^*,X_J))</math><br />
<br />
<li> Set <math>\hat{y}_{T,J} =\mu(x^*,X_J,\mathbf{y}_{T,J})</math><br />
<br />
<li> Repeat steps 2-5 for all sets of weeks <math>J</math><br />
<br />
<li> Determine the best 3 performing sets J (on the 2010/11 and 2011/12 validation sets)<br />
<br />
<li> Calculate the ensemble forecast by averaging the 3 best performing predictive distribution densities, i.e. <math>\hat{y}_T = \frac{1}{3}\sum_{k=1}^3 \hat{y}_{T,J_{best,k}}</math><br />
<br />
</ol><br />
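Steps 6-8 reduce to ranking the candidate week-subsets by validation performance and averaging the top three forecasts. A plain-Python sketch with hypothetical validation log-scores and point forecasts (all values are made up for illustration):

```python
def ensemble_forecast(val_scores, forecasts, k=3):
    """Average the forecasts of the k week-subsets with the best
    validation scores (a higher log-score is better)."""
    best = sorted(val_scores, key=val_scores.get, reverse=True)[:k]
    return sum(forecasts[J] for J in best) / k

# Hypothetical validation log-scores and point forecasts per subset J.
val_scores = {"J1": -1.2, "J2": -0.8, "J3": -2.0, "J4": -0.9, "J5": -1.5}
forecasts  = {"J1": 110.0, "J2": 120.0, "J3": 90.0, "J4": 118.0, "J5": 100.0}
```

Here the three best subsets are J2, J4, and J1, so the ensemble forecast is their mean.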
<br />
== Results ==<br />
<br />
To demonstrate the accuracy of their results, retrospective forecasting was done on the ILI data-set. In other words, the Gaussian process model was trained assuming a previous season (2012/13) was the current season. In this fashion, the forecast could be compared to the already observed true outcome. <br />
<br />
To produce a forecast for the entire 2012/13 season, 30 Gaussian processes were trained (each influenza season has 30 test points <math>\mathbf{x}^*</math>) and a curve connecting the predicted outputs <math>y_T = \hat{f}(\mathbf{x}^*)</math> was plotted ('''Fig.2''', blue line). As shown in '''Fig.2''', this forecast (blue line) was reliable for both 1 (left) and 3 (right) week targets, given that the 95% prediction interval ('''Fig.2''', purple shaded) contained the true values ('''Fig.2''', red x's) 95% of the time.<br />
<br />
<div style="text-align:center;"> [[File:ResultsOne.png|600px]] </div><br />
<br />
<div align="center">'''Figure 2. Retrospective forecasts and their uncertainty''': One week retrospective influenza forecasting for two targets (T = 1, 3). Red x’s are the true observed values, and blue lines and purple shaded areas represent point forecasts and 95% prediction intervals, respectively. </div><br />
<br />
<br />
Moreover, as shown in '''Fig.3''', the novel Gaussian process regression framework outperformed all state-of-the-art models, including LinEns, for four different targets <math>(T = 1,\ldots, 4)</math>, when compared using the official CDC scoring criterion, the ''log-score''. The log-score is the logarithmic probability that the forecast falls within an interval around the true value. <br />
<br />
<div style="text-align:center;"> [[File:ComparisonNew.png|600px]] </div><br />
<br />
<div align="center">'''Figure 3. Average log-score gain of proposed framework''': Each bar shows the mean seasonal log-score gain of the proposed framework vs. the given state-of-the-art model, and each panel corresponds to a different target week <math> T = 1,...4 </math>. </div><br />
<br />
== Conclusion ==<br />
<br />
This paper presented a novel framework for forecasting seasonal epidemics using Gaussian process regression that outperformed multiple state-of-the-art forecasting methods on the CDC's ILI data-set. Hence, this work may play a key role in future influenza forecasting and, as a result, in the improvement of public health policies and resource allocation.<br />
<br />
== Critique ==<br />
<br />
The proposed framework provides a computationally efficient method to forecast any seasonal epidemic count data and is easily extendable to multiple target types. In particular, one can compute key parameters such as the peak infection incidence (<math>\hat{y} = \max_{0 \leq j \leq 52} d^i_j </math>), the timing of the peak infection incidence (<math>\hat{y} = \arg\max_{0 \leq j \leq 52} d^i_j</math>), and the final epidemic size of a season (<math>\hat{y} = \sum_{j=1}^{52} d^i_j</math>). However, since it is not a physical model, it cannot provide insights into parameters describing the disease spread. Moreover, the framework requires training data and hence is not applicable to non-seasonal epidemics.<br />
<br />
== References ==<br />
<br />
[1] Estimated Influenza Illnesses, Medical visits, Hospitalizations, and Deaths in the United States - 2019–2020 Influenza Season. (2020). Retrieved November 16, 2020, from https://www.cdc.gov/flu/about/burden/2019-2020.html<br />
<br />
[2] Ray, E. L., Sakrejda, K., Lauer, S. A., Johansson, M. A., and Reich, N. G. (2017). Infectious disease prediction with kernel conditional density estimation. ''Statistics in Medicine'', 36(30):4908–4929.<br />
<br />
[3] Schulz, E., Speekenbrink, M., and Krause, A. (2017). A tutorial on Gaussian process regression with a focus on exploration-exploitation scenarios. ''bioRxiv''.<br />
<br />
[4] Zimmer, C., Leuba, S. I., Cohen, T., and Yaesoubi, R. (2019). Accurate quantification of uncertainty in epidemic parameter estimates and predictions using stochastic compartmental models. ''Statistical Methods in Medical Research'', 28(12):3591–3608. PMID: 30428780.</div>

Neural ODEs (2020-11-15, Gtompkin)
<hr />
<div>== Introduction ==<br />
Chen et al. propose a new class of neural networks called neural ordinary differential equations (ODEs) in their 2018 paper of the same title. Neural network models, such as residual or recurrent networks, can be generalized as a set of transformations through hidden states (i.e., layers) <math>\mathbf{h}</math>, given by the equation <br />
<br />
<div style="text-align:center;"><math> \mathbf{h}_{t+1} = \mathbf{h}_t + f(\mathbf{h}_t,\theta_t) </math> (1) </div><br />
<br />
where <math>t \in \{0,...,T\}</math> and <math>\theta_t</math> corresponds to the set of parameters or weights in state <math>t</math>. It is important to note that it has been shown (Lu et al., 2017; Haber and Ruthotto, 2017; Ruthotto and Haber, 2018) that the transformation given in Equation 1 can be viewed as an Euler discretization. Given this Euler description, if the number of layers is taken to infinity and the step size between layers to zero, then Equation 1 can instead be described continuously in the form of the ODE, <br />
<br />
<div style="text-align:center;"><math> \frac{d\mathbf{h}(t)}{dt} = f(\mathbf{h}(t),t,\theta) </math> (2). </div><br />
<br />
Equation 2 now describes a network where the output layer <math>\mathbf{h}(T)</math> is generated by solving for the ODE at time <math>T</math>, given the initial value at <math>t=0</math>, where <math>\mathbf{h}(0)</math> is the input layer of the network. <br />
<br />
With a vast amount of theory and research in the field of solving ODEs numerically, there are a number of benefits to formulating the hidden state dynamics this way. One major advantage is that a continuous description of the network allows for the evaluation of <math>f</math> at arbitrary times and locations. The authors provide an example in section 5 of how the continuous neural ODE network outperforms its discretized counterpart, i.e. residual networks. A depiction of this distinction is shown in the figure below. <br />
<br />
<div style="text-align:center;"> [[File:NeuralODEs_Fig1.png|350px]] </div><br />
<br />
The next section on automatic differentiation will describe how utilizing ODE solvers allows for the calculation of gradients of the loss function in an efficient manner. In section four the authors show that the single-unit bottleneck of normalizing flows can be overcome by constructing a new class of density models that incorporates the neural ODE network formulation.<br />
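The relationship between Equations 1 and 2 can be illustrated numerically: iterating the residual-style update with many small steps approaches the exact ODE solution. A plain-Python toy sketch with linear dynamics <math>f(h,t) = -h</math>, whose exact solution at <math>T=1</math> is <math>h(0)e^{-1}</math>:

```python
import math

def euler_network(h0, f, T=1.0, layers=10000):
    """Iterate the residual-style update h <- h + dt * f(h, t) (Equation 1);
    as the number of 'layers' grows this approaches the solution of the
    ODE dh/dt = f(h, t) (Equation 2) at time T."""
    dt = T / layers
    h, t = h0, 0.0
    for _ in range(layers):
        h = h + dt * f(h, t)
        t += dt
    return h

f = lambda h, t: -h   # toy dynamics with exact solution h(T) = h0 * exp(-T)
```

With 10 000 "layers" the iterated update matches <math>e^{-1}</math> to roughly four decimal places, illustrating why residual networks can be read as a coarse Euler discretization of a continuous model.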
<br />
== Reverse-mode Automatic Differentiation of ODE Solutions ==<br />
Like most neural networks, optimizing the weight parameters <math>\theta</math> for a neural ODE network involves finding the gradient of a loss function with respect to those parameters. Naively differentiating through the forward operations of the ODE solver is possible, but it is computationally expensive and memory-intensive. Instead, the authors suggest that the gradients be calculated in reverse mode with the adjoint sensitivity method (Pontryagin et al., 1962). This "backpropagation" method solves an augmented version of the forward ODE problem in reverse, something that all ODE solvers are capable of. Section 3 provides results showing that this method provides very desirable memory costs and numerical stability. <br />
<br />
The authors provide an example of the adjoint method by considering the minimization of the scalar valued loss function <math>L</math>, that takes the solution of the ODE solver as its argument.<br />
<br />
<div style="text-align:center;">[[File:NeuralODEs_Eq1.png|700px]],</div> <br />
This minimization problem requires the calculation of <math>\frac{\partial L}{\partial \mathbf{z}(t_0)}</math> and <math>\frac{\partial L}{\partial \theta}</math>.<br />
<br />
The adjoint itself is defined as <math>\mathbf{a}(t) = \frac{\partial L}{\partial \mathbf{z}(t)}</math>, which describes the gradient of the loss with respect to the hidden state <math>\mathbf{z}(t)</math>. By taking the first derivative of the adjoint, another ODE arises in the form of,<br />
<br />
<div style="text-align:center;"><math>\frac{d \mathbf{a}(t)}{dt} = -\mathbf{a}(t)^T \frac{\partial f(\mathbf{z}(t),t,\theta)}{\partial \mathbf{z}}</math> (3).</div> <br />
<br />
Since the value <math>\mathbf{a}(t_0)</math> is required to minimize the loss, the ODE in equation 3 must be solved backwards in time from <math>\mathbf{a}(t_1)</math>. Solving this problem is dependent on the knowledge of <math>\mathbf{z}(t)</math>. Luckily, both <math>\mathbf{a}</math> and <math>\mathbf{z}</math> can be calculated at the same time by setting up an augmented version of the dynamics, as seen in Algorithm 1. Finally, the derivative <math>dL/d\theta</math> can be expressed in terms of the adjoint and the hidden state as, <br />
<br />
<div style="text-align:center;"><math> \frac{dL}{d\theta} = -\int_{t_1}^{t_0} \mathbf{a}(t)^T\frac{\partial f(\mathbf{z}(t),t,\theta)}{\partial \theta}dt</math> (4).</div><br />
<br />
To obtain calculations of the derivatives in equation 3 and 4, automatic differentiation is used to incur a very small time cost. The authors provide an algorithm to calculate the gradients of <math>L</math> and their dependent quantities with only one call to an ODE solver. <br />
<br />
<div style="text-align:center;">[[File:NeuralODEs Algorithm1.png|850px]]</div><br />
<br />
If the loss function has a dependence on intermediate times <math>t_i, i \in (0,N)</math> then Algorithm 1 can be modified to handle multiple calls to the ODESolve step since most ODE solvers have the capability to provide <math>z(t)</math> for different times. A visual interpretation of this scenario is shown below. <br />
<br />
<div style="text-align:center;">[[File:NeuralODES Fig2.png|350px]]</div><br />
<br />
Please see the [https://arxiv.org/pdf/1806.07366.pdf#page=13 appendix] for extended versions of Algorithm 1 and detailed derivations of each equation in this section.<br />
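The adjoint method can be checked on a toy problem where the answer is known in closed form: for <math>dz/dt = \theta z</math> with loss <math>L = z(t_1)^2</math>, the analytic gradient is <math>dL/d\theta = 2(t_1-t_0)z_0^2e^{2\theta(t_1-t_0)}</math>. A plain-Python Euler sketch of the backward augmented dynamics, simplified to a scalar state (step counts and Euler integration are illustrative choices, not the paper's solver):

```python
import math

def adjoint_grad(z0, theta, t0=0.0, t1=1.0, steps=20000):
    """Gradient dL/dtheta for dz/dt = theta * z with loss L = z(t1)^2,
    computed by integrating the adjoint a(t) = dL/dz(t) backwards in time
    alongside z(t) while accumulating the integral in Eq. 4."""
    dt = (t1 - t0) / steps
    # Forward pass: Euler-integrate z from t0 to t1.
    z = z0
    for _ in range(steps):
        z = z + dt * theta * z
    # Backward pass: a(t1) = dL/dz(t1) = 2 z(t1).
    a, grad = 2.0 * z, 0.0
    for _ in range(steps):
        grad += dt * a * z          # a(t) * df/dtheta, with df/dtheta = z
        a += dt * a * theta         # rewind a: da/dt = -a * df/dz = -a*theta
        z -= dt * theta * z         # rewind z alongside the adjoint
    return grad
```

For <math>z_0=1</math>, <math>\theta=0.5</math> the numerical gradient agrees with the closed form <math>2e</math> to well under one percent, without ever storing the full forward trajectory.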
<br />
== Replacing Residual Networks with ODEs for Supervised Learning ==<br />
Section three of the paper investigates an application of the reverse-mode differentiation described in section two to the training of neural ODE networks on the MNIST digit data set. To solve the forward pass in the neural ODE network, the experiment uses the Adams method, an implicit ODE solver. Although implicit solvers offer improved numerical stability over explicit ones, the problem of backpropagating through the solver remains, so the adjoint sensitivity method must be used for efficient weight optimization. The network using this "backpropagation" technique is referred to as ODE-Net in this section. <br />
<br />
=== Implementation ===<br />
A residual network (ResNet), studied by He et al. (2016), with six standard residual blocks was used as a comparative model for this experiment. The competing model, ODE-net, replaces the residual blocks of the ResNet with the Adams ODESolve. As a hybrid of the two models, ResNet and ODE-net, a third network was created called RK-Net, which solves the weight optimization of the neural ODE network explicitly through backwards Runge-Kutta integration. The following table shows the training and performance results of each network. <br />
<br />
<div style="text-align:center;">[[File:NeuralODEs Table1.png|400px]]</div><br />
<br />
Note that <math>L</math> is the number of layers in ResNet, while <math>\tilde{L}</math> is the number of function calls the Adams method makes in the two ODE networks; the two are effectively analogous quantities. As shown in Table 1, both ODE networks achieve performance comparable to the ResNet, with a notable decrease in memory cost for ODE-Net.<br />
<br />
<br />
Another interesting feature of ODE networks is the ability to control the tolerance of the ODE solver, and subsequently the numerical error in the solution. <br />
<br />
<div style="text-align:center;">[[File:NeuralODEs Fig3.png|700px]]</div><br />
<br />
The tolerance of the ODE solver is represented by the colour bar in Figure 3 above. A variety of effects arise from adjusting this parameter. Primarily, if the tolerance is treated as a hyperparameter of sorts, it can be tuned to balance accuracy (Figure 3a) against computational cost (Figure 3b). Figure 3c also provides further evidence for the benefits of the adjoint method for the backward pass in ODE-Nets, since there is nearly a 1:0.5 ratio of forward to backward function calls; in the ResNet and RK-Net examples this ratio is 1:1.<br />
<br />
== Continuous Normalizing Flows ==<br />
<br />
Section four tackles the implementation of continuous-depth neural networks. To do so, the authors first discuss how to establish this kind of network through the use of normalizing flows. They use the change of variables method presented in other works (Rezende and Mohamed, 2015; Dinh et al., 2014) to compute the change in a probability distribution when sample points are transformed through a bijective function <math>f</math>.<br />
<br />
<div style="text-align:center;"><math>z_1=f(z_0) \Rightarrow log(p(z_1))=log(p(z_0))-log|det\frac{\partial f}{\partial z_0}|</math></div><br />
<br />
where <math>p(z)</math> is the probability distribution of the samples and <math>det\frac{\partial f}{\partial z_0}</math> is the determinant of the Jacobian, which has a cubic cost in the dimension of '''z''', the number of hidden units in the network. The authors discovered, however, that replacing the discrete set of hidden layers in the normalizing flow network with a continuous transformation simplifies the computation significantly, due primarily to the following theorem:<br />
<br />
'''''Theorem 1:''' (Instantaneous Change of Variables). Let z(t) be a finite continuous random variable with probability p(z(t)) dependent on time. Let dz/dt=f(z(t),t) be a differential equation describing a continuous-in-time transformation of z(t). Assuming that f is uniformly Lipschitz continuous in z and continuous in t, then the change in log probability also follows a differential equation:''<br />
<br />
<div style="text-align:center;"><math>\frac{\partial log(p(z(t)))}{\partial t}=-tr\left(\frac{df}{dz(t)}\right)</math></div><br />
<br />
The biggest advantage of this theorem is that the trace is a linear operator, so if the dynamics <math>f</math> are represented by a sum of functions, then so is the change in log density. This means flow models can now be computed with a cost that is only linear in the number of hidden units <math>M</math>. In standard normalizing flow models the cost is <math>O(M^3)</math>, so they generally stack many layers with a single hidden unit in each layer.<br />
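Theorem 1 can be verified numerically in one dimension, where the "trace" is just <math>f'(z)</math>: integrate <math>z</math> and <math>\log p</math> jointly, then compare against the discrete change-of-variables formula, with the Jacobian <math>dz(T)/dz_0</math> estimated by a finite difference (the dynamics <math>f(z)=\tanh(z)</math> and the starting point are arbitrary illustrative choices):

```python
import math

def integrate(z0, T=1.0, steps=20000):
    """Jointly Euler-integrate dz/dt = tanh(z) and, per Theorem 1,
    dlogp/dt = -f'(z), returning z(T) and the accumulated logp change."""
    dt = T / steps
    z, dlogp = z0, 0.0
    for _ in range(steps):
        fprime = 1.0 - math.tanh(z) ** 2   # derivative of tanh
        dlogp += dt * (-fprime)
        z += dt * math.tanh(z)
    return z, dlogp

# Discrete change of variables: log p(z(T)) - log p(z0) = -log|dz(T)/dz0|,
# with the Jacobian estimated by a central finite difference.
eps = 1e-5
z_plus, _ = integrate(0.5 + eps)
z_minus, _ = integrate(0.5 - eps)
jac = (z_plus - z_minus) / (2 * eps)
_, dlogp = integrate(0.5)
```

The accumulated <math>-\int f'(z(t))\,dt</math> matches <math>-\log|dz(T)/dz_0|</math>, which is exactly the equivalence the theorem asserts, computed here without any determinant.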
<br />
Finally, the authors use these realizations to construct continuous normalizing flow networks (CNFs) by specifying the parameters of the flow as a function of <math>t</math>, i.e. <math>f(z(t),t)</math>. They also use a gating mechanism for each hidden unit, <math>\frac{dz}{dt}=\sum_n \sigma_n(t)f_n(z)</math>, where <math>\sigma_n(t)\in (0,1)</math> is a separate neural network which learns when to apply each dynamic <math>f_n</math>.<br />
<br />
===Implementation===<br />
<br />
The authors construct two separate types of neural networks to compare against each other, the first is the standard planar Normalizing Flow network (NF) using 64 layers of single hidden units, and the second is their new CNF with 64 hidden units. The NF model is trained over 500,000 iterations using RMSprop, and the CNF network is trained over 10,000 iterations using Adam. The loss function is <math>KL(q(x)||p(x))</math> where <math>q(x)</math> is the flow model and <math>p(x)</math> is the target probability density.<br />
<br />
One of the biggest advantages of CNF is that the flow parameters can be trained simply by performing maximum likelihood estimation on <math>log(q(x))</math> given <math>p(x)</math>, where <math>q(x)</math> is found via the theorem above, and then reversing the CNF to generate random samples from <math>q(x)</math>. Reversing the CNF costs about the same as the forward pass, which is not possible in an NF network. The following two figures demonstrate the ability of CNF to generate more expressive and accurate output than standard NF networks.<br />
<br />
<div style="text-align:center;"><br />
[[Image:CNFcomparisons.png]]<br />
<br />
[[Image:CNFtransitions.png]]<br />
</div><br />
<br />
Figure 4 clearly shows that the CNF structure achieves a significantly lower loss than NF. In Figure 5, both networks were tasked with transforming a standard Gaussian distribution into a target distribution; not only was the CNF network more accurate on the two-moons target, but the steps it took along the way are also much more intuitive than the output from NF.<br />
<br />
== A Generative Latent Function Time-Series Model ==<br />
<br />
A major issue for neural ODE networks is that in many instances data points are either very sparsely distributed or irregularly sampled. An example of this is medical records, which are only updated when a patient visits a doctor or the hospital. To solve this issue, the authors create a generative time-series model able to fill in the gaps of missing data. Each time series is considered a latent trajectory stemming from an initial local state <math>z_{t_0}</math> and determined by a global set of latent parameters. Given a set of observation times and an initial state, the generative model constructs points via the following sampling procedure:<br />
<br />
<div style="text-align:center;"><br />
<math><br />
z_{t_0}∼p(z_{t_0}) <br />
</math><br />
</div> <br />
<br />
<div style="text-align:center;"><br />
<math><br />
z_{t_1},z_{t_2},\dots,z_{t_N}=ODESolve(z_{t_0},f,θ_f,t_0,...,t_N)<br />
</math><br />
</div><br />
<br />
<div style="text-align:center;"><br />
each <br />
<math><br />
x_{t_i}∼p(x│z_{t_i},θ_x)<br />
</math><br />
</div><br />
<br />
<math>f</math> is a function, parameterized by a neural net, which outputs the gradient <math>\frac{\partial z(t)}{\partial t}=f(z(t),θ_f)</math>. To train this latent variable model, the authors first encode the given data and observation times using an RNN encoder, construct the new points using the trained parameters, and then decode the points back into the original space. The following figure describes this process:<br />
<br />
<div style="text-align:center;"><br />
[[Image:EncodingFigure.png]]<br />
</div><br />
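The sampling procedure above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it substitutes a fixed-step Euler integrator for the adaptive ODE solver, a linear map for the neural network <math>f</math>, and assumes a Gaussian decoder for <math>p(x \mid z_{t_i}, \theta_x)</math>.<br />

```python
import numpy as np

rng = np.random.default_rng(0)

def f(z, theta_f):
    # Latent dynamics dz/dt = f(z, theta_f); a linear map stands in
    # for the neural network used in the paper.
    return theta_f @ z

def ode_solve_euler(z0, theta_f, ts, dt=1e-3):
    # Fixed-step Euler integration, returning z at each observation
    # time after the initial one.
    z, t, out = z0.copy(), ts[0], []
    for t_next in ts[1:]:
        while t < t_next:
            h = min(dt, t_next - t)
            z = z + h * f(z, theta_f)
            t += h
        out.append(z.copy())
    return out

# Generative procedure: sample z_{t_0} ~ p(z_{t_0}), roll the latent
# trajectory forward with the ODE solver, then sample each observation
# x_{t_i} ~ p(x | z_{t_i}, theta_x) from a Gaussian decoder.
theta_f = np.array([[-1.0, 0.0], [0.0, -0.5]])   # made-up dynamics
ts = [0.0, 0.5, 1.0]                             # observation times
z0 = rng.standard_normal(2)                      # z_{t_0} ~ N(0, I)
zs = ode_solve_euler(z0, theta_f, ts)
xs = [z + 0.1 * rng.standard_normal(2) for z in zs]   # noisy observations
```

For this linear <math>f</math> the Euler trajectory closely tracks the exact solution <math>z(t) = e^{\theta_f t} z_0</math>, which gives a quick way to sanity-check the integrator.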
<br />
Another variable which could affect the latent state of a time-series model is how often an event actually occurs. The authors solved this by parameterizing the rate of events in terms of a Poisson process. They described the set of independent observation times in an interval <math>\left[t_{start},t_{end}\right]</math> as:<br />
<br />
<div style="text-align:center;"> <br />
<math><br />
\log p(t_1,t_2,\dots,t_N) = \sum_{i=1}^N \log\lambda(z(t_i)) - \int_{t_{start}}^{t_{end}} \lambda(z(t))\,dt<br />
</math><br />
</div><br />
<br />
where <math>\lambda(\cdot)</math> is parameterized by another neural network.<br />
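This log-likelihood is straightforward to evaluate numerically. The sketch below assumes a given rate function <math>\lambda</math> and approximates the integral with the trapezoidal rule; the helper name is hypothetical, not from the paper.<br />

```python
import numpy as np

def poisson_log_likelihood(event_times, lam, t_start, t_end, n_grid=1001):
    """Log-likelihood of an inhomogeneous Poisson process:
    sum_i log lam(t_i) minus the integral of lam over [t_start, t_end],
    with the integral approximated by the trapezoidal rule."""
    log_term = sum(np.log(lam(t)) for t in event_times)
    grid = np.linspace(t_start, t_end, n_grid)
    vals = np.array([lam(t) for t in grid])
    integral = float(np.sum((vals[1:] + vals[:-1]) * np.diff(grid)) / 2.0)
    return log_term - integral

# Sanity check: for a constant rate lam(t) = c the expression reduces
# to N*log(c) - c*(t_end - t_start).
ll = poisson_log_likelihood([0.2, 0.5, 0.9], lambda t: 2.0, 0.0, 1.0)
```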
<br />
===Implementation===<br />
<br />
To test the effectiveness of the latent time-series ODE model (LODE), the authors fit the encoder with 25 hidden units, parameterize the function <math>f</math> with a one-layer, 20-hidden-unit network, and use another neural network with 20 hidden units as the decoder. They compare this against a standard recurrent neural network (RNN) with 25 hidden units trained to minimize Gaussian log-likelihood. Both systems were tested on a dataset of 2-dimensional spirals which rotated either clockwise or counter-clockwise, with the position of each spiral sampled at 100 equally spaced time steps. Irregularly timed data was then simulated by sampling random subsets of points without replacement from each spiral. The next two figures show the outcome of these experiments:<br />
<br />
<div style="text-align:center;"><br />
[[Image:LODEtestresults.png]] [[Image:SpiralFigure.png|The blue lines represent the test data learned curves and the red lines represent the extrapolated curves predicted by each model]]<br />
</div><br />
<br />
In the figure on the right, the blue lines represent the learned curves on the test data and the red lines represent the extrapolated curves predicted by each model. The LODE performs significantly better than the standard RNN model, especially on smaller sets of data points.<br />
<br />
== Scope and Limitations ==<br />
<br />
Section 6 mainly discusses the scope and limitations of the paper. Firstly, while “batching” the training data is a useful step in standard neural networks, and can still be applied here by combining the ODEs associated with each batch, the authors found that controlling the error in this case may increase the number of calculations required. In practice, however, the number of calculations did not increase significantly.<br />
<br />
As long as the model proposed in this paper uses finite weights and Lipschitz nonlinearities, Picard’s existence and uniqueness theorem (Coddington and Levinson, 1955) applies, guaranteeing that the solution to the IVP exists and is unique.<br />
<br />
In controlling the amount of error in the model, the authors were only able to reduce tolerances to approximately <math>10^{-3}</math> and <math>10^{-5}</math> in classification and density estimation respectively without also degrading computational performance.<br />
<br />
The authors note that reconstructing state trajectories by running the dynamics backwards can introduce extra numerical error. They propose a possible solution: checkpoint certain time steps by storing intermediate values of <math>z</math> on the forward pass, then reconstruct each segment between checkpoints independently. The authors acknowledge that they only informally checked the validity of this method, since they do not consider it a practical problem.<br />
<br />
== Conclusions and Critiques ==<br />
<br />
<br />
<br />
== Link to Appendices of Paper == <br />
https://arxiv.org/pdf/1806.07366.pdf<br />
<br />
== References ==</div>
<hr />
<div>== Authors == <br />
Yuxin Wang, Evan Peters, Yifan Mou, Sangeeth Kalaichanthiran <br />
<br />
== Introduction ==<br />
In recent times, there have been many studies of methods for using AI to solve large-scale, zero-sum, extensive form problems. However, most of these works operate under the assumption that the parameters of the game are known, and the objective is simply to find the optimal strategy for the game. This is unrealistic in real-world scenarios, where one must often find the optimal strategy while the parameters of the game are unknown. This paper proposes a framework for finding an optimal solution using a primal-dual Newton method, and then using back-propagation to analytically compute the gradients of all the relevant game parameters.<br />
<br />
The approach to solving this problem is to consider ''quantal response equilibrium'' (QRE), which is a generalization of Nash equilibrium (NE) where the agents can make suboptimal decisions. It is shown that the solution to the QRE is a differentiable function of the payoff matrix. Consequently, back-propagation can be used to analytically solve for the payoff matrix (or other game parameters). This strategy has many future application areas as it allows for game-solving (both extensive and normal form) to be integrated as a module in a deep neural network.<br />
<br />
[[File:Framework.png ]]<br />
<br />
The effectiveness of this model is demonstrated using the games “Rock, Paper, Scissors”, one-card poker, and a security defence game.<br />
<br />
== Learning and Quantal Response in Normal Form Games ==<br />
<br />
The game-solving module provides all elements required for differentiable learning: it maps contextual features to payoff matrices and computes equilibrium strategies under a set of contextual features. This paper focuses on learning zero-sum games and starts with normal form games, since their game solver and learning approach capture much of the intuition and basic methodology.<br />
<br />
=== Zero-Sum Normal Form Games ===<br />
<br />
In two-player zero-sum games there is a '''payoff matrix''' <math>P</math> that describes the rewards for two players employing specific strategies u and v respectively. The optimal strategy mixture may be found with a classic min-max formulation:<br />
$$\min_u \max_v \ u^T P v \\ \text{subject to } 1^T u =1, \ u \ge 0, \\ 1^T v =1, \ v \ge 0.$$<br />
<br />
Here, we consider the case where <math>P</math> is not known a priori. The solution <math> (u^*, v_0) </math> to this optimization and the solution <math> (u_0,v^*) </math> to the corresponding problem with inverse player order form the Nash equilibrium. When the payoff matrix P is not known, we observe samples of actions <math> a^{(i)}, i =1,...,N </math> from one or both players, sampled from the equilibrium strategies <math>(u^*,v^*) </math>, to recover the true underlying payoff matrix P or a function form P(x) depending on the current context.<br />
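As a concrete check of this formulation, the snippet below verifies the saddle-point property of a candidate equilibrium for standard Rock-Paper-Scissors (an illustrative sketch; the uniform mixture is the known Nash equilibrium of this symmetric game):<br />

```python
import numpy as np

# Payoff matrix of standard Rock-Paper-Scissors; the row player
# minimizes u^T P v in the formulation above.
P = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]], dtype=float)

u_star = np.full(3, 1 / 3)   # candidate equilibrium strategies
v_star = np.full(3, 1 / 3)
value = u_star @ P @ v_star

def is_saddle_point(P, u, v, tol=1e-9):
    """Check the defining property of the min-max solution: no pure
    deviation improves the payoff for either player."""
    val = u @ P @ v
    row_ok = np.all(P @ v >= val - tol)    # minimizer cannot do better
    col_ok = np.all(P.T @ u <= val + tol)  # maximizer cannot do better
    return row_ok and col_ok
```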
<br />
=== Quantal Response Equilibria ===<br />
<br />
However, NE is poorly suited here because NEs are overly strict and discontinuous with respect to P, and NEs may not be unique. Thus, it is more common to model the players' actions with the '''quantal response equilibria''' (QRE) to address these issues. Specifically, consider the ''logit'' equilibrium for zero-sum games, which obeys the fixed point:<br />
$$<br />
u^* _i = \frac {\exp(-(Pv)_i)}{\sum_{q \in [n]} \exp (-(Pv)_q)}, \ v^* _j= \frac {\exp((P^T u)_j)}{\sum_{q \in [m]} \exp ((P^T u)_q)} .\qquad \ (1)<br />
$$<br />
For a fixed opponent strategy, the regularized objective is strictly convex (respectively concave), and thus the regularized best response is unique.<br />
<br />
=== End-to-End Learning ===<br />
<br />
To integrate a zero-sum solver into end-to-end learning, [1] introduces a method to solve the QRE and to differentiate through its solution.<br />
<br />
'''QRE solver''':<br />
Finding the fixed point in (1) is equivalent to solving the regularized min-max game:<br />
$$<br />
\min_{u \in \mathbb{R}^n} \max_{v \in \mathbb{R}^m} \ u^T P v -H(v) + H(u) \\<br />
\text{subject to } 1^T u =1, \ 1^T v =1, <br />
$$<br />
where <math>H(y) = \sum_i y_i \log y_i</math> is the negative Gibbs entropy.<br />
Entropy regularization guarantees the non-negative condition and makes the equilibrium continuous with respect to P, which means players are encouraged to play more randomly, and all actions have non-zero probability. Moreover, this problem has a unique saddle point corresponding to <math> (u^*, v^*) </math>.<br />
<br />
Using a primal-dual Newton Method to solve the QRE for two-player zero-sum games, the KKT conditions for the problem are:<br />
$$ <br />
Pv + \log(u) + 1 +\mu 1 = 0 \\<br />
P^T u -\log(v) -1 +\nu 1 = 0 \\<br />
1^T u = 1, \ 1^T v = 1, <br />
$$<br />
where <math> (\mu, \nu) </math> are Lagrange multipliers for the equality constraints on u and v respectively. Applying Newton's method gives the update rule:<br />
$$<br />
Q \begin{bmatrix} \Delta u \\ \Delta v \\ \Delta \mu \\ \Delta \nu \\ \end{bmatrix} = - \begin{bmatrix} P v + \log u + 1 + \mu 1 \\ P^T u - \log v - 1 + \nu 1 \\ 1^T u - 1 \\ 1^T v - 1 \\ \end{bmatrix}, \qquad (2)<br />
$$<br />
where Q is the Hessian of the Lagrangian, given by <br />
$$ <br />
Q = \begin{bmatrix} <br />
diag(\frac{1}{u}) & P & 1 & 0 \\ <br />
P^T & -diag(\frac{1}{v}) & 0 & 1\\<br />
1^T & 0 & 0 & 0 \\<br />
0 & 1^T & 0 & 0 \\<br />
\end{bmatrix}. <br />
$$<br />
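This Newton iteration can be sketched compactly. The following is an illustrative implementation, not the authors' code; it assumes a dense solve of the system in (2) and a simple step-halving rule to keep <math>u</math> and <math>v</math> strictly positive.<br />

```python
import numpy as np

def solve_qre(P, iters=50):
    """Primal-dual Newton method for the logit QRE of a zero-sum
    game, following the KKT system and update rule (2) above."""
    n, m = P.shape
    u, v = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    mu, nu = 0.0, 0.0
    for _ in range(iters):
        # KKT residuals (right-hand side of (2), before negation).
        r = np.concatenate([
            P @ v + np.log(u) + 1 + mu,
            P.T @ u - np.log(v) - 1 + nu,
            [u.sum() - 1.0],
            [v.sum() - 1.0],
        ])
        # Hessian of the Lagrangian, Q, assembled blockwise.
        Q = np.zeros((n + m + 2, n + m + 2))
        Q[:n, :n] = np.diag(1.0 / u)
        Q[:n, n:n + m] = P
        Q[:n, -2] = 1.0
        Q[n:n + m, :n] = P.T
        Q[n:n + m, n:n + m] = -np.diag(1.0 / v)
        Q[n:n + m, -1] = 1.0
        Q[-2, :n] = 1.0
        Q[-1, n:n + m] = 1.0
        d = np.linalg.solve(Q, -r)
        # Damped update keeping u and v strictly positive.
        step = 1.0
        while np.any(u + step * d[:n] <= 0) or np.any(v + step * d[n:n + m] <= 0):
            step *= 0.5
        u = u + step * d[:n]
        v = v + step * d[n:n + m]
        mu = mu + step * d[-2]
        nu = nu + step * d[-1]
    return u, v

# For the symmetric RPS matrix the logit QRE is the uniform mixture.
P_rps = np.array([[0.0, -1.0, 1.0],
                  [1.0, 0.0, -1.0],
                  [-1.0, 1.0, 0.0]])
u_qre, v_qre = solve_qre(P_rps)
```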
<br />
'''Differentiating Through QRE Solutions''':<br />
The QRE solver provides a method to compute the necessary Jacobian-vector products. Specifically, we compute the gradient of the loss given the solution <math> (u^*,v^*) </math> to the QRE, and some loss function <math> L(u^*,v^*) </math>: <br />
<br />
1. Take differentials of the KKT conditions: <br />
<math><br />
Q \begin{bmatrix} <br />
du & dv & d\mu & d\nu \\ <br />
\end{bmatrix} ^T = \begin{bmatrix} <br />
-dPv & -dP^Tu & 0 & 0 \\ <br />
\end{bmatrix}^T. \ <br />
</math><br />
<br />
2. For small changes du, dv, <br />
<math><br />
dL = \begin{bmatrix} <br />
v^TdP^T & u^TdP & 0 & 0 \\ <br />
\end{bmatrix} Q^{-1} \begin{bmatrix} <br />
-\nabla_u L & -\nabla_v L & 0 & 0 \\ <br />
\end{bmatrix}^T.<br />
</math><br />
<br />
3. Apply this to P, and take limits as dP is small:<br />
<math><br />
\nabla_P L = y_u v^T + u y_v^T, \qquad (3)<br />
</math> where <br />
<math><br />
\begin{bmatrix} <br />
y_u & y_v & y_{\mu} & y_{\nu}\\ <br />
\end{bmatrix}=Q^{-1}\begin{bmatrix} <br />
-\nabla_u L & -\nabla_v L & 0 & 0 \\ <br />
\end{bmatrix}^T.<br />
</math><br />
<br />
Hence, the forward pass is given by using the expression in (2) to solve for the logit equilibrium given P, and the backward pass is given by using <math> \nabla_u L </math> and <math> \nabla_v L </math> to obtain <math> \nabla_P L </math> using (3). There does not always exist a unique P which generates <math> u^*, v^* </math> under the logit QRE, and we cannot expect to recover P when under-constrained.<br />
<br />
== Learning Extensive form games ==<br />
<br />
The normal form representation quickly becomes intractable for games where players have many choices. For example, consider chess: on the first turn, player 1 has 20 possible moves and then player 2 has 20 possible responses. If on subsequent turns each player has roughly 30 possible moves, and a typical game lasts 40 moves per player, the total number of strategies is roughly <math>10^{120} </math> per player (this is known as the Shannon number for the game-tree complexity of chess), so the payoff matrix for a typical game of chess would have <math> O(10^{240}) </math> entries.<br />
<br />
Instead, it is much more useful to represent the game graphically as an '''extensive form game''' (EFG). We also need to consider games with '''imperfect information''', where players do not necessarily have access to the full state of the game. An example is one-card poker: (1) each player draws a single card from a 13-card deck (ignoring suits); (2) player 1 decides whether to raise or hold; (3) player 2 decides whether to call or raise; (4) player 1 must either call or fold if player 2 raised. From this description, player 1 has <math> 2^{13} </math> possible first moves (all combinations of (card, raise/hold)) and <math> 2^{13} </math> possible second moves (whenever player 1 gets a second move), for a total of <math> 2^{26} </math> possible strategies. In addition, player 1 never knows what card player 2 holds and vice versa. So instead of representing the game with a huge payoff matrix, we can represent it as a simple decision tree (for a ''single'' drawn card of player 1):<br />
<br />
<br />
<center> [[File:1cardpoker.PNG]] </center><br />
<br />
where player 1 is represented by "1", a node that has two branches corresponding to the allowed moves of player 1. However, there must also be a notion of the information available to either player: while this tree might correspond to, say, player 1 holding a "9", it contains no information on what card player 2 is holding (and is much simpler because of this). This leads to the definition of an '''information set''': the set of all nodes belonging to a single player for which the other player cannot distinguish which node has been reached. The information set may therefore be treated as a node itself, for which actions stemming from the node must be chosen in ignorance of what the other player did immediately before arriving at the node. In the poker example, the full game tree consists of a much more complex version of the tree shown above (containing repetitions of the given tree for every possible combination of cards dealt), and an example of an information set for player 1 is the set of all nodes owned by player 2 that immediately follow player 1's decision to hold. In other words, if player 1 holds, there are 13 possible nodes describing the responses of player 2 (raise/hold for player 2 having card = ace, 2, ..., king), and all 13 of these nodes are indistinguishable to player 1, and so form an information set for player 1.<br />
<br />
The following is a review of important concepts for extensive form games first formalized in [2]. Let <math> \mathcal{I}_i </math> be the set of all information sets for player i, and for each <math> t \in \mathcal{I}_i </math> let <math> \sigma_t </math> be the sequence of actions taken by player i to arrive at <math> t </math> and <math> C_t </math> be the set of actions that player i can take from <math> t </math>. Then the set of all possible sequences that can be taken by player i is given by<br />
<br />
$$<br />
S_i = \{\emptyset \} \cup \{ \sigma_t c \mid t\in \mathcal{I}_i, c \in C_t \}<br />
$$<br />
<br />
So for one-card poker we would have <math>S_1 = \{\emptyset, \text{raise}, \text{hold}, \text{hold-call}, \text{hold-fold}\}</math>. From the possible sequences follow two important concepts:<br />
<ol><br />
<li>The EFG '''payoff matrix''' <math> P </math> is size <math>|S_1| \times |S_2| </math> (this is all possible actions available to either player), is populated with rewards from each leaf of the tree (or "zero" for each <math> (s_1, s_2) </math> that is an invalid pair), and the expected payoff for realization plans <math> (u, v) </math> is given by <math> u^T P v </math> </li><br />
<li> A '''realization plan''' <math> u \in \mathbb{R}^{|S_1|} </math> for player 1 (<math> v \in \mathbb{R}^{|S_2|} </math> for player 2 ) will describe probabilities for players to carry out each possible sequence, and each realization plan must be constrained by (i) compatibility of sequences (e.g. "raise" is not compatible with "hold-call") and (ii) information sets available to the player. These constraints are linear:<br />
<br />
$$<br />
Eu = e \\<br />
Fv = f<br />
$$<br />
<br />
where <math> e = f = (1, 0, ..., 0)^T </math> and <math> E, F</math> contain entries in <math> \{-1, 0, 1\} </math> describing compatibility and information sets. </li><br />
<br />
</ol> <br />
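As an illustration of these constraints, the sequence-form matrix <math>E</math> for player 1's sequences in the single-card tree above can be written down directly (a sketch for this small example; the matrices for the full game are given in [1]):<br />

```python
import numpy as np

# Sequences of player 1, indexed as:
# 0: empty, 1: raise, 2: hold, 3: hold-call, 4: hold-fold.
# Each row of E encodes one linear constraint on the realization plan u:
#   u_empty = 1
#   u_raise + u_hold = u_empty           (first information set)
#   u_hold-call + u_hold-fold = u_hold   (second information set)
E = np.array([[ 1,  0,  0,  0,  0],
              [-1,  1,  1,  0,  0],
              [ 0,  0, -1,  1,  1]], dtype=float)
e = np.array([1.0, 0.0, 0.0])

# A realization plan induced by behavioral probabilities
# (raise with prob 0.4; after holding, call with prob 0.7):
u = np.array([1.0, 0.4, 0.6, 0.6 * 0.7, 0.6 * 0.3])
```

Any realization plan built this way from behavioral probabilities satisfies <math>Eu = e</math> exactly.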
<br />
<br />
The paper's main contribution is to develop a minmax problem for extensive form games:<br />
<br />
<br />
$$<br />
\min_u \max_v u^T P v + \sum_{t\in \mathcal{I}_1} \sum_{c \in C_t} u_c \log \frac{u_c}{u_{p_t}} - \sum_{t\in \mathcal{I}_2} \sum_{c \in C_t} v_c \log \frac{v_c}{v_{p_t}}<br />
$$<br />
<br />
where <math> p_t </math> is the sequence immediately preceding information set <math> t </math>. Intuitively, each sum resembles a cross entropy over the distribution of probabilities in the realization plan, comparing each probability of proceeding from an information set to the probability of arriving at that information set. Importantly, these entropy terms are strictly convex and concave (for player 1 and player 2 respectively) [3], so the min-max problem has a unique solution and ''the objective function is continuous and continuously differentiable'': this means there is a way to optimize the function. As noted in Theorem 1 of [1], the solution to this problem is equivalently a solution for the QRE of the game in reduced normal form.<br />
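The regularization terms can be evaluated directly from a realization plan. The snippet below computes player 1's term for a hypothetical realization plan with uniform behavioral probabilities, using the sequence set <math>S_1</math> above (illustrative only):<br />

```python
import math

# Realization plan over S_1 = {empty, raise, hold, hold-call, hold-fold}
# with uniform behavioral probabilities (0.5 at each decision point).
u = {"empty": 1.0, "raise": 0.5, "hold": 0.5,
     "hold-call": 0.25, "hold-fold": 0.25}

# Information sets of player 1: (parent sequence p_t, children C_t).
info_sets = [("empty", ["raise", "hold"]),
             ("hold", ["hold-call", "hold-fold"])]

# Dilated entropy term:  sum_t sum_{c in C_t} u_c * log(u_c / u_{p_t})
reg = sum(u[c] * math.log(u[c] / u[p]) for p, cs in info_sets for c in cs)
```

Each inner sum contributes the (negative) entropy of the local decision, scaled by the probability of reaching that information set, so uniform play here gives <math>-\tfrac{3}{2}\log 2</math>.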
<br />
Having decided on a cost function, the method of Lagrange multipliers may be used to construct the Lagrangian that encodes the known constraints (<math> Eu = e \,, Fv = f </math>, and <math> u, v \geq 0</math>), and then the Lagrangian is optimized using Newton's method (identically to the previous section). Accounting for the constraints, the Lagrangian becomes <br />
<br />
<br />
$$<br />
\mathcal{L} = g(u, v) + \sum_i \mu_i(Eu - e)_i + \sum_i \nu_i (Fv - f)_i<br />
$$<br />
<br />
where <math>g</math> is the argument from the minmax statement above and <math>u, v \geq 0</math> become KKT conditions. The general update rule for Newton's method may be written in terms of the derivatives of <math> \mathcal{L} </math> with respect to primal variables <math>u, v </math> and dual variables <math> \mu, \nu</math>, yielding:<br />
<br />
$$<br />
\nabla_{u,v,\mu,\nu}^2 \mathcal{L} \cdot (\Delta u, \Delta v, \Delta \mu, \Delta \nu)^T= - \nabla_{u,v,\mu,\nu} \mathcal{L}<br />
$$<br />
where <math>\nabla_{u,v,\mu,\nu}^2 \mathcal{L} </math> is the Hessian of the Lagrangian and <math>\nabla_{u,v,\mu,\nu} \mathcal{L} </math> is simply a column vector of the KKT stationarity conditions. Combined with the previous section, this completes the goal of the paper: To construct a differentiable problem for learning normal form and extensive form games.<br />
<br />
== Experiments ==<br />
<br />
The authors demonstrated learning of extensive form games in the presence of ''side information'', with ''partial observations'', using three experiments. In all cases, the goal was to maximize the likelihood of realizing an observed sequence from the player, assuming they act in accordance with the QRE.<br />
<br />
=== Rock, Paper, Scissors ===<br />
<br />
The first experiment was to learn a non-symmetric variant of Rock, Paper, Scissors with ''incomplete information'' with the following payoff matrix:<br />
<br />
{| class="wikitable" style="float:center; margin-left:1em; text-align:center;"<br />
|+ align="bottom"|''Payoff matrix of modified Rock-Paper-Scissors''<br />
! <br />
! ''Rock''<br />
! ''Paper''<br />
! ''Scissors''<br />
|-<br />
! ''Rock''<br />
| '''''0'''''<br />
| <math>-b_1</math><br />
| <math>b_2</math><br />
|-<br />
! ''Paper''<br />
| <math>b_1</math><br />
| '''''0'''''<br />
| <math>-b_3</math><br />
|-<br />
! ''Scissors''<br />
| <math>-b_2</math><br />
| <math>b_3</math><br />
| '''''0'''''<br />
|}<br />
<br />
where each of the <math> b </math>’s is a linear function of some features <math> x \in \mathbb{R}^{2} </math> (i.e., <math> b_y = x^Tw_y </math>, <math> y \in </math> {<math>1,2,3</math>}, where the <math> w_y </math> are to be learned by the algorithm). Over many trials with random rewards, the technique produced the following results for optimal strategies [1]: <br />
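The parameterization above can be written out explicitly; note that the resulting payoff matrix is skew-symmetric by construction. The weights and features below are made-up placeholders, not values from the paper:<br />

```python
import numpy as np

def rps_payoff(x, w):
    """Payoff matrix of the modified RPS game, with b_y = x^T w_y.
    w is a (3, 2) array whose rows are the learnable weights w_1..w_3."""
    b1, b2, b3 = w @ x
    return np.array([[  0, -b1,  b2],
                     [ b1,   0, -b3],
                     [-b2,  b3,   0]])

# Hypothetical weights and context features, for illustration only.
w = np.array([[1.0, 0.5], [0.2, -0.3], [-1.0, 2.0]])
x = np.array([0.4, 1.1])
P = rps_payoff(x, w)
```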
<br />
[[File:RPS Results.png|500px ]]<br />
<br />
From the graphs above, we can see that (1) both the learned parameters and the predicted strategies improve with larger datasets; and (2) with a reasonably sized dataset (>1000 here), convergence is stable and fairly quick.<br />
<br />
=== One-Card Poker ===<br />
<br />
Next, they investigated extensive form games using the one-card poker game (with ''imperfect information'') introduced in the previous section. In the experimental setup, they used a deck stacked non-uniformly (meaning repeat cards were allowed). Their goal was to learn this distribution of cards from observations of many rounds of play. Note that the players’ perceived or believed distribution of cards may differ from the distribution of cards actually dealt. Three experiments were run with <math> n=4 </math>, each comprising 5 runs of training with the same weights but different training sets. Let <math> d \in \mathbb{R}^{n}, d \ge 0, \sum_{i} d_i = 1 </math> be the weights of the cards. The probability that the players are dealt cards <math> (i,j) </math> is <math> \frac{d_i d_j}{1-d_i} </math>; this distribution is asymmetric between players. The matrices <math> P, E, F </math> for the case <math> n=4 </math> are presented in [1]. After training for 2500 epochs, the mean squared errors of the learned parameters (card weights, <math> u, v </math>), averaged over all runs, are presented below [1]: <br />
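A quick sanity check of this dealing distribution (a sketch assuming the second card is drawn without replacement, so <math>j \ne i</math>):<br />

```python
import numpy as np

d = np.array([0.1, 0.2, 0.3, 0.4])   # card weights, n = 4

# P(players dealt (i, j)) = d_i * d_j / (1 - d_i), for j != i:
# player 1 draws card i with prob d_i, then player 2 draws card j
# from the remaining weight 1 - d_i.
prob = np.array([[d[i] * d[j] / (1 - d[i]) if j != i else 0.0
                  for j in range(4)] for i in range(4)])
```

The probabilities sum to 1, and the matrix is not symmetric, which is the asymmetry between players mentioned above.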
<br />
<br />
[[File:One-card_Poker_Results.png|500px ]]<br />
<br />
=== Security Resource Allocation Game ===<br />
<br />
<br />
In the security resource allocation game, they demonstrated the ability to learn from ''imperfect observations''. The defender possesses <math> k </math> indistinguishable and indivisible defensive resources, which are split among <math> n </math> targets {<math> T_1, \dots, T_n </math>}. The attacker chooses one target. If the attack succeeds, the attacker gets reward <math> R_i </math> and the defender gets <math> -R_i </math>; otherwise both receive zero payoff. If <math> m </math> defensive resources guard <math> T_i </math>, the probability of a successful attack on <math> T_i </math> is <math> \frac{1}{2^m} </math>. The expected defender payoff matrix when <math> n = 2, k = 3 </math>, with the attacker as the row player, is:<br />
<br />
{| class="wikitable" style="float:center; margin-left:1em; text-align:center;"<br />
|+ align="bottom"|''Payoff matrix when <math> n = 2, k = 3 </math>''<br />
! {#<math>D_1</math>,#<math>D_2</math>}<br />
! {0, 3}<br />
! {1, 2}<br />
! {2, 1}<br />
! {3, 0}<br />
|-<br />
! <math>T_1</math><br />
| <math>-R_1</math><br />
| <math>-\frac{1}{2}R_1</math><br />
| <math>-\frac{1}{4}R_1</math><br />
| <math>-\frac{1}{8}R_1</math><br />
|-<br />
! <math>T_2</math><br />
| <math>-\frac{1}{8}R_2</math><br />
| <math>-\frac{1}{4}R_2</math><br />
| <math>-\frac{1}{2}R_2</math><br />
| <math>-R_2</math><br />
|} <br />
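The table entries follow directly from the success-probability formula. The snippet below reconstructs the expected defender payoffs for <math>n = 2, k = 3</math> using placeholder rewards <math>R_1, R_2</math> (illustrative only):<br />

```python
import numpy as np

def security_payoff(R, k):
    """Expected defender payoff matrix: rows are attacked targets,
    columns are allocations (d_1, k - d_1) of k defensive resources.
    An attack on T_i guarded by m resources succeeds with prob 1/2^m,
    costing the defender R_i in expectation times that probability."""
    n = len(R)
    allocations = [(d1, k - d1) for d1 in range(k + 1)]
    return np.array([[-R[i] / 2 ** alloc[i] for alloc in allocations]
                     for i in range(n)])

R = [1.0, 2.0]          # placeholder rewards R_1, R_2
P = security_payoff(R, 3)
```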
<br />
<br />
For a multi-stage game, the attacker can launch <math> t </math> attacks, one per stage, while the defender must keep the allocation chosen in stage 1. The attacker may change targets if the attack in stage 1 fails. Three experiments were run with <math> n = 2, k = 5 </math> for games with single and double attacks, i.e., <math> t = 1 </math> and <math> t = 2 </math>. The results of the simulated experiments are shown below [1]:<br />
<br />
[[File:Security Game Results.png|500px ]]<br />
<br />
<br />
They learned the <math> R_i </math> based only on observations of the defender’s actions, and could still recover the game setting. As expected, larger dataset sizes improve the learned parameters. Two outliers are (1) the security game, in the green plot for <math> t = 2 </math>; and (2) RPS, when comparing training sizes of 2000 and 5000.<br />
<br />
== Conclusion ==<br />
Unsurprisingly, the results of this study show that, in general, the quality of the learned parameters improved as the number of observations increased. The network presented in this paper demonstrated improvement over existing methodology. <br />
<br />
This paper presents an end-to-end framework for implementing a game solver, for both extensive and normal form, as a module in a deep neural network for zero-sum games. This method, unlike many previous works in this area, does not require the parameters of the game to be known to the agent prior to the start of the game. The two-part method analytically computes both the optimal solution and the parameters of the game. Future work involves taking advantage of the KKT matrix structure to increase computation speed, and extensions to the area of learning general-sum games.<br />
<br />
== References ==<br />
<br />
[1] Ling, C. K., Fang, F., & Kolter, J. Z. (2018). What game are we playing? end-to-end learning in normal and extensive form games. arXiv preprint arXiv:1805.02777.<br />
<br />
[2] B. von Stengel. Efficient computation of behavior strategies.Games and Economics Behavior,14(0050):220–246, 1996.<br />
<br />
[3] Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.</div>
<hr />
<div>== Authors == <br />
Yuxin Wang, Evan Peters, Yifan Mou, Sangeeth Kalaichanthiran <br />
<br />
== Introduction ==<br />
In recent times, there have been many different studies of methods for using AI to solve large-scale, zero-sum, extensive form problems. However, most of these works operate under the assumption that the parameters of the game are known, and the objective is just finding the optimal strategy to the game. This scenario proves to be unrealistic in real-world scenarios where there are often times when one must find the optimal strategy while the parameters of game are unknown. This paper proposes a framework for finding an optimal solution using a primal-dual Newton Method, and then using back-propagation to analytically compute the gradients of all the relevant game parameters.<br />
<br />
The approach to solving this problem is to consider ''quantal response equilibrium'' (QRE), which is a generalization of Nash equilibrium (NE) where the agents can make suboptimal decisions. It is shown that the solution to the QRE is a differentiable function of the payoff matrix. Consequently, back-propagation can be used to analytically solve for the payoff matrix (or other game parameters). This strategy has many future application areas as it allows for game-solving (both extensive and normal form) to be integrated as a module in a deep neural network.<br />
<br />
[[File:Framework.png ]]<br />
<br />
The effectiveness of this model is demonstrated using the games “Rock, Paper, Scissors”, one-card poker, and a security defence game.<br />
<br />
== Learning and Quantal Response in Normal Form Games ==<br />
<br />
The game-solving module provides all elements required in differentiable learning, which maps contextual features to payoff matrices, and computes equilibrium strategies under a set of contextual features. This paper will learn zero-sum games and start with normal form games since they have game solver and learning approach capturing much of intuition and basic methodology.<br />
<br />
=== Zero-Sum Normal Form Games ===<br />
<br />
In two-player zero-sum games there is a '''payoff matrix''' <math>P</math> that describes the rewards for two players employing specific strategies u and v respectively. The optimal strategy mixture may be found with a classic min-max formulation:<br />
$$\min_u \max_v \ u^T P v \\ subject \ to \ 1^T u =1, u \ge 0 \\ 1^T v =1, v \ge 0. \ $$<br />
<br />
Here, we consider the case where <math>P</math> is not known a priori. The solution <math> (u^*, v_0) </math> to this optimization and the solution <math> (u_0,v^*) </math> to the corresponding problem with inverse player order form the Nash equilibrium. When the payoff matrix P is not known, we observe samples of actions <math> a^{(i)}, i =1,...,N </math> from one or both players, sampled from the equilibrium strategies <math>(u^*,v^*) </math>, to recover the true underlying payoff matrix P or a function form P(x) depending on the current context.<br />
<br />
=== Quantal Response Equilibria ===<br />
<br />
However, NE is poorly suited because NEs are overly strict and discontinuous with respect to P, and NEs may not be unique. Thus, it is more common to model the players' actions with the '''quantal response equilibria''' (QRE) to address these issues. Specifically, consider the ''logit'' equilibrium for zero-sum games that obeys the fixed point:<br />
$$<br />
u^* _i = \frac {exp(Pv)_i}{\sum_{q \in [n]} exp (-Pv)_q}, \ v^* _j= \frac {exp(P^T u)_j}{\sum_{q \in [m]} exp (P^T u)_q} .\qquad \ (1)<br />
$$<br />
For a fixed opponent strategy, the logit equilibrium corresponding to a strategy is strictly convex, and thus the regularized best response is unique.<br />
<br />
=== End-to-End Learning ===<br />
<br />
Then to integrate zero-sum solver, [1] introduced a method to solve the QRE and to differentiate through its solution.<br />
<br />
'''QRE solver''':<br />
To find the fixed point in (1), it is equivalent to solve the regularized min-max game:<br />
$$<br />
\min_{u \in \mathbb{R}^n} \max_{v \in \mathbb{R}^m} \ u^T P v -H(v) + H(u) \\<br />
\text{subject to } 1^T u =1, \ 1^T v =1, <br />
$$<br />
where H(y) is the Gibbs entropy <math> \sum_i y_i log y_i</math>.<br />
Entropy regularization guarantees the non-negative condition and makes the equilibrium continuous with respect to P, which means players are encouraged to play more randomly, and all actions have non-zero probability. Moreover, this problem has a unique saddle point corresponding to <math> (u^*, v^*) </math>.<br />
<br />
Using a primal-dual Newton Method to solve the QRE for two-player zero-sum games, the KKT conditions for the problem are:<br />
$$ <br />
Pv + \log(u) + 1 +\mu 1 = 0 \\<br />
P^T v -\log(v) -1 +\nu 1 = 0 \\<br />
1^T u = 1, \ 1^T v = 1, <br />
$$<br />
where <math> (\mu, \nu) </math> are Lagrange multipliers for the equality constraints on u, v respectively. Then applying Newton's method gives the the update rule:<br />
$$<br />
Q \begin{bmatrix} \Delta u \\ \Delta v \\ \Delta \mu \\ \Delta \nu \\ \end{bmatrix} = - \begin{bmatrix} P v + \log u + 1 + \mu 1 \\ P^T u - \log v - 1 + \nu 1 \\ 1^T u - 1 \\ 1^T v - 1 \\ \end{bmatrix}, \qquad (2)<br />
$$<br />
where Q is the Hessian of the Lagrangian, given by <br />
$$ <br />
Q = \begin{bmatrix} <br />
diag(\frac{1}{u}) & P & 1 & 0 \\ <br />
P^T & -diag(\frac{1}{v}) & 0 & 1\\<br />
1^T & 0 & 0 & 0 \\<br />
0 & 1^T & 0 & 0 \\<br />
\end{bmatrix}. <br />
$$<br />
<br />
'''Differentiating Through QRE Solutions''':<br />
The QRE solver provides a method to compute the necessary Jacobian-vector products. Specifically, we compute the gradient of the loss given the solution <math> (u^*,v^*) </math> to the QRE, and some loss function <math> L(u^*,v^*) </math>: <br />
<br />
1. Take differentials of the KKT conditions: <br />
<math><br />
Q \begin{bmatrix} <br />
du & dv & d\mu & d\nu \\ <br />
\end{bmatrix} ^T = \begin{bmatrix} <br />
-dPv & -dP^Tu & 0 & 0 \\ <br />
\end{bmatrix}^T. \ <br />
</math><br />
<br />
2. For small changes du, dv, <br />
<math><br />
dL = \begin{bmatrix} <br />
v^TdP^T & u^TdP & 0 & 0 \\ <br />
\end{bmatrix} Q^{-1} \begin{bmatrix} <br />
-\nabla_u L & -\nabla_v L & 0 & 0 \\ <br />
\end{bmatrix}^T.<br />
</math><br />
<br />
3. Apply this to P, and take limits as dP is small:<br />
<math><br />
\nabla_P L = y_u v^T + u y_v^T, \qquad (3)<br />
</math> where <br />
<math><br />
\begin{bmatrix} <br />
y_u & y_v & y_{\mu} & y_{\nu}\\ <br />
\end{bmatrix}=Q^{-1}\begin{bmatrix} <br />
-\nabla_u L & -\nabla_v L & 0 & 0 \\ <br />
\end{bmatrix}^T.<br />
</math><br />
<br />
Hence, the forward pass is given by using the expression in (2) to solve for the logit equilibrium given P, and the backward pass is given by using <math> \nabla_u L </math> and <math> \nabla_v L </math> to obtain <math> \nabla_P L </math> using (3). There does not always exist a unique P which generates <math> u^*, v^* </math> under the logit QRE, and we cannot expect to recover P when under-constrained.<br />
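As a sketch of the backward pass (again illustrative, assuming the equilibrium <math> (u^*, v^*) </math> has already been computed by a forward pass), equation (3) amounts to a single linear solve against <math> Q </math>:<br />

```python
import numpy as np

def qre_backward(P, u, v, grad_u, grad_v):
    """Backward pass through a logit-QRE solution (equation (3)).

    Given an equilibrium (u, v) of payoff matrix P and loss gradients
    grad_u = dL/du, grad_v = dL/dv, returns dL/dP = y_u v^T + u y_v^T,
    where (y_u, y_v, y_mu, y_nu) = Q^{-1} (-grad_u, -grad_v, 0, 0).
    """
    n, m = P.shape
    # Rebuild the Hessian of the Lagrangian at the equilibrium
    Q = np.zeros((n + m + 2, n + m + 2))
    Q[:n, :n] = np.diag(1.0 / u)
    Q[:n, n:n + m] = P
    Q[n:n + m, :n] = P.T
    Q[n:n + m, n:n + m] = -np.diag(1.0 / v)
    Q[:n, n + m] = Q[n + m, :n] = 1.0
    Q[n:n + m, n + m + 1] = Q[n + m + 1, n:n + m] = 1.0
    rhs = np.concatenate([-grad_u, -grad_v, [0.0, 0.0]])
    y = np.linalg.solve(Q, rhs)
    y_u, y_v = y[:n], y[n:n + m]
    return np.outer(y_u, v) + np.outer(u, y_v)
```

For matching pennies at the uniform equilibrium with loss <math> L = u_1 </math> (so <math> \nabla_u L = (1, 0) </math>, <math> \nabla_v L = 0 </math>), this gives <math> \nabla_P L = \begin{bmatrix} -1/8 & 0 \\ 0 & 1/8 \end{bmatrix} </math>: increasing the payoff of the first row pushes the minimizing row player away from it.<br />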
<br />
== Learning Extensive form games ==<br />
<br />
The normal form representation quickly becomes intractable for games in which players have many choices. For example, consider chess: on the first turn, player 1 has 20 possible moves, and player 2 then has 20 possible responses. If each player is estimated to have ~30 possible moves on each subsequent turn, and a typical game lasts 40 moves per player, the total number of strategies is roughly <math>10^{120} </math> per player (this is known as the Shannon number for the game-tree complexity of chess), so the payoff matrix for a typical game of chess would have <math> O(10^{240}) </math> entries.<br />
<br />
Instead, it is much more useful to represent the game graphically as an '''extensive form game''' (EFG). We also need to consider games with '''imperfect information''', in which players do not necessarily have access to the full state of the game. An example is one-card poker: (1) each player draws a single card from a 13-card deck (ignoring suits); (2) player 1 decides whether to raise or hold; (3) player 2 decides whether to call or raise; (4) if player 2 raised, player 1 must either call or fold. From this description, player 1 has <math> 2^{13} </math> possible first moves (all combinations of (card, raise/hold)) and <math> 2^{13} </math> possible second moves (whenever player 1 gets a second move), for a total of <math> 2^{26} </math> possible strategies. In addition, player 1 never knows what card player 2 has and vice versa. So instead of representing the game with a huge payoff matrix, we can represent it as a simple decision tree (for a ''single'' drawn card of player 1):<br />
<br />
<br />
<center> [[File:1cardpoker.PNG]] </center><br />
<br />
where player 1 is represented by "1", a node with two branches corresponding to player 1's allowed moves. However, there must also be a notion of the information available to either player: while this tree might correspond to, say, player 1 holding a "9", it contains no information on what card player 2 is holding (and is much simpler because of this). This leads to the definition of an '''information set''': the set of all nodes belonging to a single player for which the other player cannot distinguish which node has been reached. The information set may therefore be treated as a node itself, for which actions stemming from the node must be chosen in ignorance of what the other player did immediately before arriving at the node. In the poker example, the full game tree is a much more complex version of the tree shown above (containing repetitions of the given tree for every possible combination of cards dealt), and an example of an information set for player 1 is the set of all nodes owned by player 2 that immediately follow player 1's decision to hold. In other words, if player 1 holds, there are 13 possible nodes describing the responses of player 2 (raise/hold for each possible card of player 2, ace through king), and all 13 of these nodes are indistinguishable to player 1, so they form an information set for player 1.<br />
<br />
The following is a review of important concepts for extensive form games, first formalized in [2]. Let <math> \mathcal{I}_i </math> be the set of all information sets for player i; for each <math> t \in \mathcal{I}_i </math>, let <math> \sigma_t </math> be the sequence of actions taken by player i to arrive at <math> t </math> and <math> C_t </math> be the actions that player i can take from <math> t </math>. Then the set of all possible sequences that can be taken by player i is given by<br />
<br />
$$<br />
S_i = \{\emptyset \} \cup \{ \sigma_t c \mid t \in \mathcal{I}_i, c \in C_t \}<br />
$$<br />
<br />
So for one-card poker we would have <math>S_1 = \{\emptyset, \text{raise}, \text{hold}, \text{hold-call}, \text{hold-fold}\}</math>. From the possible sequences follow two important concepts:<br />
<ol><br />
<li>The EFG '''payoff matrix''' <math> P </math> has size <math>|S_1| \times |S_2| </math> (indexed by all possible sequences of either player), is populated with rewards from each leaf of the tree (or zero for each invalid pair <math> (s_1, s_2) </math>), and the expected payoff for realization plans <math> (u, v) </math> is given by <math> u^T P v </math> </li><br />
<li> A '''realization plan''' <math> u \in \mathbb{R}^{|S_1|} </math> for player 1 (<math> v \in \mathbb{R}^{|S_2|} </math> for player 2 ) will describe probabilities for players to carry out each possible sequence, and each realization plan must be constrained by (i) compatibility of sequences (e.g. "raise" is not compatible with "hold-call") and (ii) information sets available to the player. These constraints are linear:<br />
<br />
$$<br />
Eu = e \\<br />
Fv = f<br />
$$<br />
<br />
where <math> e = f = (1, 0, ..., 0)^T </math> and <math> E, F</math> contain entries in <math> \{-1, 0, 1\} </math> describing compatibility and information sets. </li><br />
<br />
</ol> <br />
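To make the sequence-form constraints concrete, the following sketch hand-builds <math> E </math> and <math> e </math> for player 1 of one-card poker at a single deal (the matrices here are illustrative assumptions, not reproduced from [1]). Each row of <math> E </math> states that the probabilities of the sequences leaving an information set sum to the probability of the sequence entering it:<br />

```python
import numpy as np

# Sequences of player 1: [empty, raise, hold, hold-call, hold-fold]
# Row 0: u_empty = 1
# Row 1: u_raise + u_hold = u_empty          (first information set)
# Row 2: u_hold-call + u_hold-fold = u_hold  (second information set)
E = np.array([
    [ 1,  0,  0,  0,  0],
    [-1,  1,  1,  0,  0],
    [ 0,  0, -1,  1,  1],
])
e = np.array([1, 0, 0])

# A behavioral strategy: raise 30% of the time; after holding and
# facing a raise, call 60% of the time.  Its realization plan:
u = np.array([1.0, 0.3, 0.7, 0.7 * 0.6, 0.7 * 0.4])

assert np.allclose(E @ u, e)  # a valid realization plan satisfies Eu = e
```

Any behavioral strategy, pushed forward into a realization plan in this way, satisfies <math> Eu = e </math> automatically.<br />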
<br />
<br />
The paper's main contribution is to develop a minmax problem for extensive form games:<br />
<br />
<br />
$$<br />
\min_u \max_v u^T P v + \sum_{t\in \mathcal{I}_1} \sum_{c \in C_t} u_c \log \frac{u_c}{u_{p_t}} - \sum_{t\in \mathcal{I}_2} \sum_{c \in C_t} v_c \log \frac{v_c}{v_{p_t}}<br />
$$<br />
<br />
where <math> p_t </math> is the sequence immediately preceding information set <math> t </math>. Intuitively, each sum resembles a cross entropy over the probabilities in the realization plan, comparing each probability of proceeding from an information set to the probability of arriving at that information set. Importantly, these entropy terms are strictly convex and concave (for player 1 and player 2 respectively) [3], so the minmax problem has a unique solution and ''the objective function is continuous and continuously differentiable'' - this means there is a way to optimize the function. As noted in Theorem 1 of [1], the solution to this problem is equivalently a solution for the QRE of the game in reduced normal form.<br />
<br />
Having decided on a cost function, the method of Lagrange multipliers may be used to construct a Lagrangian that encodes the known constraints (<math> Eu = e \,, Fv = f </math>, and <math> u, v \geq 0</math>), which is then optimized with Newton's method (as in the previous section). Accounting for the constraints, the Lagrangian becomes <br />
<br />
<br />
$$<br />
\mathcal{L} = g(u, v) + \sum_i \mu_i(Eu - e)_i + \sum_i \nu_i (Fv - f)_i<br />
$$<br />
<br />
where <math>g</math> is the objective of the minmax problem above and the constraints <math>u, v \geq 0</math> enter through the KKT conditions. The general update rule for Newton's method may be written in terms of the derivatives of <math> \mathcal{L} </math> with respect to the primal variables <math>u, v </math> and dual variables <math> \mu, \nu</math>, yielding:<br />
<br />
$$<br />
\nabla_{u,v,\mu,\nu}^2 \mathcal{L} \cdot (\Delta u, \Delta v, \Delta \mu, \Delta \nu)^T= - \nabla_{u,v,\mu,\nu} \mathcal{L}<br />
$$<br />
where <math>\nabla_{u,v,\mu,\nu}^2 \mathcal{L} </math> is the Hessian of the Lagrangian and <math>\nabla_{u,v,\mu,\nu} \mathcal{L} </math> is the column vector of KKT residuals. Combined with the previous section, this completes the goal of the paper: to construct a differentiable problem for learning normal form and extensive form games.<br />
<br />
== Experiments ==<br />
<br />
The authors demonstrated learning on extensive form games in the presence of ''side information'' and with ''partial observations'', using three experiments. In all cases, the goal was to maximize the likelihood of the observed action sequences, assuming the players act in accordance with the QRE.<br />
<br />
=== Rock, Paper, Scissors ===<br />
<br />
The first experiment was to learn a non-symmetric variant of Rock, Paper, Scissors with ''incomplete information'' with the following payoff matrix:<br />
<br />
{| class="wikitable" style="float:center; margin-left:1em; text-align:center;"<br />
|+ align="bottom"|''Payoff matrix of modified Rock-Paper-Scissors''<br />
! <br />
! ''Rock''<br />
! ''Paper''<br />
! ''Scissors''<br />
|-<br />
! ''Rock''<br />
| '''''0'''''<br />
| <math>-b_1</math><br />
| <math>b_2</math><br />
|-<br />
! ''Paper''<br />
| <math>b_1</math><br />
| '''''0'''''<br />
| <math>-b_3</math><br />
|-<br />
! ''Scissors''<br />
| <math>-b_2</math><br />
| <math>b_3</math><br />
| '''''0'''''<br />
|}<br />
<br />
where each of the <math> b </math>'s is a linear function of some features <math> x \in \mathbb{R}^{2} </math> (i.e., <math> b_y = x^Tw_y </math> for <math> y \in \{1,2,3\} </math>, where the <math> w_y </math> are to be learned by the algorithm). Using many trials with random rewards, the technique produced the following results for optimal strategies [1]: <br />
<br />
[[File:RPS Results.png|500px ]]<br />
<br />
From the graphs above, we can tell that 1) both the learned parameters and the predicted strategies improve with a larger dataset; and 2) with a reasonably sized dataset (>1000 here), convergence is stable and fairly quick.<br />
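The contextual payoff construction for this experiment can be sketched as follows (the weight layout, with the <math> w_y </math> as columns of a <math> 2 \times 3 </math> matrix, is an assumption for illustration):<br />

```python
import numpy as np

def rps_payoff(x, W):
    """Payoff matrix of the modified RPS game with b_y = x^T w_y.

    x is the feature vector in R^2; W is 2x3 with w_1, w_2, w_3 as
    columns (an assumed layout).  The b's fill the antisymmetric
    payoff table above.
    """
    b1, b2, b3 = x @ W
    return np.array([
        [0.0, -b1,  b2],
        [b1,  0.0, -b3],
        [-b2, b3,  0.0],
    ])
```

The resulting matrix is antisymmetric, as required for a symmetric-role zero-sum game.<br />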
<br />
=== One-Card Poker ===<br />
<br />
Next, they investigated extensive form games using the one-card poker game (with ''imperfect information'') introduced in the previous section. In the experimental setup, they used a deck stacked non-uniformly (meaning repeat cards were allowed). Their goal was to learn this distribution of cards from observations of many rounds of play. Note that the players' perceived or believed distribution of cards may differ from the distribution of cards actually dealt. Three experiments were run with <math> n=4 </math>. Each experiment comprised 5 runs of training, with the same weights but different training sets. Let <math> d \in \mathbb{R}^{n}, d \ge 0, \sum_{i} d_i = 1 </math> be the weights of the cards. The probability that the players are dealt cards <math> (i,j) </math> is <math> \frac{d_i d_j}{1-d_i} </math>; note that this distribution is asymmetric between the players. The matrices <math> P, E, F </math> for the case <math> n=4 </math> are presented in [1]. After training for 2500 epochs, the mean squared errors of the learned parameters (card weights, <math> u, v </math>) averaged over all runs are presented below [1]: <br />
<br />
<br />
[[File:One-card_Poker_Results.png|500px ]]<br />
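The deal distribution can be sanity-checked numerically: since <math> \sum_{j \neq i} d_j = 1 - d_i </math>, the probabilities <math> \frac{d_i d_j}{1-d_i} </math> sum to one. A quick sketch (the weights below are an arbitrary illustrative choice):<br />

```python
import numpy as np

# Arbitrary non-uniform card weights for n = 4 (illustrative choice)
d = np.array([0.1, 0.2, 0.3, 0.4])

# P[(i, j)] = d_i * d_j / (1 - d_i): player 1 draws card i, then
# player 2 draws card j from the remaining weight mass.
deal = np.array([[d[i] * d[j] / (1 - d[i]) if j != i else 0.0
                  for j in range(4)] for i in range(4)])

assert abs(deal.sum() - 1.0) < 1e-12  # it is a valid distribution
```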
<br />
=== Security Resource Allocation Game ===<br />
<br />
<br />
Using a security resource allocation game, they demonstrated the ability to learn from ''imperfect observations''. The defender possesses <math> k </math> indistinguishable and indivisible defensive resources to split among <math> n </math> targets <math> \{T_1, \dots, T_n\} </math>. The attacker chooses one target. If the attack succeeds, the attacker receives reward <math> R_i </math> and the defender receives <math> -R_i </math>; otherwise both receive zero payoff. If <math> m </math> defensive resources guard <math> T_i </math>, the probability of a successful attack on <math> T_i </math> is <math> \frac{1}{2^m} </math>. The expected payoff matrix for <math> n = 2, k = 3 </math>, with the attacker as the row player, is:<br />
<br />
{| class="wikitable" style="float:center; margin-left:1em; text-align:center;"<br />
|+ align="bottom"|''Payoff matrix when <math> n = 2, k = 3 </math>''<br />
! {#<math>D_1</math>,#<math>D_2</math>}<br />
! {0, 3}<br />
! {1, 2}<br />
! {2, 1}<br />
! {3, 0}<br />
|-<br />
! <math>T_1</math><br />
| <math>-R_1</math><br />
| <math>-\frac{1}{2}R_1</math><br />
| <math>-\frac{1}{4}R_1</math><br />
| <math>-\frac{1}{8}R_1</math><br />
|-<br />
! <math>T_2</math><br />
| <math>-\frac{1}{8}R_2</math><br />
| <math>-\frac{1}{4}R_2</math><br />
| <math>-\frac{1}{2}R_2</math><br />
| <math>-R_2</math><br />
|} <br />
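The table above can be generated mechanically from the success probability: with <math> d_i </math> resources on the attacked target <math> T_i </math>, the entry is <math> -R_i / 2^{d_i} </math>. An illustrative sketch (not the authors' code):<br />

```python
from fractions import Fraction
from itertools import product

def security_payoff_matrix(n, k, R):
    """Expected payoff matrix of the security game.

    Rows: attacked target T_i.  Columns: defender allocations, i.e.
    all ways of splitting k indivisible resources among n targets.
    With d[i] resources on T_i the attack succeeds with probability
    1 / 2**d[i], giving the table entry -R[i] / 2**d[i].
    """
    allocations = [d for d in product(range(k + 1), repeat=n)
                   if sum(d) == k]
    return [[-R[i] * Fraction(1, 2 ** d[i]) for d in allocations]
            for i in range(n)]
```

For <math> n = 2, k = 3, R_1 = R_2 = 1 </math> this reproduces the table above, with columns ordered <math> (0,3), (1,2), (2,1), (3,0) </math>.<br />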
<br />
<br />
In the multi-stage version of the game, the attacker can launch <math> t </math> attacks, one in each stage, while the defender must keep its stage-1 allocation. The attacker may switch targets if the attack in stage 1 fails. Experiments are run with <math> n = 2, k = 5 </math> for single-attack and double-attack games, i.e., <math> t = 1 </math> and <math> t = 2 </math>. The results of the simulated experiments are shown below [1]:<br />
<br />
[[File:Security Game Results.png|500px ]]<br />
<br />
<br />
They learned <math> R_i </math> based only on observations of the defender's actions, and could still recover the game setting from these observations alone. As expected, larger dataset sizes improve the learned parameters. Two outliers are 1) in the security game, the green plot for <math> t = 2 </math>; and 2) in RPS, the comparison between training sizes of 2000 and 5000.<br />
<br />
== Conclusion ==<br />
This paper presents an end-to-end framework for implementing a game solver, for both extensive and normal form, as a module in a deep neural network for zero-sum games. This method, unlike many previous works in this area, does not require the parameters of the game to be known to the agent prior to the start of the game. The two-part method analytically computes both the optimal solution and the parameters of the game. Future work involves taking advantage of the KKT matrix structure to increase computation speed, and extensions to the area of learning general-sum games.<br />
<br />
== References ==<br />
<br />
[1] Ling, C. K., Fang, F., & Kolter, J. Z. (2018). What game are we playing? end-to-end learning in normal and extensive form games. arXiv preprint arXiv:1805.02777.<br />
<br />
[2] B. von Stengel. Efficient computation of behavior strategies. Games and Economic Behavior, 14(2):220–246, 1996.<br />
<br />
[3] Boyd, S., Boyd, S. P., & Vandenberghe, L. (2004). Convex optimization. Cambridge university press.</div>Gtompkinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=what_game_are_we_playing&diff=44524what game are we playing2020-11-15T15:33:21Z<p>Gtompkin: /* Introduction */</p>
<hr />
<div>== Authors == <br />
Yuxin Wang, Evan Peters, Yifan Mou, Sangeeth Kalaichanthiran <br />
<br />
== Introduction ==<br />
In recent times, there have been many studies of methods for using AI to solve large-scale, zero-sum, extensive form games. However, most of these works assume that the parameters of the game are known, so that the objective is just to find the optimal strategy. This assumption is unrealistic in the real world, where one must often find an optimal strategy while the parameters of the game are unknown. This paper proposes a framework for finding an optimal solution using a primal-dual Newton method, and then using back-propagation to analytically compute the gradients of all relevant game parameters.<br />
<br />
The approach to solving this problem is to consider ''quantal response equilibrium'' (QRE), which is a generalization of Nash equilibrium (NE) where the agents can make suboptimal decisions. It is shown that the solution to the QRE is a differentiable function of the payoff matrix. Consequently, back-propagation can be used to analytically solve for the payoff matrix (or other game parameters). This strategy has many future application areas as it allows for game-solving (both extensive and normal form) to be integrated as a module in a deep neural network.<br />
<br />
[[File:Framework.png ]]<br />
<br />
The effectiveness of this model is demonstrated using the games “Rock, Paper, Scissors”, one-card poker, and a security defence game.<br />
<br />
== Learning and Quantal Response in Normal Form Games ==<br />
<br />
The game-solving module provides all the elements required for differentiable learning: it maps contextual features to payoff matrices and computes equilibrium strategies under a given set of contextual features. The paper focuses on learning zero-sum games, starting with normal form games, since their game solver and learning approach capture much of the intuition and basic methodology.<br />
<br />
=== Zero-Sum Normal Form Games ===<br />
<br />
In two-player zero-sum games there is a '''payoff matrix''' P that describes the rewards for two players employing specific strategies u and v respectively. The optimal strategy mixture may be found with a classic min-max formulation:<br />
$$\min_u \max_v \ u^T P v \\ \text{subject to } 1^T u =1, \ u \ge 0, \\ 1^T v =1, \ v \ge 0 $$<br />
<br />
The solution <math> (u^*, v_0) </math> to this optimization and the solution <math> (u_0,v^*) </math> to the corresponding problem with the player order reversed form a Nash equilibrium. When the payoff matrix P is not known, we observe samples of actions <math> a^{(i)}, i =1,...,N </math> from one or both players, drawn from the equilibrium strategies <math>(u^*,v^*) </math>, and seek to recover the true underlying payoff matrix P, or a function P(x) depending on the current context.<br />
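Since the min-max problem above is a linear program, a Nash equilibrium can be computed with any off-the-shelf LP solver. The following sketch uses the standard LP reformulation (this is not part of [1]; it assumes SciPy is available):<br />

```python
import numpy as np
from scipy.optimize import linprog

def nash_equilibrium_row(P):
    """Min player's Nash strategy for the zero-sum game min_u max_v u'Pv.

    Standard reformulation: minimize the game value t subject to
    (P^T u)_j <= t for every column j, with u on the simplex.
    Variables are [u_1, ..., u_n, t].
    """
    n, m = P.shape
    c = np.zeros(n + 1)
    c[-1] = 1.0                                    # minimize t
    A_ub = np.hstack([P.T, -np.ones((m, 1))])      # P^T u - t <= 0
    b_ub = np.zeros(m)
    A_eq = np.concatenate([np.ones(n), [0.0]])[None, :]  # 1^T u = 1
    b_eq = [1.0]
    bounds = [(0, None)] * n + [(None, None)]      # u >= 0, t free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds)
    return res.x[:n], res.x[-1]                    # strategy, game value
```

For standard rock-paper-scissors, <math> P = \begin{bmatrix} 0 & -1 & 1 \\ 1 & 0 & -1 \\ -1 & 1 & 0 \end{bmatrix} </math>, this returns the uniform strategy with game value 0.<br />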
<br />
=== Quantal Response Equilibria ===<br />
<br />
However, the NE is poorly suited to this learning setting: NEs are overly strict, may fail to be unique, and are discontinuous with respect to P. Thus, it is more common to model the players' actions with a '''quantal response equilibrium''' (QRE), which addresses these issues. Specifically, consider the ''logit'' equilibrium for zero-sum games, which obeys the fixed point:<br />
$$<br />
u^*_i = \frac {\exp(-(Pv)_i)}{\sum_{q \in [n]} \exp (-(Pv)_q)}, \ v^*_j = \frac {\exp((P^T u)_j)}{\sum_{q \in [m]} \exp ((P^T u)_q)}. \qquad (1)<br />
$$<br />
For a fixed opponent strategy, each player's entropy-regularized best-response problem is strictly convex (for u) or strictly concave (for v), so the quantal best response is unique.<br />
<br />
=== End-to-End Learning ===<br />
<br />
To integrate the zero-sum solver into an end-to-end pipeline, [1] introduces a method to solve the QRE and to differentiate through its solution.<br />
<br />
'''QRE solver''':<br />
Finding the fixed point in (1) is equivalent to solving the regularized min-max game:<br />
$$<br />
\min_{u \in \mathbb{R}^n} \max_{v \in \mathbb{R}^m} \ u^T P v -H(v) + H(u) \\<br />
\text{subject to } 1^T u =1, \ 1^T v =1, <br />
$$<br />
where <math>H(y) = \sum_i y_i \log y_i</math> is the negative Gibbs entropy (so <math>+H(u)</math> regularizes the minimizing player and <math>-H(v)</math> the maximizing player).<br />
Entropy regularization guarantees non-negativity and makes the equilibrium continuous with respect to P: players are encouraged to play more randomly, and every action receives non-zero probability. Moreover, this problem has a unique saddle point corresponding to <math> (u^*, v^*) </math>.<br />
<br />
Using a primal-dual Newton Method to solve the QRE for two-player zero-sum games, the KKT conditions for the problem are:<br />
$$ <br />
Pv + \log(u) + 1 +\mu 1 = 0 \\<br />
P^T v -\log(v) -1 +\nu 1 = 0 \\<br />
1^T u = 1, \ 1^T v = 1, <br />
$$<br />
where <math> (\mu, \nu) </math> are Lagrange multipliers for the equality constraints on u, v respectively. Then applying Newton's method gives the the update rule:<br />
$$<br />
Q \begin{bmatrix} \Delta u \\ \Delta v \\ \Delta \mu \\ \Delta \nu \\ \end{bmatrix} = - \begin{bmatrix} P v + \log u + 1 + \mu 1 \\ P^T u - \log v - 1 + \nu 1 \\ 1^T u - 1 \\ 1^T v - 1 \\ \end{bmatrix}, \qquad (2)<br />
$$<br />
where Q is the Hessian of the Lagrangian, given by <br />
$$ <br />
Q = \begin{bmatrix} <br />
diag(\frac{1}{u}) & P & 1 & 0 \\ <br />
P^T & -diag(\frac{1}{v}) & 0 & 1\\<br />
1^T & 0 & 0 & 0 \\<br />
0 & 1^T & 0 & 0 \\<br />
\end{bmatrix}. <br />
$$<br />
<br />
'''Differentiating Through QRE Solutions''':<br />
The QRE solver provides a method to compute the necessary Jacobian-vector products. Specifically, we compute the gradient of the loss given the solution <math> (u^*,v^*) </math> to the QRE, and some loss function <math> L(u^*,v^*) </math>: <br />
<br />
1. Take differentials of the KKT conditions: <br />
<math><br />
Q \begin{bmatrix} <br />
du & dv & d\mu & d\nu \\ <br />
\end{bmatrix} ^T = \begin{bmatrix} <br />
-dPv & -dP^Tu & 0 & 0 \\ <br />
\end{bmatrix}^T. \ <br />
</math><br />
<br />
2. For small changes du, dv, <br />
<math><br />
dL = \begin{bmatrix} <br />
v^TdP^T & u^TdP & 0 & 0 \\ <br />
\end{bmatrix} Q^{-1} \begin{bmatrix} <br />
-\nabla_u L & -\nabla_v L & 0 & 0 \\ <br />
\end{bmatrix}^T.<br />
</math><br />
<br />
3. Apply this to P, and take limits as dP is small:<br />
<math><br />
\nabla_P L = y_u v^T + u y_v^T, \qquad (3)<br />
</math> where <br />
<math><br />
\begin{bmatrix} <br />
y_u & y_v & y_{\mu} & y_{\nu}\\ <br />
\end{bmatrix}=Q^{-1}\begin{bmatrix} <br />
-\nabla_u L & -\nabla_v L & 0 & 0 \\ <br />
\end{bmatrix}^T.<br />
</math><br />
<br />
Hence, the forward pass is given by using the expression in (2) to solve for the logit equilibrium given P, and the backward pass is given by using <math> \nabla_u L </math> and <math> \nabla_v L </math> to obtain <math> \nabla_P L </math> using (3). There does not always exist a unique P which generates <math> u^*, v^* </math> under the logit QRE, and we cannot expect to recover P when under-constrained.<br />
<br />
== Learning Extensive form games ==<br />
<br />
The normal form representation for games where players have many choices quickly becomes intractable. For example, consider a chess game: One the first turn, player 1 has 20 possible moves and then player 2 has 20 possible responses. If in the following number of turns each player is estimated to have ~30 possible moves and if a typical game is 40 moves per player, the total number of strategies is roughly <math>10^{120} </math> per player (this is known as the Shannon number for game-tree complexity of chess) and so the payoff matrix for a typical game of chess must therefore have <math> O(10^{240}) </math> entries.<br />
<br />
Instead, it is much more useful to represent the game graphically as an "'''Extensive form game'''" (EFG). We'll also need to consider types of games where there is '''imperfect information''' - players do not necessarily have access to the full state of the game. An example of this is one-card poker: (1) Each player draws a single card from a 13-card deck (ignore suits) (2) Player 1 decides whether to bet/hold (3) Player 2 decides whether to call/raise (4) Player 1 must either call/fold if Player 2 raised. From this description, player 1 has <math> 2^{13} </math> possible first moves (all combinations of (card, raise/hold)) and has <math> 2^{13} </math> possible second moves (whenever player 1 gets a second move) for a total of <math> 2^{26} </math> possible strategies. In addition, Player 1 never knows what cards player 2 has and vice versa. So instead of representing the game with a huge payoff matrix we can instead represent it as a simple decision tree (for a ''single'' drawn card of player 1):<br />
<br />
<br />
<center> [[File:1cardpoker.PNG]] </center><br />
<br />
where player 1 is represented by "1", a node that has two branches corresponding to the allowed moves of player 1. However there must also be a notion of information available to either player: While this tree might correspond to say, player 1 holding a "9", it contains no information on what card player 2 is holding (and is much simpler because of this). This leads to the definition of an '''information set''': the set of all nodes belonging to a single player for which the other player cannot distinguish which node has been reached. The information set may therefore be treated as a node itself, for which actions stemming from the node must be chosen in ignorance to what the other player did immediately before arriving at the node. In the poker example, the full game tree consists of a much more complex version of the tree shown above (containing repetitions of the given tree for every possible combination of cards dealt) and the and an example of an information set for player 1 is the set of all of nodes owned by player 2 that immediately follow player 1's decision to hold. In other words, if player 1 holds there are 13 possible nodes describing the responses of player 2 (raise/hold for player 2 having card = ace, 1, ... King) and all 13 of these nodes are indistinguishable to player 1, and so form an information set for player 1.<br />
<br />
The following is a review of important concepts for extensive form games first formalized in [2]. Let <math> \mathcal{I}_i </math> be the set of all information sets for player i, and for each <math> t \in \mathcal{I}_i </math> let <math> \sigma_t </math> be the actions taken by player i to arrive at <math> t </math> and <math> C_t </math> be the actions that player i can take from <math> u </math>. Then the set of all possible sequences that can be taken by player i is given by<br />
<br />
$$<br />
S_i = \{\emptyset \} \cup \{ \sigma_t c | u\in \mathcal{I}_i, c \in C_t \}<br />
$$<br />
<br />
So for the one-card poker we would have <math>S_1 = \{\emptyset, \text{raise}, \text{hold}, \text{hold-call}, \text{hold-fold\} }</math>. From the possible sequences follows two important concepts:<br />
<ol><br />
<li>The EFG '''payoff matrix''' <math> P </math> is size <math>|S_1| \times |S_2| </math> (this is all possible actions available to either player), is populated with rewards from each leaf of the tree (or "zero" for each <math> (s_1, s_2) </math> that is an invalid pair), and the expected payoff for realization plans <math> (u, v) </math> is given by <math> u^T P v </math> </li><br />
<li> A '''realization plan''' <math> u \in \mathbb{R}^{|S_1|} </math> for player 1 (<math> v \in \mathbb{R}^{|S_2|} </math> for player 2 ) will describe probabilities for players to carry out each possible sequence, and each realization plan must be constrained by (i) compatibility of sequences (e.g. "raise" is not compatible with "hold-call") and (ii) information sets available to the player. These constraints are linear:<br />
<br />
$$<br />
Eu = e \\<br />
Fv = f<br />
$$<br />
<br />
where <math> e = f = (1, 0, ..., 0)^T </math> and <math> E, F</math> contain entries in <math> {-1, 0, 1} </math> describing compatibility and information sets. </li><br />
<br />
</ol> <br />
<br />
<br />
The paper's main contribution is to develop a minmax problem for extensive form games:<br />
<br />
<br />
$$<br />
\min_u \max_v u^T P v + \sum_{t\in \mathcal{I}_1} \sum_{c \in C_t} u_c \log \frac{u_c}{u_{p_t}} - \sum_{t\in \mathcal{I}_2} \sum_{c \in C_t} v_c \log \frac{v_c}{v_{p_t}}<br />
$$<br />
<br />
where <math> p_t </math> is the action immediately preceding information set <math> t </math>. Intuitively, each sum resembles a cross entropy over the distribution of probabilities in the realization plan comparing each probability to proceed from an information set to the probability to arrive at that information set. Importantly, these entropies are strictly convex or concave (for player 1 and player 2 respectively) [3] so that the minmax problem will have a unique solution and ''the objective function is continuous and continuously differntiable'' - this means there is a way to optimize the function. As noted in Theorem 1 of [1], the solution to this problem is equivalently a solution for the QRE of the game in reduced normal form.<br />
<br />
Having decided on a cost function, the method of Lagrange multipliers my be used to construct the Lagrangian that encodes the known constraints (<math> Eu = e \,, Fv = f </math>, and <math> u, v \geq 0</math>), and then optimize the Lagrangian using Newton's method (identically to the previous section). Accounting for the constraints, the Lagrangian becomes <br />
<br />
<br />
$$<br />
\mathcal{L} = g(u, v) + \sum_i \mu_i(Eu - e)_i + \sum_i \nu_i (Fv - f)_i<br />
$$<br />
<br />
where <math>g</math> is the argument from the minmax statement above and <math>u, v \geq 0</math> become KKT conditions. The general update rule for Newton's method may be written in terms of the derivatives of <math> \mathcal{L} </math> with respect to primal variables <math>u, v </math> and dual variables <math> \mu, \nu</math>, yielding:<br />
<br />
$$<br />
\nabla_{u,v,\mu,\nu}^2 \mathcal{L} \cdot (\Delta u, \Delta v, \Delta \mu, \Delta \nu)^T= - \nabla_{u,v,\mu,\nu} \mathcal{L}<br />
$$<br />
where <math>\nabla_{u,v,\mu,\nu}^2 \mathcal{L} </math> is the Hessian of the Lagrangian and <math>\nabla_{u,v,\mu,\nu} \mathcal{L} </math> is simply a column vector of the KKT stationarity conditions. Combined with the previous section, this completes the goal of the paper: To construct a differentiable problem for learning normal form and extensive form games.<br />
<br />
== Experiments ==<br />
<br />
The authors demonstrated learning on extensive form games in the presence of ''side information'', with ''partial observations'' using three experiments. In all cases, the goal was to maximize the likelihood of realizing an observed sequence from the player, assuming they act in accordance to the QRE.<br />
<br />
=== Rock, Paper, Scissors ===<br />
<br />
The first experiment was to learn a non-symmetric variant of Rock, Paper, Scissors with ''incomplete information'' with the following payoff matrix:<br />
<br />
{| class="wikitable" style="float:center; margin-left:1em; text-align:center;"<br />
|+ align="bottom"|''Payoff matrix of modified Rock-Paper-Scissors''<br />
! <br />
! ''Rock''<br />
! ''Paper''<br />
! ''Scissors''<br />
|-<br />
! ''Rock''<br />
| '''''0'''''<br />
| <math>-b_1</math><br />
| <math>b_2</math><br />
|-<br />
! ''Paper''<br />
| <math>b_1</math><br />
| '''''0'''''<br />
| <math>-b_3</math><br />
|-<br />
! ''Scissors''<br />
| <math>-b_2</math><br />
| <math>b_3</math><br />
| '''''0'''''<br />
|}<br />
<br />
where each of the <math> b </math> ’s are linear function of some features <math> x \in \mathbb{R}^{2} </math> (i.e., <math> b_y = x^Tw_y </math>, <math> y \in </math> {<math>1,2,3</math>} , where <math> w_y </math> are to be learned by the algorithm). Using many trials of random rewards the technique produced the following results for optimal strategies[1]: <br />
<br />
[[File:RPS Results.png|500px ]]<br />
<br />
From the graphs above, we can tell 1) both parameters learned and predicted strategies improve with larger dataset; and 2) with a reasonably sized dataset, >1000 here, convergence is stable and is fairly quick.<br />
<br />
=== One-Card Poker ===<br />
<br />
Next they investigated extensive form games using the one-Card Poker (with ''imperfect information'') introduced in the previous section. In the experimental setup, they used a deck stacked non-uniformly (meaning repeat cards were allowed). Their goal was to learn this distribution of cards from observations of many rounds of the play. In this case, they needed to know the player’s perceived or believed distribution of cards may be different from the distribution of cards dealt. Three experiments were run with <math> n=4 </math>. Each experiment comprised 5 runs of training, with same weights but different training sets. Let <math> d \in \mathbb{R}^{n}, d \ge 0, \sum_{i} d_i = 1 </math> be the weights of the cards. The probability that the players are dealt cards <math> (i,j) </math> is <math> \frac{d_i d_j}{1-d_i} </math>. This distribution is asymmetric between players. Matrix <math> P, E, F </math> for the case <math> n=4 </math> are presented in [1]. With training for 2500 epochs, the mean squared error of learned parameters (card weights, <math> u, v </math> ) are averaged over all runs of and are presented as following [1]: <br />
<br />
<br />
[[File:One-card_Poker_Results.png|500px ]]<br />
<br />
=== Security Resource Allocation Game ===<br />
<br />
<br />
With the Security Resource Allocation Game, they demonstrated the ability to learn from ''imperfect observations''. The defender possesses <math> k </math> indistinguishable and indivisible defensive resources, which he splits among <math> n </math> targets {<math> T_1, \ldots, T_n </math>}. The attacker chooses one target. If the attack succeeds, the attacker receives reward <math> R_i </math> and the defender receives <math> -R_i </math>; otherwise both receive zero payoff. If <math> m </math> resources guard <math> T_i </math>, the probability of a successful attack on <math> T_i </math> is <math> \frac{1}{2^m} </math>. The expected payoff matrix when <math> n = 2, k = 3 </math>, with the attacker as the row player, is:<br />
<br />
{| class="wikitable" style="float:center; margin-left:1em; text-align:center;"<br />
|+ align="bottom"|''Payoff matrix when <math> n = 2, k = 3 </math>''<br />
! {#<math>D_1</math>,#<math>D_2</math>}<br />
! {0, 3}<br />
! {1, 2}<br />
! {2, 1}<br />
! {3, 0}<br />
|-<br />
! <math>T_1</math><br />
| <math>-R_1</math><br />
| <math>-\frac{1}{2}R_1</math><br />
| <math>-\frac{1}{4}R_1</math><br />
| <math>-\frac{1}{8}R_1</math><br />
|-<br />
! <math>T_2</math><br />
| <math>-\frac{1}{8}R_2</math><br />
| <math>-\frac{1}{4}R_2</math><br />
| <math>-\frac{1}{2}R_2</math><br />
| <math>-R_2</math><br />
|} <br />
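The entries of this table follow directly from the success probability <math> \frac{1}{2^m} </math> when <math> m </math> resources guard the attacked target; a short sketch with hypothetical rewards <math> R_1 = R_2 = 1 </math> reproduces it (the sign convention follows the table as given):<br />

```python
import numpy as np

n, k = 2, 3
R = np.array([1.0, 1.0])                     # hypothetical rewards R_1, R_2

# all ways to split k indivisible resources over n = 2 targets:
allocs = [(m, k - m) for m in range(k + 1)]  # {0,3}, {1,2}, {2,1}, {3,0}

# entry for target T_i under allocation a: -R_i / 2**a[i],
# i.e. the (negated) expected attacker gain shown in the table
M = np.array([[-R[i] / 2 ** a[i] for a in allocs] for i in range(n)])
```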
<br />
<br />
For a multi-stage game, the attacker can launch <math> t </math> attacks, one in each stage, while the defender must keep the stage-1 allocation throughout. The attacker may change targets if the attack in stage 1 fails. Three experiments were run with <math> n = 2, k = 5 </math> for games with a single attack and a double attack, i.e., <math> t = 1 </math> and <math> t = 2 </math>. The results of the simulated experiments are shown below [1]:<br />
<br />
[[File:Security Game Results.png|500px ]]<br />
<br />
<br />
They learned <math> R_i </math> based only on observations of the defender’s actions, i.e., they could recover the game setting from the defender’s behaviour alone. As expected, a larger dataset improves the learned parameters. The two outliers are 1) in the Security Game, the green plot for <math> t = 2 </math>; and 2) in RPS, the comparison between training sizes of 2000 and 5000.<br />
<br />
== Conclusion ==<br />
This paper presents an end-to-end framework for implementing a game solver, for both extensive and normal form, as a module in a deep neural network for zero-sum games. This method, unlike many previous works in this area, does not require the parameters of the game to be known to the agent prior to the start of the game. The two-part method analytically computes both the optimal solution and the parameters of the game. Future work involves taking advantage of the KKT matrix structure to increase computation speed, and extensions to the area of learning general-sum games.<br />
<br />
== References ==<br />
<br />
[1] Ling, C. K., Fang, F., & Kolter, J. Z. (2018). What game are we playing? end-to-end learning in normal and extensive form games. arXiv preprint arXiv:1805.02777.<br />
<br />
[2] B. von Stengel. Efficient computation of behavior strategies. Games and Economic Behavior, 14(0050):220–246, 1996.<br />
<br />
[3] Boyd, S., Boyd, S. P., & Vandenberghe, L. (2004). Convex optimization. Cambridge university press.</div>Gtompkinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:J46hou&diff=43792User:J46hou2020-11-11T14:38:49Z<p>Gtompkin: </p>
<hr />
<div>DROCC: Deep Robust One-Class Classification<br />
== Presented by == <br />
Jinjiang Lian, Yisheng Zhu, Jiawen Hou, Mingzhe Huang<br />
== Introduction ==<br />
In this work, we study “one-class” classification, where the goal is to obtain an accurate discriminator for a special class. Popular uses of this technique include anomaly detection, where we are interested in detecting outliers. Another use case is recognizing a “wake-word” for waking up AI systems such as Alexa. In this work, we present a new approach called Deep Robust One-Class Classification (DROCC). DROCC is based on the assumption that the points from the class of interest lie on a well-sampled, locally linear, low-dimensional manifold. More specifically, we also present DROCC-LF, an outlier-exposure-style extension of DROCC. This extension combines DROCC's anomaly detection loss with a standard classification loss over the negative data.<br />
<br />
== Previous Work ==<br />
The current state-of-the-art methodologies to tackle these kinds of problems are: <br />
1. Approaches based on predicting transformations (Golan & El-Yaniv, 2018; Hendrycks et al., 2019a) [1]. This approach has a shortcoming in that it heavily depends on an appropriate domain-specific set of transformations, which are hard to obtain in general. <br />
2. Approaches that minimize a classical one-class loss on the learned final-layer representations, such as DeepSVDD (Ruff et al., 2018) [2]. This method suffers from the fundamental drawback of representation collapse, where the model is no longer able to accurately distinguish the feature representations. <br />
== Motivation ==<br />
Anomaly detection is a well-studied problem with a large body of research (Aggarwal, 2016; Chandola et al., 2009) [3]. Classical approaches for anomaly detection are based on modeling the typical data using simple functions over the inputs (Schölkopf et al., 1999; Liu et al., 2008; Lakhina et al., 2004) [4], such as constructing a minimum-enclosing ball around the typical data points (Tax & Duin, 2004) [5]. While these techniques are well-suited when the input is featurized appropriately, they struggle on complex domains like vision and speech, where hand-designing features is difficult.<br />
DROCC is robust to representation collapse by involving a discriminative component that is general and is empirically accurate on most standard domains like tabular, time-series and vision without requiring any additional side information. DROCC is motivated by the key observation that generally, the typical data lies on a low-dimensional manifold, which is well-sampled in the training data. This is believed to be true even in complex domains such as vision, speech, and natural language (Pless & Souvenir, 2009). [6]<br />
== Model Explanation ==<br />
[[File:drocc_f1.jpg | center]]<br />
<div align="center">Figure 1</div><br />
(a) A normal data manifold with red dots representing generated anomalous points in Ni(r). <br />
<br />
(b) Decision boundary learned by DROCC when applied to the data from (a). Blue represents points classified as normal and red points are classified as abnormal. <br />
<br />
(c), (d): first two dimensions of the decision boundary of DROCC and DROCC–LF, when applied to noisy data (Section 5.2). DROCC–LF is nearly optimal while DROCC’s decision boundary is inaccurate. Yellow color sine wave depicts the train data.<br />
<br />
== DROCC ==<br />
The model is based on the assumption that the true data lies on a manifold. As manifolds resemble Euclidean space locally, the discriminative component is based on classifying a point as anomalous if it is outside the union of small L2-norm balls around the typical training points (see Figure 1a, 1b for an illustration). Importantly, this definition allows anomalous points to be generated synthetically, and the most effective anomalous points are generated adaptively during training via a gradient ascent phase reminiscent of adversarial training. In other words, DROCC has a gradient ascent phase to adaptively add anomalous points to the training set and a gradient descent phase to minimize the classification loss, learning a representation and a classifier on top of it to separate typical points from the generated anomalous points. In this way, DROCC automatically learns an appropriate representation (like DeepSVDD) but is robust to representation collapse, since mapping all points to the same value would lead to poor discrimination between normal points and the generated anomalous points.<br />
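A minimal numpy sketch of this ascent/descent loop on a toy problem (a 1-D manifold in 2-D, with a fixed quadratic feature map standing in for a learned deep representation; all constants are illustrative assumptions, not the paper's settings):<br />

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
phi = lambda P: np.c_[P, P ** 2, np.ones(len(P))]   # fixed quadratic features

# toy normal class: a well-sampled 1-D manifold (a line segment) in 2-D
X = np.c_[np.linspace(-1, 1, 200), np.zeros(200)]
w = rng.normal(scale=0.1, size=5)                   # linear classifier on phi
r, gamma, lr, lr_adv = 0.3, 2.0, 0.5, 0.2

def scores(P, w):
    return sigmoid(phi(P) @ w)                      # estimated P(point is normal)

for step in range(300):
    # gradient-ascent phase: perturb copies of the normal points, push them
    # toward high "normal" scores (hard negatives), and project them back
    # into the shell r <= ||x_adv - x|| <= gamma * r around each point
    Xa = X + rng.normal(scale=r, size=X.shape)
    for _ in range(5):
        p = scores(Xa, w)
        grad = (1 - p)[:, None] * (w[:2] + 2 * w[2:4] * Xa)  # d log p / dx
        Xa += lr_adv * grad
        delta = Xa - X
        nrm = np.linalg.norm(delta, axis=1, keepdims=True)
        Xa = X + delta * np.clip(nrm, r, gamma * r) / np.maximum(nrm, 1e-9)
    # gradient-descent phase: normal points labelled 1, generated points 0
    Z = np.vstack([X, Xa])
    y = np.r_[np.ones(len(X)), np.zeros(len(Xa))]
    p = scores(Z, w)
    w -= lr * phi(Z).T @ (p - y) / len(Z)

X_far = np.c_[np.linspace(-1, 1, 50), np.ones(50)]  # clearly off-manifold points
```

After training, points on the manifold should score higher than points well outside it.<br />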
== DROCC-LF ==<br />
To tackle problems such as anomaly detection with outlier exposure (Hendrycks et al., 2019a) [7], we propose DROCC–LF, an outlier-exposure-style extension of DROCC. Intuitively, DROCC–LF combines DROCC’s anomaly detection loss (which is over only the positive data points) with a standard classification loss over the negative data. In addition, DROCC–LF exploits the negative examples to learn a Mahalanobis distance for comparing points over the manifold instead of the standard Euclidean distance, which can be inaccurate for high-dimensional data with relatively few samples (see Figure 1c, 1d for an illustration).<br />
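The advantage of a Mahalanobis distance over the Euclidean one is easy to see on anisotropic data (a toy illustration using the empirical covariance, not the distance learned by DROCC–LF):<br />

```python
import numpy as np

rng = np.random.default_rng(0)
# data that varies much more along the first axis than the second
X = rng.normal(size=(2000, 2)) * np.array([5.0, 0.5])
cov_inv = np.linalg.inv(np.cov(X.T))

def mahalanobis(x, cov_inv):
    return float(np.sqrt(x @ cov_inv @ x))          # distance from the mean

p = np.array([4.0, 0.0])   # Euclidean-far, but along the natural spread
q = np.array([0.0, 1.5])   # Euclidean-near, but highly atypical
```

Here p is farther in Euclidean distance, yet q is the more anomalous point once the data's covariance is accounted for.<br />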
<br />
== Popular Dataset Benchmark Result ==<br />
<br />
[[File:drocc_auc.jpg | center]]<br />
<div align="center">Figure 2: AUC result</div><br />
<br />
The CIFAR-10 dataset consists of 60000 32x32 color images in 10 classes, with 6000 images per class; there are 50000 training images and 10000 test images. The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another; between them, the training batches contain exactly 5000 images from each class. The average AUC (with standard deviation) for one-vs-all anomaly detection on CIFAR-10 is shown in Table 1. DROCC outperforms the baselines on most classes, with gains as high as 20%, and notably, nearest neighbors (NN) beats all the baselines on 2 classes.<br />
<br />
[[File:drocc_f1score.jpg | center]]<br />
<div align="center">Figure 3: F1-Score</div><br />
<br />
Table 2 shows the F1-score (with standard deviation) for one-vs-all anomaly detection on the Thyroid, Arrhythmia, and Abalone datasets from the UCI Machine Learning Repository. DROCC outperforms the baselines on all three datasets by a minimum of 0.07, which is about an 11.5% performance increase. <br />
Results on One-class Classification with Limited Negatives (OCLN): <br />
[[File:ocln.jpg | center]]<br />
<div align="center">Figure 4: Sample positives, negatives, and close negatives for the MNIST digit 0 vs. 1 experiment (OCLN).</div><br />
MNIST 0 vs. 1 Classification: <br />
We consider an experimental setup on the MNIST dataset, where the training data consists of Digit 0, the normal class, and Digit 1 as the anomaly. During evaluation, in addition to samples from the training distribution, we also have half zeros, which act as challenging OOD points (close negatives). These half zeros are generated by randomly masking 50% of the pixels (Figure 2). BCE performs poorly, with a recall of only 54% at a fixed FPR of 3%. DROCC–OE gives a recall value of 98.16%, outperforming DeepSAD, which gives a recall value of 90.91%, by a margin of 7%. DROCC–LF provides further improvement with a recall of 99.4% at 3% FPR. <br />
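The “half zeros” can be produced with a simple random pixel mask, a sketch of the masking described above (array shapes and data here are stand-ins for MNIST images):<br />

```python
import numpy as np

rng = np.random.default_rng(0)

def make_close_negatives(images, frac=0.5):
    """Zero out a random `frac` of the pixels in each image, producing
    'half-digit' close-negative (OOD) examples."""
    flat = images.reshape(len(images), -1).copy()
    n_pix = flat.shape[1]
    for img in flat:
        idx = rng.choice(n_pix, size=int(frac * n_pix), replace=False)
        img[idx] = 0.0
    return flat.reshape(images.shape)

digits = rng.uniform(0.1, 1.0, size=(5, 28, 28))   # stand-ins for MNIST zeros
halves = make_close_negatives(digits)
```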
<br />
[[File:ocln_2.jpg | center]]<br />
<div align="center">Figure 5: OCLN on Audio Commands.</div><br />
Wake word Detection: <br />
Finally, we evaluate DROCC–LF on the practical problem of wake-word detection with low FPR against arbitrary OOD negatives. To this end, we identify a keyword, say “Marvin,” from the audio commands dataset (Warden, 2018) [8] as the positive class, and the remaining 34 keywords are labeled as the negative class. For training, we sample points uniformly at random from the above-mentioned dataset. For evaluation, however, we sample positives from the training distribution, while the negatives contain a few challenging OOD points as well. Sampling challenging negatives is itself a hard task and is the key motivating reason for studying the problem. So, we manually list keywords close to Marvin, such as Mar, Vin, and Marvelous, and then generate audio snippets for these keywords via a speech synthesis tool with a variety of accents.<br />
Figure 3 shows that for 3% and 5% FPR settings, DROCC–LF is significantly more accurate than the baselines. For example, with FPR=3%, DROCC–LF is 10% more accurate than the baselines. We repeated the same experiment with the keyword: Seven, and observed a similar trend. In summary, DROCC–LF is able to generalize well against negatives that are “close” to the true positives even when such negatives were not supplied with the training data.<br />
<br />
== Conclusion and Future Work ==<br />
We introduced the DROCC method for deep anomaly detection. It models normal data points using a low-dimensional manifold, and hence can compare close points via Euclidean distance. Based on this intuition, DROCC’s optimization is formulated as a saddle point problem, which is solved via a standard gradient descent-ascent algorithm. We then extended DROCC to the OCLN problem, where the goal is to generalize well against arbitrary negatives, assuming the positive class is well-sampled and a small number of negative points are also available. Both methods perform significantly better than strong baselines in their respective problem settings. <br />
<br />
For computational efficiency, we simplified the projection set for both the methods which can perhaps slow down the convergence of the two methods. Designing optimization algorithms that can work with the stricter set is an exciting research direction. Further, we would also like to rigorously analyze DROCC, assuming enough samples from a low-curvature manifold. Finally, as OCLN is an exciting problem that routinely comes up in a variety of real-world applications, we would like to apply DROCC–LF to a few high impact scenarios.<br />
<br />
The results of this study showed that DROCC is comparatively better for anomaly detection across many different areas, such as tabular data, images, audio, and time series, when compared to existing state-of-the-art techniques. <br />
<br />
== References ==<br />
[1]: Golan, I. and El-Yaniv, R. Deep anomaly detection using geometric transformations. In Advances in Neural Information Processing Systems (NeurIPS), 2018.<br />
<br />
[2]: Ruff, L., Vandermeulen, R., Goernitz, N., Deecke, L., Siddiqui, S. A., Binder, A., Müller, E., and Kloft, M. Deep one-class classification. In International Conference on Machine Learning (ICML), 2018.<br />
<br />
[3]: Aggarwal, C. C. Outlier Analysis. Springer Publishing Company, Incorporated, 2nd edition, 2016. ISBN 3319475770.<br />
<br />
[4]: Schölkopf, B., Williamson, R., Smola, A., Shawe-Taylor, J., and Platt, J. Support vector method for novelty detection. In Proceedings of the 12th International Conference on Neural Information Processing Systems, 1999.<br />
<br />
[5]: Tax, D. M. and Duin, R. P. Support vector data description. Machine Learning, 54(1), 2004.<br />
<br />
[6]: Pless, R. and Souvenir, R. A survey of manifold learning for images. IPSJ Transactions on Computer Vision and Applications, 1, 2009.<br />
<br />
[7]: Hendrycks, D., Mazeika, M., and Dietterich, T. Deep anomaly detection with outlier exposure. In International Conference on Learning Representations (ICLR), 2019a.<br />
<br />
[8]: Warden, P. Speech commands: A dataset for limited vocabulary speech recognition, 2018. URL https://arxiv.org/abs/1804.03209.</div>Gtompkinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Bsharman&diff=43707User:Bsharman2020-11-10T20:30:02Z<p>Gtompkin: </p>
<hr />
<div>'''Risk prediction in life insurance industry using supervised learning algorithms'''<br />
<br />
'''Presented By'''<br />
<br />
Bharat Sharman, Dylan Li, Leonie Lu, Mingdao Li<br />
<br />
'''Introduction'''<br />
<br />
----<br />
<br />
Risk assessment lies at the core of the life insurance industry. It is extremely important for a life insurance company to assess the risk of an application accurately, so that applications with genuinely low risk are accepted and those with genuinely high risk are rejected. Otherwise, individuals with unacceptably high risk profiles will be issued policies, and when they pass away the company will face large losses due to high insurance payouts. Such a situation is called ‘adverse selection’: individuals who are most likely to suffer losses take out insurance while those who are unlikely to suffer losses do not, and the company suffers losses as a result.<br />
<br />
Traditionally, the process of underwriting (deciding whether or not to insure the life of an individual) has been done using actuarial calculations. Actuaries group customers according to their estimated levels of risk determined from historical data (Cummins J, 2013). However, these conventional techniques are time-consuming, and it is not uncommon for policy issuance to take a month. They are also expensive, as many manual processes need to be executed. <br />
<br />
Predictive Analysis has emerged as a useful technique to streamline the underwriting process to reduce the time of Policy issuance and to improve the accuracy of risk prediction. In this paper, the authors use data from Prudential Life Insurance company and investigate the most appropriate data extraction method and the most appropriate algorithm to assess risk. <br />
<br />
'''Literature Review'''<br />
<br />
----<br />
<br />
<br />
Before a life insurance company issues a policy, it must execute a series of underwriting-related tasks (Mishr, 2016). These tasks involve gathering extensive information about the applicant. The insurer has to analyze the employment, medical, family, and insurance histories of the applicant and factor all of them into a complicated series of calculations to determine the risk rating of the applicant. On the basis of this risk rating, premiums are calculated (Prince, 2016).<br />
<br />
In a competitive marketplace, customers need policies to be issued quickly, and long wait times can lead them to switch to other providers (Chen, 2016). In addition, data gathering and analysis can be expensive: the insurance company bears the expenses of the medical examinations, and if a policy lapses, the insurer has to bear the losses of all these costs (J Carson, 2017). If the underwriting process uses predictive analytics, then the costs and time associated with many of these processes can be reduced via streamlining. <br />
<br />
'''Methods and Techniques'''<br />
<br />
----<br />
<br />
<br />
In Figure 1, the process flow of the analytics approach has been depicted. These stages will now be described in the following sections.<br />
<br />
[[File:Data_Analytics_Process_Flow.PNG]]<br />
<br />
'''Description of the Dataset'''<br />
<br />
----<br />
<br />
<br />
The data is obtained from the Kaggle competition hosted by the Prudential Life Insurance company. It has 59,381 applications with 128 attributes. The attributes include continuous, discrete, and categorical variables. <br />
The data attributes, their types, and their descriptions are shown in Table 1 below:<br />
<br />
[[File:Data Attributes Types and Description.png]]<br />
<br />
'''Data Pre-Processing'''<br />
<br />
----<br />
<br />
<br />
In the data preprocessing step, missing values in the data are either imputed or the corresponding entries are dropped, and some of the attributes are transformed into a different form to make subsequent processing of the data easier. This decision is made after determining the mechanism of missingness, that is, whether the data is Missing Completely at Random (MCAR), Missing at Random (MAR), or Missing Not at Random (MNAR). <br />
<br />
'''Dimensionality Reduction''' <br />
<br />
In this paper, there are two methods that have been used for dimensionality reduction – <br />
<br />
1. Correlation-based Feature Selection (CFS): This is a feature selection method in which a subset of the original features is selected. The algorithm selects features from the dataset that are highly correlated with the output but not correlated with each other. The user does not need to specify the number of features to be selected. The correlation values are calculated based on measures such as Pearson’s coefficient, minimum description length, symmetrical uncertainty, and relief. <br />
<br />
2. Principal Component Analysis (PCA): PCA is a feature extraction method that transforms the existing features into a new set of features such that the correlation between them is zero and the transformed features explain the maximum variability in the data. <br />
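A minimal numpy sketch of PCA as described here; the transformed features come out uncorrelated and ordered by explained variance (the data is synthetic and illustrative):<br />

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
X[:, 1] += 0.8 * X[:, 0]                 # inject correlation between features

Xc = X - X.mean(axis=0)                  # centre the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt.T                            # principal component scores
explained_var = S ** 2 / (len(X) - 1)    # variance explained by each component
```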
<br />
<br />
'''Supervised Learning Algorithms'''<br />
<br />
----<br />
<br />
<br />
The four Algorithms that have been used in this paper are the following:<br />
<br />
1. Multiple Linear Regression: In MLR, the relationship between the dependent variable and two or more independent variables is modeled by fitting a linear model. The model parameters are calculated by minimizing the sum of squares of the errors. The significance of the variables is determined by tests such as the F-test and p-values. <br />
<br />
2. REPTree: REPTree stands for reduced error pruning tree. It can build both classification and regression trees, depending on the type of the response variable. In this case, it uses regression tree logic and creates many trees across several iterations. The algorithm develops these trees based on the principles of information gain and variance reduction. When pruning, the algorithm uses the lowest mean squared error to select the best tree. <br />
<br />
3. Random Tree: A random tree considers only a random subset of the attributes at each node in the decision tree and is built on a random selection of the data as well as of the attributes (it is the building block of a random forest, though a single random tree is not itself an ensemble). Random Tree does not do pruning; instead, it estimates class probabilities based on a hold-out set.<br />
<br />
4. Artificial Neural Network: In a neural network, the inputs are transformed into outputs via a series of layered units, where each unit transforms its input via a function into an output that gets transmitted to units further down the line. The weights used to weigh the inputs are improved after each iteration via a method called backpropagation, in which errors propagate backward through the network and are used to update the weights, bringing the computed output closer to the actual output.<br />
<br />
'''Experiments and Results'''<br />
<br />
----<br />
<br />
<br />
'''Missing Data Mechanism'''<br />
<br />
Attributes where more than 30% of the data was missing were dropped from the analysis. The data was tested for Missing Completely at Random (MCAR), one form of missingness, using Little's test. The null hypothesis that the data was missing completely at random had a p-value of 0, so MCAR was rejected. Then all the variables were plotted to check how many missing values they had; the results are shown in the figure below:<br />
<br />
[[File:Missing Value Plot of Training Data.png]]<br />
<br />
The variables with the most missing values are plotted at the top, and those with the fewest are plotted at the bottom of the y-axis in the figure above. There does not seem to be a pattern to the missing values, and therefore they are assumed to be Missing at Random (MAR), meaning the tendency for a value to be missing is not related to the missing data itself, only to the observed data.<br />
<br />
'''Missing Data Imputation'''<br />
<br />
Assuming that missing data follows an MAR pattern, multiple imputation is used as a technique to fill in the values of missing data. The steps involved in Multiple Imputation are the following: <br />
<br />
Imputation: Imputation of the missing values is done over several steps and this results in a number of complete data sets. Imputation is done via a predictive model like linear regression to predict these missing values based on other variables in the data set.<br />
<br />
Analysis: The complete data sets that are formed are analyzed and parameter estimates and standard errors are calculated.<br />
<br />
Pooling: The analysis results are then integrated to form a final data set that is then used for further analysis.<br />
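A compact numpy sketch of these three steps on synthetic data (a single variable imputed by regression with added residual noise; five imputations pooled by averaging — the variable names and constants are illustrative assumptions):<br />

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=0.5, size=n)   # true slope = 2
y_obs = y.copy()
miss = rng.random(n) < 0.3                    # ~30% missing at random
y_obs[miss] = np.nan

m = 5                                          # number of imputed data sets
estimates = []
for _ in range(m):
    # Imputation: regress y on x using complete cases, then draw missing
    # values from the predictive distribution (fit plus residual noise)
    obs = ~np.isnan(y_obs)
    A = np.c_[x[obs], np.ones(obs.sum())]
    coef, res, *_ = np.linalg.lstsq(A, y_obs[obs], rcond=None)
    sigma = np.sqrt(res[0] / (obs.sum() - 2))
    y_imp = y_obs.copy()
    y_imp[miss] = coef[0] * x[miss] + coef[1] + rng.normal(scale=sigma, size=miss.sum())
    # Analysis: estimate the slope on the completed data set
    est, *_ = np.linalg.lstsq(np.c_[x, np.ones(n)], y_imp, rcond=None)
    estimates.append(est[0])

pooled_slope = np.mean(estimates)              # Pooling: average across imputations
```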
<br />
'''Comparison of Feature Selection and Feature Extraction'''<br />
<br />
The Correlation-based Feature Selection (CFS) method was performed using the Waikato Environment for Knowledge Analysis (WEKA). It was implemented using a BestFirst search method with the CfsSubsetEval attribute evaluator; 33 variables were selected out of the total of 117 features. <br />
PCA was implemented via a RankerSearch method using the Principal Components attribute evaluator. Out of the 117 features, those with a standard deviation of more than 0.5 times the standard deviation of the first principal component were retained, resulting in 20 features for further analysis. <br />
After dimensionality reduction, the reduced data set was exported and used for building prediction models with the four machine learning algorithms discussed above – REPTree, Multiple Linear Regression, Random Tree, and ANNs. The results are shown in the table below: <br />
<br />
[[File:Comparison of Results between CFS and PCA.png]]<br />
<br />
For CFS, the REPTree model had the lowest MAE and RMSE. For PCA, the Multiple Linear Regression model had the lowest MAE and RMSE. So, for this dataset, Multiple Linear Regression and REPTree are overall the two best models in terms of error rates. In terms of dimensionality reduction, CFS appears to be a better method than PCA for this data set, as the MAE and RMSE values are lower for all ML methods except ANNs.<br />
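The evaluation protocol used in the study (ten-fold cross-validation scored by MAE and RMSE) can be sketched as follows, here with an ordinary least-squares model on synthetic data standing in for the Prudential features:<br />

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=300)  # synthetic risk scores

def cv_mae_rmse(X, y, k=10):
    """k-fold cross-validated MAE and RMSE for a linear least-squares model."""
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)
    maes, rmses = [], []
    for i in range(k):
        test = folds[i]
        train = np.concatenate(folds[:i] + folds[i + 1:])
        A = np.c_[X[train], np.ones(len(train))]
        coef, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        pred = np.c_[X[test], np.ones(len(test))] @ coef
        err = pred - y[test]
        maes.append(np.abs(err).mean())
        rmses.append(np.sqrt((err ** 2).mean()))
    return np.mean(maes), np.mean(rmses)

mae, rmse = cv_mae_rmse(X, y)
```

Note that RMSE is never smaller than MAE on the same errors, which is a useful sanity check when reading result tables.<br />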
<br />
'''Conclusion and Further Work'''<br />
<br />
----<br />
<br />
<br />
Predictive analytics in the life insurance industry is enabling faster customer service and lower costs by helping automate the underwriting process. <br />
In this study, the authors analyzed data obtained from Prudential Life Insurance to predict risk scores via supervised machine learning algorithms. The data was first pre-processed to replace the missing values; attributes with more than 30% missing data were eliminated from the analysis. <br />
Two methods of dimensionality reduction – CFS and PCA – were used, reducing the number of attributes used for further analysis to 33 and 20, respectively. The machine learning algorithms implemented were REPTree, Random Tree, Multiple Linear Regression, and Artificial Neural Networks. Model validation was performed via ten-fold cross-validation, and the performance of the models was evaluated using the MAE and RMSE measures. <br />
Using the PCA method, Multiple Linear Regression showed the best results, with MAE and RMSE values of 1.64 and 2.06 respectively. With CFS, REPTree had the highest accuracy, with MAE and RMSE values of 1.52 and 2.02 respectively. <br />
Further work can be directed toward handling all the variables rather than deleting the ones with more than 30% missing values. Customer segmentation, i.e., grouping customers based on their profiles, can help companies come up with a customized policy for each group; this can be done via unsupervised algorithms like clustering. Work can also be done to make the models more explainable, especially when PCA and ANNs are used to analyze the data. Indirect data about the prospective applicant, such as driving behaviour or education record, could also be collected to see whether these attributes contribute to better risk profiling than the already available data.<br />
<br />
<br />
'''Critiques'''<br />
<br />
----<br />
The project built multiple models and used various methods to evaluate the results, so the authors could potentially ensemble the predictions, for example by averaging the results of the different models, to achieve better accuracy. Another method is model stacking, in which the output of one model is fed as input into another model. However, these approaches have some drawbacks: sometimes the result is affected negatively (i.e., the RMSE increases), and if the improvement is not substantial, the added complexity costs time and effort. In a research setting, stacking and ensembling are definitely worth a try; in a real-life business case, it is more of a trade-off between accuracy and effort/cost. <br />
<br />
<br />
'''References'''<br />
<br />
----<br />
<br />
<br />
Chen, T. (2016). Corporate reputation and financial performance of Life Insurers. Geneva Papers Risk Insur Issues Pract, 378-397.<br />
<br />
Cummins J, S. B. (2013). Risk classification in Life Insurance. Springer 1st Edition.<br />
<br />
J Carson, C. E. (2017). Sunk costs and screening: two-part tariffs in life insurance. SSRN Electron J, 1-26.<br />
<br />
Jayabalan, N. B. (2018). Risk prediction in life insurance industry using supervised learning algorithms. Complex & Intelligent Systems, 145-154.<br />
<br />
Mishr, K. (2016). Fundamentals of life insurance theories and applications. PHI Learning Pvt Ltd.<br />
<br />
Prince, A. (2016). Tantamount to fraud? Exploring non-disclosure of genetic information in life insurance applications as grounds for policy recession. Health Matrix, 255-307.</div>Gtompkinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Learning_for_Cardiologist-level_Myocardial_Infarction_Detection_in_Electrocardiograms&diff=43570Deep Learning for Cardiologist-level Myocardial Infarction Detection in Electrocardiograms2020-11-09T18:21:05Z<p>Gtompkin: /* Model Architecture */</p>
<hr />
<div><br />
== Presented by ==<br />
<br />
Zihui (Betty) Qin, Wenqi (Maggie) Zhao, Muyuan Yang, Amartya (Marty) Mukherjee<br />
<br />
== Introduction ==<br />
<br />
This paper presents an approach to detecting heart disease, the leading cause of death worldwide, from ECG signals by fine-tuning the deep learning neural network ConvNetQuake, in the area of scientific machine learning. A deep learning approach was used because such models can be trained using multiple GPUs and terabyte-sized datasets, which in turn produces a model that is robust against noise. The purpose of this paper is to provide detailed analyses of the contributions of the ECG leads to identifying heart disease, to show that using multiple channels in ConvNetQuake enhances prediction accuracy, and to show that feature engineering is not necessary for any of the training, validation, or testing processes. The benefits of translating knowledge between deep learning and its real-world applications in health are also illustrated.<br />
<br />
== Previous Work and Motivation ==<br />
<br />
The database used in previous works is the Physikalisch-Technische Bundesanstalt (PTB) database, which consists of ECG records. Previous papers used techniques such as CNNs, SVMs, K-nearest neighbours, naïve Bayes classification, and ANNs. From these instances, the paper observes several faults in the previous work. The first is that most papers apply feature selection to the raw ECG data before training the model. Dabanloo and Attarodi [30] used various techniques such as ANNs, K-nearest neighbours, and naïve Bayes, but extracted two features, the T-wave integral and the total integral, to aid in localizing and detecting heart disease. Sharma and Sunkaria [32] used SVM and K-nearest neighbours as their classifiers, extracting various features by using stationary wavelet transforms to decompose the ECG signal into sub-bands. The second issue is that papers that do not use feature selection pick ECG leads for classification without rationale. For example, Liu et al. [23] used a deep CNN that takes 3 seconds of ECG signal from lead II at a time as input; the decision to use lead II rather than the other leads was not explained. <br />
<br />
The issue with feature selection is that it can be time-consuming and impractical with large volumes of data. The second issue with the arbitrary selection of leads is that it does not offer insight into why the lead was chosen and the contributions of each lead in the identification of heart disease. Thus, this paper addresses these two issues through implementing a deep learning model that does not rely on feature selection of ECG data and to quantify the contributions of each ECG and Frank lead in identifying heart disease.<br />
<br />
== Model Architecture ==<br />
<br />
The dataset consists of 549 ECG records taken from 290 unique patients. Each ECG record has a mean length of over 100 seconds.<br />
<br />
This deep neural network was created by modifying the ConvNetQuake model to add 1D batch normalization layers.<br />
<br />
The input is a 10-second-long ECG signal. The model has 8 hidden layers, each consisting of a 1D convolution layer with the ReLU activation function followed by a batch normalization layer. The output layer is a one-dimensional layer that uses the sigmoid activation function.<br />
<br />
The model is trained with batches of size 10 and a learning rate of <math>10^{-4}</math> using the Adam optimizer. The dataset is split into training, validation, and test sets in an 80-10-10 ratio.<br />
<br />
During the training process, the model was retrained from scratch numerous times to verify that the random initialization of the weights does not introduce unintended variation into the model.<br />
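<br />
The layer stack described above can be sketched as follows. This is a minimal, illustrative plain-Python simplification (single channel, one length-3 filter per layer, random stand-in weights) of the conv + ReLU + batch-norm stack with a sigmoid output; the actual model is a multi-channel ConvNetQuake variant with learned weights.<br />

```python
import math
import random

def conv1d(x, w, b):
    """'Valid' 1D convolution of signal x with kernel w plus bias b."""
    k = len(w)
    return [sum(w[j] * x[i + j] for j in range(k)) + b
            for i in range(len(x) - k + 1)]

def batch_norm(x, eps=1e-5):
    """Normalize a sequence to zero mean and unit variance."""
    m = sum(x) / len(x)
    var = sum((u - m) ** 2 for u in x) / len(x)
    return [(u - m) / math.sqrt(var + eps) for u in x]

def relu(x):
    return [max(0.0, u) for u in x]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(signal, layers, out_w, out_b):
    """Eight hidden layers of conv -> ReLU -> batch norm, then a sigmoid output unit."""
    h = signal
    for w, b in layers:
        h = batch_norm(relu(conv1d(h, w, b)))
    z = sum(wi * hi for wi, hi in zip(out_w, h)) + out_b
    return sigmoid(z)

random.seed(0)
signal = [random.gauss(0.0, 1.0) for _ in range(100)]   # stand-in for a 10 s ECG trace
layers = [([random.gauss(0.0, 0.5) for _ in range(3)], 0.0) for _ in range(8)]
out_len = 100 - 8 * 2   # each length-3 'valid' convolution shortens the sequence by 2
out_w = [random.gauss(0.0, 0.1) for _ in range(out_len)]
prob = forward(signal, layers, out_w, 0.0)
print(prob)   # probability of disease, strictly between 0 and 1
```

A real implementation would train the filter and output weights by gradient descent; this sketch only illustrates the forward pass.<br />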
<br />
== Result == <br />
<br />
The paper first quantifies the accuracy of single channels using 20-fold cross-validation; the five channels with the highest individual accuracies are v5, v6, vx, vz, and ii. The researchers then investigated the accuracies of pairs drawn from these top five channels, again using 20-fold cross-validation, and concluded that the best pair to feed into the neural network is lead v6 with lead vz. They then ran 100-fold cross-validation on the v6 and vz pair of channels and compared the top 20, top 50, and all 100 models, finding that the standard deviation is non-trivial and that a few models performed very poorly. <br />
<br />
Next, they discussed two factors affecting the evaluation of model performance: 1) the random train-validation-test split can affect the measured performance of the model, which could be mitigated with access to a larger dataset; and 2) the random initialization of the network weights has little effect on performance, since a fixed train-validation-test split still yields consistently high average results. <br />
<br />
Compared with the models in 12 other papers, the model in this article has the highest accuracy, specificity, and precision. To address the concern that records from the same patient could inflate the training accuracy, they also used a 290-fold patient-wise split, and the v6 and vz pair again achieved the highest accuracy, just as in the record-wise split. Even though the patient-wise split yields a lower accuracy estimate, it still maintains a high average of 97.83%. <br />
<br />
== Discussion & Conclusion == <br />
<br />
The paper introduced a new architecture for heart condition classification based on raw ECG signals from multiple leads. It outperformed the state-of-the-art model by a margin of 1 percent. The study finds that out of the 15 ECG channels (12 conventional ECG leads and 3 Frank leads), channels v6, vz, and ii contain the most meaningful information for detecting myocardial infarction. It also shows that recent advances in machine learning can be leveraged to produce a model that classifies myocardial infarction with a cardiologist-level success rate. Further improving the models requires access to a larger labelled dataset: the PTB database is small, so it is difficult to test the true robustness of the model with a relatively small test set. If a larger dataset can be found that helps identify other heart conditions beyond myocardial infarction, the research group plans to share its deep learning models and develop an open-source, computationally efficient app that can be readily used by cardiologists.<br />
<br />
A detailed analysis of the relative importance of each of the standard 15 ECG channels indicates that deep learning can identify myocardial infarction by processing only ten seconds of raw ECG data from the v6, vz, and ii leads, reaching a cardiologist-level success rate. Deep learning algorithms may be readily used as commodity software: a neural network model originally designed to identify earthquakes can be redesigned and tuned to identify myocardial infarction. Feature engineering of the ECG data is not required to identify myocardial infarction in the PTB database; the model needs only ten seconds of raw ECG data to identify this heart condition with cardiologist-level performance. Access to larger databases should be provided to deep learning researchers so they can work on detecting other types of heart conditions. The deep learning and cardiology communities can work together to develop algorithms that provide trustworthy, real-time information about heart conditions with minimal computational resources.</div>
Neural Speed Reading via Skim-RNN (2020-11-09)
<hr />
<div>== Group ==<br />
<br />
Mingyan Dai, Jerry Huang, Daniel Jiang<br />
<br />
== Introduction ==<br />
<br />
In Natural Language Processing, recurrent neural networks (RNNs) are a common architecture used to sequentially 'read' input tokens and output a distributed representation for each token. Because an RNN recurrently updates its hidden state in the same way at every step, it incurs the same computational cost at every time step. However, some input tokens matter less than others to the overall representation of a piece of text or a query. In question answering in particular, the network will often encounter parts of a passage that are irrelevant to the query being asked. <br />
<br />
== Model ==<br />
<br />
In this paper, the authors introduce a model called 'skim-RNN', which 'skims' less important tokens or pieces of text rather than 'skipping' them entirely. This models the human ability to skim through passages, spending less time on parts that do not affect the reader's main objective. While this leads to some loss in comprehension of the text [1], it greatly reduces reading time by not focusing on areas that do not significantly affect the reader's objective.<br />
<br />
'Skim-RNN' works by rapidly determining the significance of each input token and spending less time on unimportant ones by using a smaller RNN to update only a fraction of the hidden state. When the decision is to 'fully read', that is, not to skim, Skim-RNN updates the entire hidden state with the default RNN cell. Since the hard decision function ('skim' or 'read') is non-differentiable, the authors use a Gumbel-softmax [2] to estimate the gradient of the function, rather than traditional methods such as REINFORCE (policy gradient) [3]. The switching mechanism between the two RNN cells enables Skim-RNN to reduce the total number of floating point operations (Flop reduction, or Flop-R). A high skimming rate often leads to faster inference on CPUs, which makes the model very useful for large-scale products and small devices.<br />
<br />
The Skim-RNN has the same input and output interfaces as standard RNNs, so it can be conveniently used to speed up RNNs in existing models. In addition, the speed of Skim-RNN can be dynamically controlled at inference time by adjusting the threshold for the 'skim' decision.<br />
<br />
=== Implementation ===<br />
<br />
A Skim-RNN consists of two RNN cells: a default (big) RNN cell with hidden state size <math>d</math> and a small RNN cell with hidden state size <math>d'</math>, where <math>d</math> and <math>d'</math> are parameters defined by the user and <math>d' \ll d</math>. This reflects the design that the small RNN cell handles text meant to be skimmed, while the larger one handles text that should be processed as normal.<br />
<br />
Each RNN cell has its own set of weights and biases and can be any RNN variant. There is no requirement on how the cells themselves are structured; the core idea is to let the model dynamically decide which cell to use when processing each input token. Note that skipping text can be incorporated by setting <math>d'</math> to 0, in which case no information from a token deemed irrelevant to the query or classification task is retained by the model.<br />
<br />
Experimental results suggest that this model is faster than using a single large RNN to process all input tokens, as the smaller RNN requires fewer floating point operations per token, while also achieving higher accuracy and computational efficiency. <br />
<br />
==== Inference ====<br />
<br />
At each time step <math>t</math>, the Skim-RNN unit takes in an input <math>{\bf x}_t \in \mathbb{R}^d</math> as well as the previous hidden state <math>{\bf h}_{t-1} \in \mathbb{R}^d</math> and outputs the new state <math>{\bf h}_t </math> (although the dimensions of the hidden state and input are the same, this process holds for different sizes as well). In the Skim-RNN, there is a hard decision that needs to be made whether to read or skim the input, although there could be potential to include options for multiple levels of skimming.<br />
<br />
The decision to read or skim is done using a multinomial random variable <math>Q_t</math> over the probability distribution of choices <math>{\bf p}_t</math>, where<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math>{\bf p}_t = \text{softmax}(\alpha({\bf x}_t, {\bf h}_{t-1})) = \text{softmax}({\bf W}[{\bf x}_t; {\bf h}_{t-1}]+{\bf b}) \in \mathbb{R}^k</math><br />
</div><br />
<br />
where <math>{\bf W} \in \mathbb{R}^{k \times 2d}</math>, <math>{\bf b} \in \mathbb{R}^{k}</math> are weights to be learned and <math>[{\bf x}_t; {\bf h}_{t-1}] \in \mathbb{R}^{2d}</math> indicates the row concatenation of the two vectors. In this case <math> \alpha </math> can have any form as long as the complexity of calculating it is less than <math> O(d^2)</math>. Letting <math>{\bf p}^1_t</math> indicate the probability for fully reading and <math>{\bf p}^2_t</math> indicate the probability for skimming the input at time <math> t</math>, it follows that the decision to read or skim can be modelled using a random variable <math> Q_t</math> by sampling from the distribution <math>{\bf p}_t</math> and<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math>Q_t \sim \text{Multinomial}({\bf p}_t)</math><br />
</div><br />
<br />
Without loss of generality, we can define <math> Q_t = 1</math> to indicate that the input will be read while <math> Q_t = 2</math> indicates that it will be skimmed. Reading requires applying the full RNN on the input as well as the previous hidden state to modify the entire hidden state, while skimming only modifies part of the prior hidden state.<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
{\bf h}_t = \begin{cases}<br />
f({\bf x}_t, {\bf h}_{t-1}) & Q_t = 1\\<br />
[f'({\bf x}_t, {\bf h}_{t-1});{\bf h}_{t-1}(d'+1:d)] & Q_t = 2<br />
\end{cases}<br />
</math><br />
</div><br />
<br />
where <math> f </math> is a full RNN with output of dimension <math>d</math> and <math>f'</math> is a smaller RNN with <math>d'</math>-dimensional output. This has the advantage that when the model decides to skim, the computational complexity of that step is only <math>O(d'd)</math>, which is much smaller than <math>O(d^2)</math> since <math> d' \ll d</math>.<br />
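<br />
A minimal sketch of this read/skim step in plain Python, assuming a simple tanh RNN cell for both <math>f</math> and <math>f'</math> (the paper allows any RNN variant) and a greedy decision at inference time; all weights here are illustrative stand-ins, not a trained model's.<br />

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def matvec(W, v):
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]

def tanh_rnn(W, x, h, out_dim):
    """A toy RNN cell: tanh(W [x; h]), producing an out_dim-sized state."""
    z = matvec(W, x + h)
    return [math.tanh(v) for v in z[:out_dim]]

def skim_rnn_step(x, h, Wp, bp, Wf, Ws, d, d_small):
    """One Skim-RNN step: decide to read (q=1) or skim (q=2), then update the state."""
    logits = [zi + bi for zi, bi in zip(matvec(Wp, x + h), bp)]
    p = softmax(logits)                 # p_t = softmax(W [x_t; h_{t-1}] + b)
    q = 1 if p[0] >= p[1] else 2        # greedy decision at inference time
    if q == 1:
        return tanh_rnn(Wf, x, h, d), q                       # read: replace whole state
    return tanh_rnn(Ws, x, h, d_small) + h[d_small:], q       # skim: update only d' entries

random.seed(1)
d, d_small = 6, 2
rand_mat = lambda rows, cols: [[random.gauss(0, 0.3) for _ in range(cols)] for _ in range(rows)]
x = [random.gauss(0, 1) for _ in range(d)]
h = [random.gauss(0, 1) for _ in range(d)]
Wp, bp = rand_mat(2, 2 * d), [0.0, 0.0]
Wf, Ws = rand_mat(d, 2 * d), rand_mat(d_small, 2 * d)
h_new, q = skim_rnn_step(x, h, Wp, bp, Wf, Ws, d, d_small)
print(q, h_new)
```

During training the decision is sampled from <math>{\bf p}_t</math> rather than taken greedily; the greedy rule here only illustrates inference.<br />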
<br />
==== Training ====<br />
<br />
Since the expected loss of the model is a random variable that depends on the sequence of random decisions <math>\{Q_t\}</math>, the loss is minimized with respect to the distribution of those variables. Let the loss conditioned on a particular sequence of decisions be<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
L(\theta\vert Q)<br />
</math><br />
</div><br />
where <math>Q=Q_1\dots Q_T</math> is a sequence of decisions of length <math>T</math>; then the expected loss over the distribution of decision sequences is<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
\mathbb{E}[L(\theta)] = \sum_{Q} L(\theta\vert Q)P(Q) = \sum_Q L(\theta\vert Q) \prod_j {\bf p}_j^{Q_j}<br />
</math><br />
</div><br />
<br />
Since calculating the gradient <math>\nabla_\theta \mathbb{E}[L(\theta)]</math> directly is infeasible, the gradients can be approximated with a Gumbel-softmax distribution [2]. Reparameterizing <math> {\bf p}_t</math> as <math> {\bf r}_t</math> allows back-propagation to flow to <math> {\bf p}_t</math> without being blocked by <math> Q_t</math>, and the approximation can be made arbitrarily close to <math> Q_t</math> by controlling the temperature. The reparameterized distribution is<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
{\bf r}_t^i = \frac{\exp((\log {\bf p}_t^i + {g_t}^i)/\tau)}{\sum_j \exp((\log {\bf p}_t^j + {g_t}^j)/\tau)}<br />
</math><br />
</div><br />
<br />
where <math>{g_t}^i</math> is an independent sample from a <math>\text{Gumbel}(0, 1) = -\log(-\log(\text{Uniform}(0, 1)))</math> distribution and <math>\tau</math> is a temperature parameter. The hidden state can then be rewritten as<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
{\bf h}_t = \sum_i {\bf r}_t^i {\bf \tilde{h}}_t^i<br />
</math><br />
</div><br />
<br />
where <math>{\bf \tilde{h}}_t^i</math> is the candidate hidden state for decision <math>i</math> in the earlier equation for <math>{\bf h}_t</math>. The temperature gradually decreases during training, and <math>{\bf r}_t</math> becomes more discrete (closer to one-hot) as <math>\tau</math> approaches 0.<br />
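<br />
A small plain-Python sketch of this Gumbel-softmax reparameterization (the probabilities and temperatures are illustrative):<br />

```python
import math
import random

def gumbel_softmax(p, tau):
    """Draw a relaxed (differentiable) sample r from categorical distribution p at temperature tau."""
    g = [-math.log(-math.log(random.random())) for _ in p]   # i.i.d. Gumbel(0, 1) noise
    z = [(math.log(pi) + gi) / tau for pi, gi in zip(p, g)]
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

random.seed(0)
p = [0.7, 0.3]                        # p_t: probabilities of 'read' and 'skim'
soft = gumbel_softmax(p, tau=1.0)     # smooth sample; gradients can flow through it
hard = gumbel_softmax(p, tau=0.05)    # typically close to one-hot as tau -> 0
print(soft, hard)
```

The sample is always a valid probability vector; lowering <math>\tau</math> pushes it toward a one-hot decision.<br />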
<br />
A final addition encourages the model to skim when possible: an extra term, the negative log probability of skimming averaged over the sequence length, is added to the loss. The final loss function for the model is therefore<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
L'(\theta) = L(\theta) + \gamma \cdot \frac{1}{T} \sum_t -\log({\bf \tilde{p}}_t^2)<br />
</math><br />
</div><br />
where <math> \gamma </math> is a parameter that controls the ratio between the main loss and the negative log probability of skimming.<br />
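<br />
A minimal plain-Python sketch of this combined loss, with illustrative values for the main loss and the per-step skim probabilities:<br />

```python
import math

def total_loss(main_loss, p_skim, gamma):
    """L'(theta) = L(theta) + gamma * (1/T) * sum_t -log p_t(skim)."""
    return main_loss + gamma * sum(-math.log(p) for p in p_skim) / len(p_skim)

# Illustrative values: a main loss of 0.9 and per-step skim probabilities.
loss = total_loss(0.9, [0.8, 0.6, 0.9, 0.5], gamma=0.02)
print(loss)
```

The penalty vanishes when every step skims with probability 1 and grows as skimming becomes less likely, so larger <math>\gamma</math> trades accuracy for speed.<br />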
<br />
== Experiment ==<br />
<br />
The effectiveness of Skim-RNN was measured in terms of accuracy and float operation reduction on four classification tasks and a question answering task. These tasks were chosen because they do not require full attention to every detail of the text, but rather ask for capturing high-level information (classification) or focusing on a specific portion (QA) of the text, which is a common context for speed reading. The tasks themselves are listed in the table below.<br />
<br />
[[File:Table1SkimRNN.png|center|1000px]]<br />
<br />
=== Classification Tasks ===<br />
<br />
In a language classification task, the input is a sequence of words and the output is a vector of categorical probabilities. Each word is embedded into a <math>d</math>-dimensional vector initialized with GloVe [4], and these representations are the inputs to a long short-term memory (LSTM) network. A linear transformation of the last hidden state of the LSTM followed by a softmax function yields the classification probabilities. Adam [5] was used for optimization, with an initial learning rate of 0.0001. For Skim-LSTM, <math>\tau = \max(0.5, \exp(-rn))</math> where <math>r = 10^{-4}</math> and <math>n</math> is the global training step, following [2]. Experiments varied the size of the big LSTM (<math>d \in \{100, 200\}</math>), the size of the small LSTM (<math>d' \in \{5, 10, 20\}</math>), and the ratio between the model loss and the skim loss (<math>\gamma\in \{0.01, 0.02\}</math>). The batch sizes were 32 for SST and Rotten Tomatoes and 128 for the others. For all models, early stopping was applied when the validation accuracy did not increase for 3000 global steps.<br />
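<br />
The temperature schedule stated above can be sketched directly; a minimal illustration in plain Python:<br />

```python
import math

def tau(n, r=1e-4):
    """Gumbel-softmax temperature at global training step n: max(0.5, exp(-r * n))."""
    return max(0.5, math.exp(-r * n))

print(tau(0), tau(5000), tau(100000))   # decays from 1.0 and is floored at 0.5
```

The schedule starts at 1.0, decays exponentially, and is floored at 0.5 so the relaxed samples never become fully discrete during training.<br />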
<br />
==== Results ====<br />
<br />
[[File:Table2SkimRNN.png|center|1000px]]<br />
<br />
[[File:Figure2SkimRNN.png|center|1000px]]<br />
<br />
Table 2 shows the accuracy and the computational cost of the Skim-RNN model compared with other standard models. It is evident that the Skim-RNN model produces a speed-up on the computational complexity of the task while maintaining a high degree of accuracy. Figure 2 meanwhile demonstrates the effect of varying the size of the small hidden state as well as the parameter <math>\gamma</math> on the accuracy and computational cost.<br />
<br />
[[File:Table3SkimRNN.png|center|1000px]]<br />
<br />
Table 3 shows an example of a classification task on the IMDb dataset, where Skim-RNN with <math>d = 200</math>, <math>d' = 10</math>, and <math>\gamma = 0.01</math> correctly classifies the review with a high skimming rate (92%). The goal is to classify the review as either positive or negative. Black words are skimmed, and blue words are fully read. The skimmed words are clearly irrelevant, and the model learns to carefully read only the important words, such as ‘liked’, ‘dreadful’, and ‘tiresome’.<br />
<br />
=== Question Answering Task ===<br />
<br />
In the Stanford Question Answering Dataset (SQuAD), the task is to locate the answer span for a given question in a context paragraph. The effectiveness of Skim-RNN on SQuAD was evaluated using two different models: LSTM+Attention and BiDAF [6]. The first model was inspired by the QA systems common at the time, consisting of multiple LSTM layers and an attention mechanism; it is complex enough to reach reasonable accuracy on the dataset, yet simple enough to run well-controlled analyses of Skim-RNN. The second model is an open-source model designed for SQuAD, used primarily to show that Skim-RNN can replace the RNN in existing complex systems.<br />
<br />
==== Training ==== <br />
<br />
Adam was used with an initial learning rate of 0.0005. For stable training, the model was pretrained with a standard LSTM for the first 5k steps, and then fine-tuned with Skim-LSTM.<br />
<br />
==== Results ====<br />
<br />
[[File:Table4SkimRNN.png|center|1000px]]<br />
<br />
Table 4 shows the accuracy (F1 and EM) of the LSTM+Attention and Skim-LSTM+Attention models as well as VCRNN [7]. It can be observed from the table that the skimming models achieve higher or similar accuracy compared to the non-skimming models while also reducing the computational cost by more than 1.4 times. In addition, decreasing the number of layers (to 1) or the hidden size (to <math>d=5</math>) improves the computational cost but significantly decreases accuracy compared to skimming. The table also shows that replacing LSTM with Skim-LSTM in an existing complex model (BiDAF) stably reduces computational cost without losing much accuracy (only a 0.2% drop, from 77.3% for BiDAF to 77.1% for Sk-BiDAF with <math>\gamma = 0.001</math>).<br />
<br />
One explanation given for this trend is that the model is more confident about which tokens are important at the second layer. In addition, higher <math>\gamma</math> values lead to higher skimming rates, which agrees with the intended functionality of the skim loss.<br />
<br />
Figure 4 shows the F1 score of the LSTM+Attention model using standard LSTM and Skim-LSTM, sorted in ascending order by Flop-R (computational cost). While models tend to perform better with larger computational cost, Skim-LSTM (red) outperforms standard LSTM (blue) at comparable computational cost. The performance of Skim-LSTM is also more stable across different configurations and computational costs. Moreover, increasing the value of <math>\gamma</math> for Skim-LSTM gradually increases the skimming rate and Flop-R, while also reducing accuracy.<br />
<br />
=== Runtime Benchmark ===<br />
<br />
[[File:Figure6SkimRNN.png|center|1000px]]<br />
<br />
The details of the runtime benchmarks for LSTM and Skim-LSTM, which are used to estimate the speed-up of Skim-LSTM-based models in the experiments, are also discussed. A CPU-based benchmark was taken as the default, since CPU runtime correlates directly with the number of float operations that can be performed per second. As mentioned previously, the speed-up results in Table 2 (as well as Figure 7) are benchmarked using Python (NumPy) rather than popular frameworks such as TensorFlow or PyTorch.<br />
<br />
Figure 7 shows the relative speed gain of Skim-LSTM compared to standard LSTM with varying hidden state size and skim rate. NumPy was used, with inference run on a single CPU thread. The ratio between the number of float operations of LSTM and Skim-LSTM (Flop-R) was also plotted; this ratio acts as a theoretical upper bound on the speed gain on CPUs. A gap can be observed between the actual and theoretical speed gains, with the gap growing with the overhead of the framework and the degree of parallelization. The gap also decreases as the hidden state size increases, because the overhead becomes negligible relative to very large matrix operations. This indicates that Skim-RNN provides greater benefits for RNNs with larger hidden state sizes.<br />
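As a rough back-of-the-envelope sketch (an assumption, treating an LSTM step as roughly <math>4h(h+i)</math> multiply-accumulates for hidden size <math>h</math> and input size <math>i</math>, and ignoring the skim-decision overhead), the theoretical Flop reduction can be estimated as:<br />
<br />
```python
def flop_reduction(d, d_small, skim_rate, input_dim):
    """Ratio of standard-LSTM flops to expected Skim-LSTM flops per step."""
    def lstm_flops(h):
        # Four gate matmuls over the concatenated [hidden, input] vector.
        return 4 * h * (h + input_dim)
    full = lstm_flops(d)
    expected = skim_rate * lstm_flops(d_small) + (1 - skim_rate) * full
    return full / expected
```
<br />
This mirrors the observation above: the estimated gain grows with the hidden state size <math>d</math> and with the skim rate.<br />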
<br />
== Results ==<br />
<br />
The results clearly indicate that the Skim-RNN model is well suited to general reading tasks, including classification and question answering. While the tables indicate that minor losses in accuracy occasionally resulted when parameters were set at specific values, these losses were small and acceptable given the improvement in runtime.<br />
<br />
An important advantage of Skim-RNN is that the skim rate (and thus the computational cost) can be dynamically controlled at inference time by adjusting the threshold on the ‘skim’ decision probability <math>{\bf p}^1_t</math>. Figure 5 shows the trade-off between accuracy and computational cost for two settings, confirming the importance of skimming (<math>d' > 0</math>) compared to skipping (<math>d' = 0</math>).<br />
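A minimal sketch of this inference-time control (hypothetical helper, not from the paper):<br />
<br />
```python
def skim_rate(skim_probs, threshold=0.5):
    """Fraction of tokens skimmed when token t is skimmed iff its
    skim-decision probability meets the tunable threshold."""
    decisions = [p >= threshold for p in skim_probs]
    return sum(decisions) / len(decisions)
```
<br />
Raising the threshold makes the model fully read more tokens (higher accuracy, higher cost); lowering it increases the skim rate.<br />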
<br />
Figure 6 shows that the model does not skim when the input seems to be relevant to answering the question, as expected from the design of the model. In addition, the LSTM in the second layer skims more than the one in the first layer, mainly because the second layer is more confident about the importance of each token.<br />
<br />
== Conclusion ==<br />
<br />
A Skim-RNN can offer better latency results on a CPU compared to a standard RNN on a GPU, with lower computational cost, as demonstrated through the results of this study. Future work (as stated by the authors) involves using Skim-RNN for applications that require much higher hidden state size, such as video understanding, and using multiple small RNN cells for varying degrees of skimming.<br />
<br />
== References ==<br />
<br />
[1] Marcel Adam Just and Patricia A. Carpenter. The Psychology of Reading and Language Comprehension. 1987.<br />
<br />
[2] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. In ICLR, 2017.<br />
<br />
[3] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.<br />
<br />
[4] Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In EMNLP, 2014.<br />
<br />
[5] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.<br />
<br />
[6] Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention flow for machine comprehension. In ICLR, 2017a.<br />
<br />
[7] Yacine Jernite, Edouard Grave, Armand Joulin, and Tomas Mikolov. Variable computation in recurrent neural networks. In ICLR, 2017.<br />
<br />
<br />
== Critiques ==<br />
<br />
1. Skim-RNN uses a smaller RNN to process words that are not important, and thus can increase speed only in fairly particular circumstances (i.e., small networks). The extra model complexity slows the model down while "optimizing" efficiency, and sacrifices some accuracy in the process. The approach targets only specific situations (classification/question answering) and is compared only against the baseline LSTM model. It would be more persuasive if the model were compared with some state-of-the-art neural network models.</div>
<hr />
<div>== Presented by == <br />
Grace Tompkins, Tatiana Krikella, Swaleh Hussain<br />
<br />
== Introduction ==<br />
<br />
One of the fundamental challenges in machine learning and data science is dealing with missing and incomplete data. This paper proposes a theoretically justified methodology for using incomplete data in neural networks, eliminating the need for direct completion of the data by imputation or other methods commonly used in the existing literature. The authors propose representing missing data points with a parametric density, which is then trained together with the rest of the network's parameters. To process this probabilistic representation, the response of each neuron in the first hidden layer is generalized by taking its expected value; this essentially computes the average activation of the neuron over imputations drawn from the missing data's density. The proposed approach is advantageous because it can train neural networks using incomplete observations, which are ubiquitous in practice, and it requires minimal modifications to existing architectures. Theoretical results of this study show that this process does not lead to a loss of information, while experimental results demonstrate its practical use on several different types of networks.<br />
<br />
== Related Work ==<br />
<br />
Currently, dealing with incomplete inputs in machine learning typically requires filling in absent attributes based on complete, observed data. Two commonly used methods are mean imputation and <math>k</math>-NN imputation. Other methods for dealing with missing data involve training separate neural networks, extreme learning machines, and <math>k</math>-nearest neighbours. Probabilistic models of incomplete data can also be built depending on the mechanism of missingness (i.e. whether the data is Missing At Random (MAR), Missing Completely At Random (MCAR), or Missing Not At Random (MNAR)), and then fed into a particular learning model. Previous work using neural networks for missing data includes a paper by Bengio and Gingras [1], where the authors used recurrent neural networks with feedback into the input units to fill absent attributes solely to minimize the learning criterion. Goodfellow et al. [2] also used neural networks, introducing a multi-prediction deep Boltzmann machine which can perform classification on data with missing inputs.<br />
<br />
== Layer for Processing Missing Data ==<br />
<br />
In this approach, the adaptation of a given neural network to incomplete data relies on two steps: the estimation of the missing data and the generalization of the neuron's activation. <br />
<br />
Let <math>(x,J)</math> represent a missing data point, where <math>x \in \mathbb{R}^D </math>, and <math>J \subset {1,...,D} </math> is a set of attributes with missing data.<br />
<br />
For each missing point <math>(x,J)</math>, define an affine subspace consisting of all points which coincide with <math>x</math> on the known coordinates <math>J'=\{1,\dots,D\} \setminus J</math>: <br />
<br />
<center><math>S=\text{Aff}[x,J]=x+\text{span}(e_J) </math></center> <br />
where <math>e_J=[e_j]_{j\in J}</math> and <math>e_j</math> is the <math> j^{th}</math> canonical vector in <math>\mathbb{R}^D </math>.<br />
<br />
Assume that the missing data points come from the D-dimensional probability distribution, <math>F</math>. In their approach, the authors assume that the data points follow a mixture of Gaussians (GMM) with diagonal covariance matrices. By choosing diagonal covariance matrices, the number of model parameters is reduced. To model the missing points <math>(x,J)</math>, the density <math>F</math> is restricted to the affine subspace <math>S</math>. Thus, possible values of <math>(x,J)</math> are modelled using the conditional density <math>F_S: S \to \mathbb{R} </math>, <br />
<br />
<center><math>F_S(x) = \begin{cases}<br />
\frac{1}{\int_{S} F(s) \,ds}F(x) & \text{if $x \in S$,} \\<br />
0 & \text{otherwise.}<br />
\end{cases} </math></center><br />
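For a GMM with diagonal covariances, restricting <math>F</math> to <math>S</math> has a convenient form: each component keeps its mean and variance on the missing coordinates, while the mixture weights are reweighted by each component's likelihood on the observed coordinates. A minimal sketch of that reweighting (illustrative only; the function and variable names are assumptions):<br />
<br />
```python
import numpy as np

def conditional_weights(x, missing, weights, means, variances):
    """Mixture weights of F restricted to S: reweight each diagonal-Gaussian
    component by its likelihood on the observed coordinates of x."""
    obs = ~missing
    log_w = np.log(weights).astype(float)
    for i in range(len(weights)):
        d = x[obs] - means[i][obs]
        v = variances[i][obs]
        log_w[i] += -0.5 * np.sum(d * d / v + np.log(2 * np.pi * v))
    w = np.exp(log_w - log_w.max())  # stabilize before normalizing
    return w / w.sum()
```
<br />
Components whose observed coordinates agree with <math>x</math> dominate the conditional mixture; the value stored at a missing coordinate is ignored.<br />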
<br />
To process the missing data by a neural network, the authors propose that only the first hidden layer needs modification. Specifically, they generalize the activation functions of all the neurons in the first hidden layer of the network to process the probability density functions representing the missing data points. For the conditional density function <math>F_S</math>, the authors define the generalized activation of a neuron <math>n: \mathbb{R}^D \to \mathbb{R}</math> on <math>F_S </math> as: <br />
<br />
<center><math>n(F_S)=E[n(x)|x \sim F_S]=\int n(x)F_S(x) \,dx</math>,</center> <br />
provided that the expectation exists. <br />
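In other words, <math>n(F_S)</math> is the average activation over imputations drawn from <math>F_S</math>, which can be approximated by Monte Carlo (a sketch; the closed forms in the following theorems avoid sampling entirely):<br />
<br />
```python
import numpy as np

def generalized_activation(neuron, sample_fs, n_samples=10000, seed=0):
    """Monte Carlo estimate of n(F_S) = E[n(x) | x ~ F_S]:
    average the neuron's activation over sampled imputations."""
    rng = np.random.default_rng(seed)
    xs = sample_fs(rng, n_samples)            # (n_samples, D) imputations
    return float(np.mean([neuron(x) for x in xs]))
```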
<br />
The following two theorems describe how to apply the above generalizations to both the ReLU and the RBF neurons, respectively. <br />
<br />
'''Theorem 3.1''' Let <math>F = \sum_i{p_iN(m_i, \Sigma_i)}</math> be the mixture of (possibly degenerate) Gaussians. Given weights <math>w=(w_1, ..., w_D) \in \mathbb{R}^D,</math><math> b \in \mathbb{R} </math>, we have<br />
<br />
<center><math>\text{ReLU}_{w,b}(F)=\sum_i{p_i\sqrt{w^{\top}\Sigma_iw}\,NR\big(\frac{w^{\top}m_i+b}{\sqrt{w^{\top}\Sigma_iw}}\big)}</math></center> <br />
<br />
where <math>NR(x)=\text{ReLU}[N(x,1)]</math> and <math>\text{ReLU}_{w,b}(x)=\max(w^{\top}x+b, 0)</math>, with <math>w \in \mathbb{R}^D </math> the weight vector and <math> b \in \mathbb{R}</math> the bias.<br />
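The one-dimensional fact behind this theorem is that for <math>Y \sim N(\mu, \sigma^2)</math>, <math>E[\max(Y,0)] = \sigma\, NR(\mu/\sigma)</math>, where <math>NR(z) = z\Phi(z) + \phi(z)</math>. A quick Monte-Carlo sanity check of that identity for a single Gaussian (illustrative code, not from the paper; the parameter values are arbitrary):<br />
<br />
```python
import numpy as np
from math import erf, exp, pi, sqrt

def NR(z):
    """NR(z) = E[ReLU(Z)] for Z ~ N(z, 1), i.e. z*Phi(z) + phi(z)."""
    Phi = 0.5 * (1.0 + erf(z / sqrt(2.0)))
    phi = exp(-z * z / 2.0) / sqrt(2.0 * pi)
    return z * Phi + phi

# E[max(w.x + b, 0)] for x ~ N(m, Sigma): closed form vs Monte Carlo.
w, b = np.array([1.0, -2.0]), 0.5
m, Sigma = np.array([0.3, 0.1]), np.diag([0.5, 0.2])
mu = float(w @ m) + b
sigma = sqrt(float(w @ Sigma @ w))
closed = sigma * NR(mu / sigma)

rng = np.random.default_rng(0)
xs = rng.multivariate_normal(m, Sigma, size=200_000)
mc = float(np.maximum(xs @ w + b, 0.0).mean())
```
<br />
For a mixture, the same computation is applied per component and averaged with the mixture weights <math>p_i</math>.<br />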
<br />
'''Theorem 3.2''' Let <math>F = \sum_i{p_iN(m_i, \Sigma_i)}</math> be the mixture of (possibly degenerate) Gaussians and let the RBF unit be parametrized by <math>N(c, \Gamma) </math>. We have: <br />
<br />
<center><math>\text{RBF}_{c, \Gamma}(F) = \sum_{i=1}^k{p_iN(m_i-c, \Gamma+\Sigma_i)}(0)</math>.</center> <br />
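Theorem 3.2 is, at heart, the Gaussian convolution identity: averaging a Gaussian RBF response over a Gaussian input equals a Gaussian density evaluated at zero. A one-dimensional sanity check (illustrative, with assumed parameter values):<br />
<br />
```python
import numpy as np

def gauss_pdf(x, mean, var):
    return np.exp(-(x - mean) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

c, Gamma = 0.0, 1.0      # RBF unit parametrized by N(c, Gamma)
m, Sigma = 1.0, 0.5      # single Gaussian component N(m, Sigma), p_1 = 1
closed = gauss_pdf(0.0, m - c, Gamma + Sigma)   # N(m - c, Gamma + Sigma)(0)

# Monte Carlo: average the RBF response over imputations x ~ N(m, Sigma)
rng = np.random.default_rng(0)
xs = rng.normal(m, np.sqrt(Sigma), size=200_000)
mc = float(gauss_pdf(xs, c, Gamma).mean())
```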
<br />
In the case where the data set contains no missing values, the generalized neurons reduce to classical ones, since the distribution <math>F</math> is only used to estimate possible values at missing attributes. However, if one wishes to use an incomplete data set in the testing stage, then an incomplete data set must be used to train the model.<br />
<br />
<br />
== Theoretical Analysis ==<br />
<br />
The main theoretical results, which are summarized below, show that using generalized neuron's activation at the first layer does not lead to the loss of information. <br />
<br />
The generalized response of a neuron <math>n: \mathbb{R}^D \rightarrow \mathbb{R}</math> evaluated on a probability measure <math>\mu</math> is given by <br />
<center><math>n(\mu) := \int n(x)d\mu(x).</math></center><br />
<br />
Theorem 4.1 shows that a neural network with generalized ReLU units is able to distinguish any two probability measures. The proof presented by the authors uses the Universal Approximation Property (UAP), and is summarized as follows. <br />
<br />
<br />
'''Theorem 4.1.''' Let <math>\mu</math>, <math>\nu</math> be probability measures satisfying <math>\int ||x|| d \mu(x) < \infty</math>. If <br />
<center><math>ReLU_{w,b}(\mu) = ReLU_{w,b}(\nu) \text{ for } w \in \mathbb{R}^D, b \in \mathbb{R}</math></center> then <math>\nu = \mu.</math><br />
<br />
''Sketch of Proof.'' Let <math>w \in \mathbb{R}^D</math> be fixed and define the set <center><math>F_w = \{p: \mathbb{R} \rightarrow \mathbb{R}: \int p(w^Tx)d\mu(x) = \int p(w^Tx)d\nu(x)\}.</math></center> The first step of the proof involves showing that <math>F_w</math> contains all continuous and bounded functions. The authors do this by showing that any piecewise continuous function <math>Q</math> that is affine linear on specific intervals is in the set <math>F_w</math>; since such a <math>Q</math> can be re-written as a sum of tent-like piecewise linear functions <math>T</math>, it is sufficient to show that <math>T \in F_w</math>. <br />
<br />
Next, the authors show that an arbitrary bounded continuous function <math>G</math> is in <math>F_w</math> by the Lebesgue dominated convergence theorem. <br />
<br />
Then, as <math>\cos(\cdot), \sin(\cdot) \in F_w</math>, the function <center><math>\exp(ir) = \cos(r) + i\sin(r) \in F_w</math></center> and we have the equality <center><math>\int \exp(iw^Tx)d\mu(x) = \int \exp(iw^Tx)d\nu(x).</math></center> Since <math>w</math> was arbitrarily chosen, we can conclude that <math>\mu = \nu</math>, <br />
as the characteristic functions of the two measures coincide. <br />
<br />
<br />
More general results can be obtained by making stronger assumptions on the probability measures; for example, if a given family of neurons satisfies the UAP, then their generalization can identify any probability measure with compact support.<br />
<br />
== Experimental Results ==<br />
The model was applied on three types of algorithms: an Autoencoder (AE), a multilayer perceptron and a radial basis function network.<br />
<br />
'''Autoencoder'''<br />
<br />
Corrupted images were restored as a part of this experiment. Grayscale handwritten digits were obtained from the MNIST database. A 13-by-13 square (169 pixels) was removed from each 28-by-28 image (784 pixels), with the location of the square uniformly sampled for each image. The autoencoder used 5 hidden layers; the first layer used ReLU activation functions, while the subsequent layers used sigmoids. The loss function was computed using pixels from outside the mask. <br />
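The corruption step described above can be sketched as follows (illustrative helper, not the authors' code):<br />
<br />
```python
import numpy as np

def mask_square(img, size=13, rng=None):
    """Remove a size x size square at a uniformly sampled location.
    Returns the corrupted image and a boolean mask of kept pixels."""
    rng = rng or np.random.default_rng()
    h, w = img.shape
    r = int(rng.integers(0, h - size + 1))
    c = int(rng.integers(0, w - size + 1))
    keep = np.ones_like(img, dtype=bool)
    keep[r:r + size, c:c + size] = False
    out = img.copy()
    out[~keep] = 0.0
    return out, keep
```
<br />
The reconstruction loss is then computed only over pixels where the mask is True (outside the removed square).<br />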
<br />
Popular imputation techniques were compared against the conducted experiment:<br />
<br />
''k-nn:'' Replaced missing features with the mean of respective features calculated using K nearest training samples. Here, K=5. <br />
<br />
''mean:'' Replaced missing features with the mean of respective features calculated using all incomplete training samples.<br />
<br />
''dropout:'' Dropped input neurons with missing values. <br />
<br />
Moreover, a context encoder (CE) was trained by replacing missing features with their means; unlike mean imputation, the complete data was used in its training phase. The method under study performed better than the imputation methods both inside and outside the mask, and also outperformed the CE on the whole area as well as the area outside the mask. <br />
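For reference, the mean-imputation baseline can be sketched in a few lines (a sketch, not the paper's implementation), with missing entries encoded as NaN:<br />
<br />
```python
import numpy as np

def mean_impute(X):
    """Replace NaNs with per-feature means computed over observed entries."""
    X = np.array(X, dtype=float)
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X
```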
<br />
'''Multilayer Perceptron'''<br />
<br />
<br />
<br />
<br />
<br />
'''Radial Basis Function Network'''<br />
<br />
== Conclusion ==<br />
<br />
The results with these experiments along with the theoretical results conclude that this novel approach for dealing with missing data through a modification of a neural network is beneficial and outperforms many existing methods. This approach, which utilizes representing missing data with a probability density function, allows a neural network to determine a more generalized and accurate response of the neuron.<br />
<br />
== Critiques ==<br />
<br />
A simulation study where the mechanism of missingness was known would have been interesting to examine. Doing this would allow us to see when the proposed method is better than existing methods, and under what conditions.<br />
<br />
== References ==<br />
[1] Yoshua Bengio and Francois Gingras. Recurrent neural networks for missing or asynchronous<br />
data. In Advances in neural information processing systems, pages 395–401, 1996.<br />
<br />
[2] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.</div>
<hr />
<div>== Presented by == <br />
Grace Tompkins, Tatiana Krikella, Swaleh Hussain<br />
<br />
== Introduction ==<br />
<br />
One of the fundamental challenges in machine learning in data science is dealing with missing and incomplete data. This paper proposes theoretically justified methodology for using incomplete data in neural networks, eliminating the need for direct completion of the data by imputation or other commonly used methods in existing literature. The authors propose identifying missing data points with a parametric density and then training it together with the rest of the network's parameters. The neuron's response at the first hidden layer is generalized by taking its expected value to process this probabilistic representation. This process is essentially calculating the average activation of the neuron over imputations drawn from the missing data's density. The proposed approach is advantageous as it has the ability to train neural networks using incomplete observations from datasets, which are ubiquitous in practice. This approach also requires minimal adjustments and modifications to existing architectures. Theoretical results of this study show that this process does not lead to a loss of information, while experimental results showed the practical uses of this methodology on several different types of networks.<br />
<br />
== Related Work ==<br />
<br />
Currently, dealing with incomplete inputs in machine learning requires filling absent attributes based on complete, observed data. Two commonly used methods are mean imputation and k-NN imputation. Other methods for dealing with missing data involve training separate neural networks, extreme learning machines, and <math>k</math>-nearest neighbours. Probabilistic models of incomplete data can also be built depending on the mechanism missingness (i.e. whether the data is Missing At Random (MAR), Missing Completely At Random (MCAR), or Missing Not At Random (MNAR)), which can be fed into a particular learning model. Previous work using neural networks for missing data includes a paper by Bengio and Gringras [1] where the authors used recurrent neural networks with feedback into the input units to fill absent attributes solely to minimize the learning criterion. Goodfellow et. al. [2] also used neural networks by introducing a multi-prediction deep Boltzmann machine which could perform classification on data with missingness in the inputs.<br />
<br />
== Layer for Processing Missing Data ==<br />
<br />
In this approach, the adaptation of a given neural network to incomplete data relies on two steps: the estimation of the missing data and the generalization of the neuron's activation. <br />
<br />
Let <math>(x,J)</math> represent a missing data point, where <math>x \in \mathbb{R}^D </math> and <math>J \subset \{1,\ldots,D\} </math> is the set of attributes with missing values.<br />
<br />
For each missing point <math>(x,J)</math>, define the affine subspace consisting of all points which coincide with <math>x</math> on the known coordinates <math>J'=\{1,\ldots,D\} \setminus J</math>: <br />
<br />
<center><math>S=\text{Aff}[x,J]=x+\text{span}(e_J) </math></center> <br />
where <math>e_J=[e_j]_{j\in J}</math> and <math>e_j</math> is the <math> j^{th}</math> canonical vector in <math>\mathbb{R}^D </math>.<br />
<br />
Assume that the missing data points come from a <math>D</math>-dimensional probability distribution with density <math>F</math>. In their approach, the authors assume that the data points follow a Gaussian mixture model (GMM) with diagonal covariance matrices; choosing diagonal covariance matrices reduces the number of model parameters. To model a missing point <math>(x,J)</math>, the density <math>F</math> is restricted to the affine subspace <math>S</math>. Thus, possible values of <math>(x,J)</math> are modelled using the conditional density <math>F_S: S \to \mathbb{R} </math>, <br />
<br />
<center><math>F_S(x) = \begin{cases}<br />
\frac{1}{\int_{S} F(s) \,ds}F(x) & \text{if $x \in S$,} \\<br />
0 & \text{otherwise.}<br />
\end{cases} </math></center><br />
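Because each diagonal-covariance Gaussian component factorizes over coordinates, <math>F_S</math> has a convenient closed form: it is again a diagonal GMM over the missing coordinates, with each component's mixing weight reweighted by that component's likelihood of the observed attributes. A minimal NumPy sketch of this computation (the helper name and toy numbers are our own, not the authors' code):<br />

```python
import numpy as np

def restrict_gmm(p, means, variances, x, missing):
    """Parameters of F_S for a diagonal-covariance GMM.

    p: (k,) mixing weights; means, variances: (k, D) per-component
    means and diagonal variances; x: (D,) data point (values at the
    missing coordinates are ignored); missing: list of indices J.
    Returns reweighted mixing weights and the component means/variances
    over the missing coordinates.
    """
    D = means.shape[1]
    obs = [j for j in range(D) if j not in missing]
    log_w = np.log(np.asarray(p, dtype=float))
    for i in range(len(p)):
        # log-likelihood of the observed part under component i
        d = x[obs] - means[i, obs]
        v = variances[i, obs]
        log_w[i] += -0.5 * np.sum(d * d / v + np.log(2 * np.pi * v))
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    return w, means[:, missing], variances[:, missing]

# Two well-separated components; observing x_0 = 0 should favour component 0.
p = [0.5, 0.5]
means = np.array([[0.0, 1.0], [5.0, -1.0]])
variances = np.ones((2, 2))
w, m_J, v_J = restrict_gmm(p, means, variances, np.array([0.0, np.nan]), missing=[1])
```

In the toy example, observing <math>x_0 = 0</math> pushes nearly all of the mixture weight onto the component centred at the origin.<br />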
<br />
To process missing data with a neural network, the authors propose that only the first hidden layer needs modification. Specifically, they generalize the activation functions of all neurons in the first hidden layer to process the probability density functions representing the missing data points. For the conditional density function <math>F_S</math>, the authors define the generalized activation of a neuron <math>n: \mathbb{R}^D \to \mathbb{R}</math> on <math>F_S </math> as: <br />
<br />
<center><math>n(F_S)=E[n(x)|x \sim F_S]=\int n(x)F_S(x) \,dx</math>,</center> <br />
provided that the expectation exists. <br />
<br />
The following two theorems describe how to apply the above generalizations to both the ReLU and the RBF neurons, respectively. <br />
<br />
'''Theorem 3.1''' Let <math>F = \sum_i{p_i N(m_i, \Sigma_i)}</math> be a mixture of (possibly degenerate) Gaussians. Given weights <math>w=(w_1, ..., w_D) \in \mathbb{R}^D</math> and bias <math> b \in \mathbb{R} </math>, we have<br />
<br />
<center><math>\text{ReLU}_{w,b}(F)=\sum_i p_i \sqrt{w^{\top}\Sigma_i w} \, NR\Big(\frac{w^{\top}m_i+b}{\sqrt{w^{\top}\Sigma_i w}}\Big)</math></center> <br />
<br />
where <math>NR(x)=E[\text{ReLU}(z)], \, z \sim N(x,1)</math>, and <math>\text{ReLU}_{w,b}(x)=\max(w^{\top}x+b, 0)</math>.<br />
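The one-dimensional function <math>NR</math> has the classical closed form <math>NR(x) = x\Phi(x) + \phi(x)</math>, where <math>\Phi</math> and <math>\phi</math> are the standard normal CDF and density. The sketch below (our own illustration, using the scaling <math>\sigma = \sqrt{w^{\top}\Sigma w}</math> implied by <math>E[\text{ReLU}(N(\mu,\sigma^2))] = \sigma \, NR(\mu/\sigma)</math>) checks this closed form against a Monte Carlo average over imputations drawn from a single Gaussian:<br />

```python
import numpy as np
from math import erf, exp, sqrt, pi

def NR(x):
    """E[ReLU(z)] for z ~ N(x, 1), i.e. x * Phi(x) + phi(x)."""
    Phi = 0.5 * (1.0 + erf(x / sqrt(2.0)))      # standard normal CDF
    phi = exp(-0.5 * x * x) / sqrt(2.0 * pi)    # standard normal density
    return x * Phi + phi

def relu_response(p, means, covs, w, b):
    """Generalized ReLU activation on a Gaussian mixture."""
    total = 0.0
    for p_i, m, S in zip(p, means, covs):
        sigma = sqrt(w @ S @ w)
        total += p_i * sigma * NR((w @ m + b) / sigma)
    return total

# Monte Carlo check on a single Gaussian component.
rng = np.random.default_rng(0)
w, b = np.array([1.0, -2.0]), 0.5
m, S = np.array([0.3, 0.1]), np.diag([0.5, 0.2])
samples = rng.multivariate_normal(m, S, size=200_000)
mc = np.maximum(samples @ w + b, 0.0).mean()
closed = relu_response([1.0], [m], [S], w, b)
```
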
<br />
'''Theorem 3.2''' Let <math>F = \sum_{i=1}^k{p_i N(m_i, \Sigma_i)}</math> be a mixture of (possibly degenerate) Gaussians and let the RBF unit be given by the Gaussian density <math>N(c, \Gamma) </math>. We have: <br />
<br />
<center><math>\text{RBF}_{c, \Gamma}(F) = \sum_{i=1}^k{p_iN(m_i-c, \Gamma+\Sigma_i)}(0)</math>.</center> <br />
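Theorem 3.2 can be checked numerically in the same way when the RBF unit is interpreted as the Gaussian density <math>N(c,\Gamma)(x)</math>: the expected response under <math>N(m,\Sigma)</math> is the convolution of the two densities, which equals <math>N(m-c, \Gamma+\Sigma)</math> evaluated at <math>0</math>. A toy sketch (our own, with made-up parameters):<br />

```python
import numpy as np

def gauss_pdf(x, mean, cov):
    """Density of N(mean, cov) evaluated at x."""
    d = x - mean
    k = len(mean)
    norm = np.sqrt((2 * np.pi) ** k * np.linalg.det(cov))
    return float(np.exp(-0.5 * d @ np.linalg.solve(cov, d)) / norm)

def rbf_response(p, means, covs, c, Gamma):
    """Generalized RBF activation on a Gaussian mixture (Theorem 3.2 form)."""
    zero = np.zeros_like(c)
    return sum(p_i * gauss_pdf(zero, m - c, Gamma + S)
               for p_i, m, S in zip(p, means, covs))

# Monte Carlo check on a single Gaussian component.
rng = np.random.default_rng(1)
c, Gamma = np.array([0.2, -0.1]), np.diag([0.4, 0.6])
m, S = np.array([0.5, 0.3]), np.diag([0.3, 0.2])
samples = rng.multivariate_normal(m, S, size=50_000)
mc = np.mean([gauss_pdf(x, c, Gamma) for x in samples])
closed = rbf_response([1.0], [m], [S], c, Gamma)
```
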
<br />
In the case where the data set contains no missing values, the generalized neurons reduce to classical ones, since the distribution <math>F</math> is used only to estimate possible values of the missing attributes. Note, however, that if the model is to handle incomplete data at test time, it must also be trained on incomplete data.<br />
<br />
== Theoretical Analysis ==<br />
<br />
The main theoretical results, summarized below, show that using the generalized neuron activations at the first layer does not lead to a loss of information. <br />
<br />
The generalized response of a neuron <math>n: \mathbb{R}^D \rightarrow \mathbb{R}</math> evaluated on a probability measure <math>\mu</math> is given by <br />
<center><math>n(\mu) := \int n(x)\,d\mu(x).</math></center><br />
<br />
Theorem 4.1 shows that a neural network with generalized ReLU units can distinguish any two probability measures: if the generalized ReLU responses agree for all weights and biases, the measures must be equal. The proof presented by the authors uses the Universal Approximation Property (UAP) and is summarized as follows. <br />
<br />
<br />
'''Theorem 4.1.''' Let <math>\mu</math>, <math>\nu</math> be probability measures satisfying <math>\int \|x\| \, d\mu(x) < \infty</math> and <math>\int \|x\| \, d\nu(x) < \infty</math>. If <br />
<center><math>ReLU_{w,b}(\mu) = ReLU_{w,b}(\nu) \text{ for all } w \in \mathbb{R}^D, b \in \mathbb{R},</math></center> then <math>\nu = \mu.</math><br />
<br />
''Sketch of Proof'' Let <math>w \in \mathbb{R}^D</math> be fixed and define the set <center><math>F_w = \{p: \mathbb{R} \rightarrow \mathbb{R}: \int p(w^Tx)d\mu(x) = \int p(w^Tx)d\nu(x)\}.</math></center> The first step of the proof is to show that <math>F_w</math> contains all continuous bounded functions. The authors do this by showing that any continuous piecewise linear function <math>Q</math> that is affine on specific intervals belongs to <math>F_w</math>: such a <math>Q</math> can be written as a sum of tent-like piecewise linear functions <math>T</math>, so it suffices to show that each <math>T \in F_w</math>. <br />
<br />
Next, the authors show that an arbitrary bounded continuous function <math>G</math> is in <math>F_w</math> by the Lebesgue dominated convergence theorem. <br />
<br />
Then, since <math>\cos(\cdot), \sin(\cdot) \in F_w</math>, the function <math>\exp(ir) = \cos(r) + i\sin(r)</math> belongs to <math>F_w</math>, and we obtain the equality <math>\int \exp(iw^Tx)d\mu(x) = \int \exp(iw^Tx)d\nu(x)</math>. Since <math>w</math> was arbitrarily chosen, the characteristic functions of the two measures coincide, and we can conclude that <math>\mu = \nu</math>. <br />
<br />
<br />
More general results can be obtained by making stronger assumptions on the probability measures: for example, if a given family of neurons satisfies the UAP, then their generalization can identify any probability measure with compact support.<br />
<br />
== Experimental Results ==<br />
<br />
<br />
== Conclusion ==<br />
<br />
Together, the experimental results and the theoretical results show that this approach to handling missing data through a modification of the neural network's first layer is effective and outperforms many existing methods.<br />
<br />
== Critiques ==<br />
<br />
A simulation study where the mechanism of missingness was known would have been interesting to examine. Doing so would show when the proposed method outperforms existing methods, and under what conditions. In addition, since multiple imputation is one of the standard methods for handling missing data (and tends to perform better than mean imputation), a comparison against multiple imputation applied to the dataset before training a neural network would have been beneficial.<br />
<br />
== References ==<br />
[1] Yoshua Bengio and Francois Gingras. Recurrent neural networks for missing or asynchronous<br />
data. In Advances in neural information processing systems, pages 395–401, 1996.<br />
<br />
[2] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.</div>Gtompkinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441F21&diff=43166stat441F212020-11-02T19:16:14Z<p>Gtompkin: /* Paper presentation */</p>
<hr />
<div><br />
<br />
== [[F20-STAT 441/841 CM 763-Proposal| Project Proposal ]] ==<br />
<br />
<!--[https://goo.gl/forms/apurag4dr9kSR76X2 Your feedback on presentations]--><br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/10CHiJpAylR6kB9QLqN7lZHN79D9YEEW6CDTH27eAhbQ/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
=Paper presentation=<br />
{| class="wikitable" border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="250pt"|Name <br />
|width="15pt"|Paper number <br />
|width="700pt"|Title<br />
|width="15pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|width="30pt"|Link to the video<br />
|-<br />
|Sep 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary] || [https://youtu.be/JWozRg_X-Vg?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG&t=539]<br />
|-<br />
|Week of Nov 16 ||Sharman Bharat, Li Dylan,Lu Leonie, Li Mingdao || 1|| Risk prediction in life insurance industry using supervised learning algorithms || [https://rdcu.be/b780J Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Bsharman Summary] ||<br />
|-<br />
|Week of Nov 16 || Delaney Smith, Mohammad Assem Mahmoud || 2|| Influenza Forecasting Framework based on Gaussian Processes || [https://proceedings.icml.cc/static/paper_files/icml/2020/1239-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Tatianna Krikella, Swaleh Hussain, Grace Tompkins || 3|| Processing of Missing Data by Neural Networks || [http://papers.nips.cc/paper/7537-processing-of-missing-data-by-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Gtompkin Summary] ||<br />
|-<br />
|Week of Nov 16 ||Jonathan Chow, Nyle Dharani, Ildar Nasirov ||4 ||Streaming Bayesian Inference for Crowdsourced Classification ||[https://papers.nips.cc/paper/9439-streaming-bayesian-inference-for-crowdsourced-classification.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Matthew Hall, Johnathan Chalaturnyk || 5|| Neural Ordinary Differential Equations || [https://papers.nips.cc/paper/7892-neural-ordinary-differential-equations.pdf] || ||<br />
|-<br />
|Week of Nov 16 || Luwen Chang, Qingyang Yu, Tao Kong, Tianrong Sun || 6|| Adversarial Attacks on Copyright Detection Systems || [https://proceedings.icml.cc/static/paper_files/icml/2020/1894-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Casey De Vera, Solaiman Jawad, Jihoon Han || 7|| || || ||<br />
|-<br />
|Week of Nov 16 || Yuxin Wang, Evan Peters, Cynthia Mou, Sangeeth Kalaichanthiran || 8|| Uniform convergence may be unable to explain generalization in deep learning || [https://papers.nips.cc/paper/9336-uniform-convergence-may-be-unable-to-explain-generalization-in-deep-learning.pdf] || ||<br />
|-<br />
|Week of Nov 16 || Yuchuan Wu || 9|| || || ||<br />
|-<br />
|}</div>Gtompkin
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441F21&diff=43165 stat441F21 2020-11-02T19:15:46Z
<p>Gtompkin: /* Paper presentation */</p>
<hr />
<div><br />
<br />
== [[F20-STAT 441/841 CM 763-Proposal| Project Proposal ]] ==<br />
<br />
<!--[https://goo.gl/forms/apurag4dr9kSR76X2 Your feedback on presentations]--><br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/10CHiJpAylR6kB9QLqN7lZHN79D9YEEW6CDTH27eAhbQ/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
=Paper presentation=<br />
{| class="wikitable" border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="250pt"|Name <br />
|width="15pt"|Paper number <br />
|width="700pt"|Title<br />
|width="15pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|width="30pt"|Link to the video<br />
|-<br />
|Sep 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary] || [https://youtu.be/JWozRg_X-Vg?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG&t=539]<br />
|-<br />
|Week of Nov 16 ||Sharman Bharat, Li Dylan, Lu Leonie, Li Mingdao || 1|| Risk prediction in life insurance industry using supervised learning algorithms || [https://rdcu.be/b780J Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Bsharman Summary] ||<br />
|-<br />
|Week of Nov 16 || Delaney Smith, Mohammad Assem Mahmoud || 2|| Influenza Forecasting Framework based on Gaussian Processes || [https://proceedings.icml.cc/static/paper_files/icml/2020/1239-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Tatianna Krikella, Swaleh Hussain, Grace Tompkins || 3|| Processing of Missing Data by Neural Networks || [http://papers.nips.cc/paper/7537-processing-of-missing-data-by-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Gtompkin Summary] ||<br />
|-<br />
|Week of Nov 16 ||Jonathan Chow, Nyle Dharani, Ildar Nasirov ||4 ||Streaming Bayesian Inference for Crowdsourced Classification ||[https://papers.nips.cc/paper/9439-streaming-bayesian-inference-for-crowdsourced-classification.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Matthew Hall, Johnathan Chalaturnyk || 5|| Neural Ordinary Differential Equations || [https://papers.nips.cc/paper/7892-neural-ordinary-differential-equations.pdf] || ||<br />
|-<br />
|Week of Nov 16 || Luwen Chang, Qingyang Yu, Tao Kong, Tianrong Sun || 6|| Adversarial Attacks on Copyright Detection Systems || [https://proceedings.icml.cc/static/paper_files/icml/2020/1894-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Casey De Vera, Solaiman Jawad, Jihoon Han || 7|| || || ||<br />
|-<br />
|Week of Nov 16 || Yuxin Wang, Evan Peters, Cynthia Mou, Sangeeth Kalaichanthiran || 8|| Uniform convergence may be unable to explain generalization in deep learning || [https://papers.nips.cc/paper/9336-uniform-convergence-may-be-unable-to-explain-generalization-in-deep-learning.pdf] || ||<br />
|-<br />
|Week of Nov 16 || Yuchuan Wu || 9|| || || ||<br />
|-<br />
|Week of Nov 16 || Zhou Zeping, Siqi Li, Yuqin Fang, Fu Rao || 10|| The Spectrum of the Fisher Information Matrix of a Single-Hidden-Layer Neural Network || [http://people.cs.uchicago.edu/~pworah/rmt2.pdf] || ||<br />
|-<br />
|Week of Nov 23 ||Jinjiang Lian, Jiawen Hou, Yisheng Zhu, Mingzhe Huang || 11|| DROCC: Deep Robust One-Class Classification || [https://proceedings.icml.cc/static/paper_files/icml/2020/6556-Paper.pdf paper] || ||<br />
|-<br />
|Week of Nov 23 || Bushra Haque, Hayden Jones, Michael Leung, Cristian Mustatea || 12|| Combine Convolution with Recurrent Networks for Text Classification || [https://arxiv.org/pdf/2006.15795.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Taohao Wang, Zeren Shen, Zihao Guo, Rui Chen || 13|| Deep multiple instance learning for image classification and auto-annotation || [https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Wu_Deep_Multiple_Instance_2015_CVPR_paper.pdf paper] || ||<br />
|-<br />
|Week of Nov 23 || Qianlin Song, William Loh, Junyue Bai, Phoebe Choi || 14|| Task Understanding from Confusing Multi-task Data || [https://proceedings.icml.cc/static/paper_files/icml/2020/578-Paper.pdf paper] || ||<br />
|-<br />
|Week of Nov 23 || Rui Gong, Xuetong Wang, Xinqi Ling, Di Ma || 15|| Semantic Relation Classification via Convolution Neural Network|| [https://www.aclweb.org/anthology/S18-1127.pdf paper] || ||<br />
|-<br />
|Week of Nov 23 || Xiaolan Xu, Robin Wen, Yue Weng, Beizhen Chang || 16|| Graph Structure of Neural Networks || [https://proceedings.icml.cc/paper/2020/file/757b505cfd34c64c85ca5b5690ee5293-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 ||Hansa Halim, Sanjana Rajendra Naik, Samka Marfua, Shawrupa Proshasty || 17|| Superhuman AI for multiplayer poker || [https://www.cs.cmu.edu/~noamb/papers/19-Science-Superhuman.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 ||Guanting Pan, Haocheng Chang, Zaiwei Zhang || 18|| Point-of-Interest Recommendation: Exploiting Self-Attentive Autoencoders with Neighbor-Aware Influence || [https://arxiv.org/pdf/1809.10770.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Jerry Huang, Daniel Jiang, Minyan Dai, Leyan Cheng || 19|| Neural Speed Reading Via Skim-RNN ||[https://arxiv.org/pdf/1711.02085.pdf?fbclid=IwAR3EeFsKM_b5p9Ox7X9mH-1oI3U3oOKPBy3xUOBN0XvJa7QW2ZeJJ9ypQVo Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Neural_Speed_Reading_via_Skim-RNN Summary]||<br />
|-<br />
|Week of Nov 23 ||Ruixian Chin, Yan Kai Tan, Jason Ong, Wen Cheen Chiew || 20|| DivideMix: Learning with Noisy Labels as Semi-supervised Learning || [https://openreview.net/pdf?id=HJgExaVtwr] || ||<br />
|-<br />
|Week of Nov 30 || Banno Dion, Battista Joseph, Kahn Solomon || 21|| Music Recommender System Based on Genre using Convolutional Recurrent Neural Networks || [https://www.sciencedirect.com/science/article/pii/S1877050919310646] || ||<br />
|-<br />
|Week of Nov 30 || Sai Arvind Budaraju, Isaac Ellmen, Dorsa Mohammadrezaei, Emilee Carson || 22|| A universal SNP and small-indel variant caller using deep neural networks||[https://www.nature.com/articles/nbt.4235.epdf?author_access_token=q4ZmzqvvcGBqTuKyKgYrQ9RgN0jAjWel9jnR3ZoTv0NuM3saQzpZk8yexjfPUhdFj4zyaA4Yvq0LWBoCYQ4B9vqPuv8e2HHy4vShDgEs8YxI_hLs9ov6Y1f_4fyS7kGZ Paper] || ||<br />
|-<br />
|Week of Nov 30 || Daniel Fagan, Cooper Brooke, Maya Perelman || 23|| Efficient kNN Classification With Different Number of Nearest Neighbors || [https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7898482 Paper] || ||<br />
|-<br />
|Week of Nov 30 || Karam Abuaisha, Evan Li, Jason Pu, Nicholas Vadivelu || 24|| Being Bayesian about Categorical Probability || [https://proceedings.icml.cc/static/paper_files/icml/2020/3560-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 || Anas Mahdi, Will Thibault, Jan Lau, Jiwon Yang || 25|| Loss Function Search for Face Recognition || [https://proceedings.icml.cc/static/paper_files/icml/2020/245-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Zihui (Betty) Qin, Wenqi (Maggie) Zhao, Muyuan Yang, Amartya (Marty) Mukherjee || 26|| Deep Learning for Cardiologist-level Myocardial Infarction Detection in Electrocardiograms || [https://arxiv.org/pdf/1912.07618.pdf?fbclid=IwAR0RwATSn4CiT3qD9LuywYAbJVw8YB3nbex8Kl19OCExIa4jzWaUut3oVB0 Paper] || ||<br />
|-<br />
|Week of Nov 30 || Stan Lee, Seokho Lim, Kyle Jung, Daehyun Kim || 27|| Bag of Tricks for Efficient Text Classification || [https://arxiv.org/pdf/1607.01759.pdf paper] || ||<br />
|-<br />
|Week of Nov 30 || Yawen Wang, Danmeng Cui, ZiJie Jiang, Mingkang Jiang, Haotian Ren, Haris Bin Zahid || 28|| A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques || [https://arxiv.org/pdf/1707.02919.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 || Qing Guo, XueGuang Ma, James Ni, Yuanxin Wang || 29|| Mask R-CNN || [https://arxiv.org/pdf/1703.06870.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 || Bertrand Sodjahin, Junyi Yang, Jill Yu Chieh Wang, Yu Min Wu, Calvin Li || 30|| Research paper classification systems based on TF‑IDF and LDA schemes || [https://hcis-journal.springeropen.com/articles/10.1186/s13673-019-0192-7?fbclid=IwAR3swO-eFrEbj1BUQfmomJazxxeFR6SPgr6gKayhs38Y7aBG-zX1G3XWYRM Paper] || ||<br />
|-<br />
|Week of Nov 30 || Daniel Zhang, Jacky Yao, Scholar Sun, Russell Parco, Ian Cheung || 31 || Speech2Face: Learning the Face Behind a Voice || [https://arxiv.org/pdf/1905.09773.pdf?utm_source=thenewstack&utm_medium=website&utm_campaign=platform Paper] || ||<br />
|-<br />
|Week of Nov 30 || Siyuan Xia, Jiaxiang Liu, Jiabao Dong, Yipeng Du || 32 || Evaluating Machine Accuracy on ImageNet || [https://proceedings.icml.cc/static/paper_files/icml/2020/6173-Paper.pdf] || ||<br />
|-<br />
|Week of Nov 30 || Msuhi Wang, Siyuan Qiu, Yan Yu || 33 || Surround Vehicle Motion Prediction Using LSTM-RNN for Motion Planning of Autonomous Vehicles at Multi-Lane Turn Intersections || [https://ieeexplore.ieee.org/abstract/document/8957421 Paper] || ||<br />
|}</div>Gtompkin
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Gtompkin&diff=43164 User:Gtompkin 2020-11-02T19:00:25Z
<p>Gtompkin: /* Theoretical Analysis */</p>
<hr />
<div>== Presented by == <br />
Grace Tompkins, Tatiana Krikella, Swaleh Hussain<br />
<br />
== Introduction ==<br />
<br />
One of the fundamental challenges in machine learning and data science is dealing with missing and incomplete data. This paper proposes a theoretically justified methodology for using incomplete data in neural networks, eliminating the need to complete the data beforehand by imputation or the other methods common in the existing literature. The authors represent each missing data point with a parametric density, which is trained jointly with the rest of the network's parameters. To process this probabilistic representation, the response of each neuron in the first hidden layer is generalized by taking its expected value; this amounts to computing the neuron's average activation over imputations drawn from the missing data's density. The approach is advantageous because it can train neural networks directly on incomplete observations, which are ubiquitous in practice, and because it requires only minimal modifications to existing architectures. Theoretical results show that this generalization does not lead to a loss of information, while experiments demonstrate its practical use on several different types of networks.<br />
<br />
== Related Work ==<br />
<br />
Most current approaches to incomplete inputs in machine learning fill in the absent attributes based on complete, observed data; two commonly used methods are mean imputation and <math>k</math>-NN imputation. Other approaches involve training separate neural networks, extreme learning machines, or <math>k</math>-nearest neighbours. Probabilistic models of incomplete data can also be built, depending on the missingness mechanism (i.e., whether the data are Missing At Random (MAR), Missing Completely At Random (MCAR), or Missing Not At Random (MNAR)), and then fed into a particular learning model. Previous work using neural networks for missing data includes a paper by Bengio and Gingras [1], in which the authors used recurrent neural networks with feedback into the input units to fill absent attributes solely to minimize the learning criterion. Goodfellow et al. [2] also used neural networks, introducing a multi-prediction deep Boltzmann machine that can perform classification on data with missingness in the inputs.<br />
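The two imputation baselines mentioned above can be sketched in a few lines of plain Python. This is an illustrative toy implementation (ours, not from the paper): `None` marks a missing entry, and the k-NN variant measures distance only on the coordinates the incomplete row actually observes, using fully observed rows as candidates.<br />

```python
import math

def mean_impute(rows):
    """Replace each missing entry (None) with its column's observed mean."""
    ncol = len(rows[0])
    means = []
    for j in range(ncol):
        obs = [r[j] for r in rows if r[j] is not None]
        means.append(sum(obs) / len(obs))
    return [[means[j] if r[j] is None else r[j] for j in range(ncol)] for r in rows]

def knn_impute(rows, k=1):
    """Replace missing entries with the average over the k nearest fully
    observed rows; distance is measured only on the observed coordinates.
    Assumes at least k fully observed rows and at least one observed
    coordinate in every row."""
    complete = [r for r in rows if all(v is not None for v in r)]
    out = []
    for r in rows:
        if all(v is not None for v in r):
            out.append(list(r))
            continue
        obs = [j for j, v in enumerate(r) if v is not None]
        nbrs = sorted(
            complete,
            key=lambda c: math.dist([r[j] for j in obs], [c[j] for j in obs]),
        )[:k]
        out.append([r[j] if r[j] is not None else sum(c[j] for c in nbrs) / k
                    for j in range(len(r))])
    return out

rows = [[1.0, 2.0], [4.0, None], [5.0, 6.0]]
filled_mean = mean_impute(rows)      # missing entry -> column mean (2 + 6) / 2 = 4.0
filled_knn = knn_impute(rows, k=1)   # nearest complete row is [5.0, 6.0] -> 6.0
```

For real use, library implementations such as scikit-learn's SimpleImputer and KNNImputer cover the same two strategies.<br />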
<br />
== Layer for Processing Missing Data ==<br />
<br />
In this approach, the adaptation of a given neural network to incomplete data relies on two steps: the estimation of the missing data and the generalization of the neuron's activation. <br />
<br />
Let <math>(x,J)</math> represent a missing data point, where <math>x \in \mathbb{R}^D </math>, and <math>J \subset \{1,\dots,D\} </math> is a set of attributes with missing data.<br />
<br />
For each missing point <math>(x,J)</math>, define an affine subspace consisting of all points which coincide with <math>x</math> on the known coordinates <math>J'=\{1,\dots,D\} \setminus J</math>: <br />
<br />
<center><math>S=\text{Aff}[x,J]=x+\text{span}(e_J) </math></center> <br />
where <math>e_J=[e_j]_{j\in J}</math> and <math>e_j</math> is the <math> j^{th}</math> canonical vector in <math>\mathbb{R}^D </math>.<br />
<br />
Assume that the missing data points come from the D-dimensional probability distribution, <math>F</math>. In their approach, the authors assume that the data points follow a mixture of Gaussians (GMM) with diagonal covariance matrices. By choosing diagonal covariance matrices, the number of model parameters is reduced. To model the missing points <math>(x,J)</math>, the density <math>F</math> is restricted to the affine subspace <math>S</math>. Thus, possible values of <math>(x,J)</math> are modelled using the conditional density <math>F_S: S \to \mathbb{R} </math>, <br />
<br />
<center><math>F_S(x) = \begin{cases}<br />
\frac{1}{\int_{S} F(s) \,ds}F(x) & \text{if $x \in S$,} \\<br />
0 & \text{otherwise.}<br />
\end{cases} </math></center><br />
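Because the covariance matrices are diagonal, the restriction of <math>F</math> to <math>S</math> has a closed form: each component factorizes over coordinates, so <math>F_S</math> is again a diagonal GMM over the missing coordinates, with the mixing weights reweighted by each component's likelihood on the observed coordinates and the normalizer playing the role of <math>\int_{S} F(s)\,ds</math>. A small sketch of this reweighting (our illustrative code, not the authors'):<br />

```python
import math

def norm_pdf(x, mean, var):
    """Density of a one-dimensional N(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def restricted_weights(p, means, variances, x, observed):
    """Mixing weights of the conditional density F_S for a diagonal GMM:
    each component i is reweighted by p_i times its likelihood on the
    observed coordinates, then normalized (the normalizer is the value
    of the integral of F over the affine subspace S)."""
    raw = []
    for p_i, m_i, v_i in zip(p, means, variances):
        lik = 1.0
        for j in observed:
            lik *= norm_pdf(x[j], m_i[j], v_i[j])
        raw.append(p_i * lik)
    z = sum(raw)
    return [r / z for r in raw]

# Two components in R^2; coordinate 0 is observed at 4.0, coordinate 1 is missing.
weights = restricted_weights(
    p=[0.5, 0.5],
    means=[[0.0, 0.0], [4.0, 0.0]],
    variances=[[1.0, 1.0], [1.0, 1.0]],
    x=[4.0, None],
    observed=[0],
)
```

Over the missing coordinate, the conditional density is then the mixture of each component's missing-coordinate Gaussian with these new weights; here the observed value matches the second component's mean, so essentially all of the conditional mixture's weight moves onto that component.<br />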
<br />
To process the missing data by a neural network, the authors propose that only the first hidden layer needs modification. Specifically, they generalize the activation functions of all the neurons in the first hidden layer of the network to process the probability density functions representing the missing data points. For the conditional density function <math>F_S</math>, the authors define the generalized activation of a neuron <math>n: \mathbb{R}^D \to \mathbb{R}</math> on <math>F_S </math> as: <br />
<br />
<center><math>n(F_S)=E[n(x)|x \sim F_S]=\int n(x)F_S(x) \,dx</math>,</center> <br />
provided that the expectation exists. <br />
<br />
The following two theorems describe how to apply the above generalizations to both the ReLU and the RBF neurons, respectively. <br />
<br />
'''Theorem 3.1''' Let <math>F = \sum_i{p_iN(m_i, \Sigma_i)}</math> be a mixture of (possibly degenerate) Gaussians. Given weights <math>w=(w_1, ..., w_D) \in \mathbb{R}^D</math> and bias <math> b \in \mathbb{R} </math>, we have<br />
<br />
<center><math>\text{ReLU}_{w,b}(F)=\sum_i p_i \sqrt{w^{\top}\Sigma_i w}\, NR\Big(\frac{w^{\top}m_i+b}{\sqrt{w^{\top}\Sigma_i w}}\Big)</math></center> <br />
<br />
where <math>NR(x)=\text{ReLU}[N(x,1)]</math> and <math>\text{ReLU}_{w,b}(x)=\text{max}(w^{\top}x+b, 0)</math>, with weights <math>w \in \mathbb{R}^D </math> and bias <math> b \in \mathbb{R}</math>.<br />
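Theorem 3.1 can be sanity-checked numerically: for a single non-degenerate Gaussian, <math>w^{\top}x+b \sim N(\mu, s^2)</math> with <math>\mu = w^{\top}m+b</math> and <math>s^2 = w^{\top}\Sigma w</math>, so the generalized activation is <math>s\,NR(\mu/s)</math>, where <math>NR(x) = \phi(x) + x\Phi(x)</math> in terms of the standard normal density and CDF (a mixture simply weights the per-component values by <math>p_i</math>). The sketch below (ours, standard library only) compares this closed form against a Monte Carlo estimate:<br />

```python
import math
import random

def nr(x):
    """NR(x) = E[max(Z, 0)] for Z ~ N(x, 1), i.e. phi(x) + x * Phi(x)."""
    phi = math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)
    Phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return phi + x * Phi

def relu_gaussian(w, b, m, var):
    """E[ReLU(w.x + b)] for x ~ N(m, diag(var)): w.x + b is N(mu, s^2),
    so the expectation equals s * NR(mu / s)."""
    mu = sum(wi * mi for wi, mi in zip(w, m)) + b
    s = math.sqrt(sum(wi * wi * vi for wi, vi in zip(w, var)))
    return s * nr(mu / s)

random.seed(0)
w, b = [1.0, 1.0], 0.3
m, var = [0.2, 1.0], [0.5, 2.0]   # mean and diagonal covariance of the input
analytic = relu_gaussian(w, b, m, var)

# Monte Carlo check: average ReLU activation over samples from N(m, diag(var)).
total = 0.0
n = 200_000
for _ in range(n):
    x = [random.gauss(mi, math.sqrt(vi)) for mi, vi in zip(m, var)]
    total += max(sum(wi * xi for wi, xi in zip(w, x)) + b, 0.0)
mc = total / n
```

With these parameters the two estimates agree to within Monte Carlo error.<br />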
<br />
'''Theorem 3.2''' Let <math>F = \sum_i{p_iN(m_i, \Sigma_i)}</math> be the mixture of (possibly degenerate) Gaussians and let the RBF unit be parametrized by <math>N(c, \Gamma) </math>. We have: <br />
<br />
<center><math>\text{RBF}_{c, \Gamma}(F) = \sum_{i=1}^k{p_iN(m_i-c, \Gamma+\Sigma_i)}(0)</math>.</center> <br />
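Theorem 3.2 admits the same kind of numerical check. For diagonal <math>\Gamma</math> and <math>\Sigma</math>, the density <math>N(m-c, \Gamma+\Sigma)(0)</math> factorizes across coordinates, so the expected RBF response is a product of one-dimensional normal densities. Again an illustrative sketch (ours, not the authors' code):<br />

```python
import math
import random

def norm_pdf(x, mean, var):
    """Density of a one-dimensional N(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def rbf_gaussian(c, gamma, m, sigma):
    """E[N(c, diag(gamma))(x)] for x ~ N(m, diag(sigma)): equals the density
    of N(m - c, diag(gamma + sigma)) at 0, one factor per coordinate."""
    out = 1.0
    for cd, gd, md, sd in zip(c, gamma, m, sigma):
        out *= norm_pdf(0.0, md - cd, gd + sd)
    return out

random.seed(1)
c, gamma = [0.0, 1.0], [1.0, 0.5]     # RBF centre and diagonal of Gamma
m, sigma = [0.5, 0.5], [0.25, 0.25]   # Gaussian input: mean and diagonal of Sigma
analytic = rbf_gaussian(c, gamma, m, sigma)

# Monte Carlo check: average the RBF unit's value over input samples.
total = 0.0
n = 200_000
for _ in range(n):
    x = [random.gauss(md, math.sqrt(sd)) for md, sd in zip(m, sigma)]
    total += norm_pdf(x[0], c[0], gamma[0]) * norm_pdf(x[1], c[1], gamma[1])
mc = total / n
```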
<br />
In the case where the data set contains no missing values, the generalized neurons reduce to classical ones, since the distribution <math>F</math> is only used to estimate possible values at missing attributes. However, if one wishes to use an incomplete data set in the testing stage, then an incomplete data set must be used to train the model.<br />
<br />
<br />
== Theoretical Analysis ==<br />
<br />
The main theoretical results, summarized below, show that using the generalized neuron activations at the first layer does not lead to a loss of information. <br />
<br />
The generalized response of a neuron <math>n: \mathbb{R}^D \rightarrow \mathbb{R}</math> evaluated on a probability measure <math>\mu</math> is given by <br />
<center><math>n(\mu) := \int n(x)\,d\mu(x).</math></center><br />
<br />
Theorem 4.1 shows that a neural network with generalized ReLU units can distinguish any two probability measures with finite first moment; that is, the generalized responses uniquely determine the measure. The proof presented by the authors uses the Universal Approximation Property (UAP), and is summarized as follows. <br />
<br />
<br />
'''Theorem 4.1.''' Let <math>\mu</math>, <math>\nu</math> be probability measures satisfying <math>\int \|x\| \,d \mu(x) < \infty</math> and <math>\int \|x\| \,d \nu(x) < \infty</math>. If <br />
<center><math>ReLU_{w,b}(\mu) = ReLU_{w,b}(\nu) \text{ for all } w \in \mathbb{R}^D, b \in \mathbb{R},</math></center> then <math>\nu = \mu.</math><br />
<br />
''Sketch of Proof:'' Let <math>w \in \mathbb{R}^D</math> be fixed and define the set <center><math>F_w = \{p: \mathbb{R} \rightarrow \mathbb{R}: \int p(w^Tx)d\mu(x) = \int p(w^Tx)d\nu(x)\}.</math></center> The first step of the proof is to show that <math>F_w</math> contains all continuous bounded functions. To do so, the authors consider a piecewise continuous function <math>Q</math> that is affine linear on specific intervals: rewriting <math>Q</math> as a sum of tent-like piecewise linear functions <math>T</math>, it suffices to show that each such <math>T \in F_w</math>. <br />
<br />
Next, the authors show that an arbitrary bounded continuous function <math>G</math> is in <math>F_w</math> by the Lebesgue dominated convergence theorem. <br />
<br />
Then, since <math>\cos(\cdot), \sin(\cdot) \in F_w</math>, the function <math>\exp(ir) = \cos(r) + i\sin(r)</math> also lies in <math>F_w</math>, and we have the equality <math>\int \exp(iw^Tx)d\mu(x) = \int \exp(iw^Tx)d\nu(x)</math>. Since <math>w</math> was arbitrarily chosen, we can conclude that <math>\mu = \nu</math> <br />
as the characteristic functions of the two measures coincide. <br />
<br />
<br />
More general results can be obtained by making stronger assumptions on the probability measures: for example, if a given family of neurons satisfies the UAP, then their generalization can identify any probability measure with compact support.<br />
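To make the identifiability statement concrete, here is a toy example (ours, not from the paper): the two discrete measures below have the same mean, so every purely linear response <math>w\,(\cdot)+b</math> agrees on them, yet a single ReLU response with <math>b=0</math> already tells them apart.<br />

```python
def relu_response(points, weights, w, b):
    """Generalized response of ReLU_{w,b} on the discrete 1-D measure
    sum_i weights[i] * delta_{points[i]}."""
    return sum(p * max(w * x + b, 0.0) for x, p in zip(points, weights))

def linear_response(points, weights, w, b):
    """Response of the purely linear unit x -> w*x + b on the same measure."""
    return sum(p * (w * x + b) for x, p in zip(points, weights))

mu_pts, mu_wts = [-1.0, 1.0], [0.5, 0.5]   # uniform on {-1, +1}
nu_pts, nu_wts = [0.0], [1.0]              # point mass at 0; same mean as mu

# A linear unit (here w=1, b=5, where the ReLU is in its affine regime on
# both supports) cannot separate the two measures...
same_linear = linear_response(mu_pts, mu_wts, 1.0, 5.0) == linear_response(nu_pts, nu_wts, 1.0, 5.0)

# ...but the generalized ReLU response with b = 0 does.
r_mu = relu_response(mu_pts, mu_wts, 1.0, 0.0)   # 0.5
r_nu = relu_response(nu_pts, nu_wts, 1.0, 0.0)   # 0.0
```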
<br />
== Experimental Results ==<br />
<br />
<br />
== Conclusion ==<br />
<br />
<br />
== Critiques ==<br />
<br />
== References ==<br />
[1] Yoshua Bengio and Francois Gingras. Recurrent neural networks for missing or asynchronous<br />
data. In Advances in neural information processing systems, pages 395–401, 1996.<br />
<br />
[2] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.</div>Gtompkin
<br />
== Layer for Processing Missing Data ==<br />
<br />
In this approach, the adaptation of a given neural network to incomplete data relies on two steps: the estimation of the missing data and the generalization of the neuron's activation. <br />
<br />
Let <math>(x,J)</math> represent a missing data point, where <math>x \in \mathbb{R}^D </math>, and <math>J \subset {1,...,D} </math> is a set of attributes with missing data.<br />
<br />
For each missing point <math>(x,J)</math>, define an affine subspace consisting of all points which coincide with <math>x</math> on known coordinates <math>J'=\{1,…,N\}/J</math>: <br />
<br />
<center><math>S=Aff[x,J]=span(e_J) </math></center> <br />
where <math>e_J=[e_j]_{j\in J}</math> and <math>e_j</math> is the <math> j^{th}</math> canonical vector in <math>\mathbb{R}^D </math>.<br />
<br />
Assume that the missing data points come from the D-dimensional probability distribution, <math>F</math>. In their approach, the authors assume that the data points follow a mixture of Gaussians (GMM) with diagonal covariance matrices. By choosing diagonal covariance matrices, the number of model parameters is reduced. To model the missing points <math>(x,J)</math>, the density <math>F</math> is restricted to the affine subspace <math>S</math>. Thus, possible values of <math>(x,J)</math> are modelled using the conditional density <math>F_S: S \to \mathbb{R} </math>, <br />
<br />
<center><math>F_S(x) = \begin{cases}<br />
\frac{1}{\int_{S} F(s) \,ds}F(x) & \text{if $x \in S$,} \\<br />
0 & \text{otherwise.}<br />
\end{cases} </math></center><br />
<br />
To process the missing data by a neural network, the authors propose that only the first hidden layer needs modification. Specifically, they generalize the activation functions of all the neurons in the first hidden layer of the network to process the probability density functions representing the missing data points. For the conditional density function <math>F_S</math>, the authors define the generalized activation of a neuron <math>n: \mathbb{R}^D \to \mathbb{R}</math> on <math>F_S </math> as: <br />
<br />
<center><math>n(F_S)=E[n(x)|x \sim F_S]=\int n(x)F_S(x) \,dx</math>,</center> <br />
provided that the expectation exists. <br />
<br />
The following two theorems describe how to apply the above generalizations to both the ReLU and the RBF neurons, respectively. <br />
<br />
'''Theorem 3.1''' Let <math>F = \sum_i{p_iN(m_i, \Sigma_i)}</math> be the mixture of (possibly degenerate) Gaussians. Given weights <math>w=(w_1, ..., w_D) \in \mathbb{R}^D,</math><math> b \in \mathbb{R} </math>, we have<br />
<br />
<center><math>\text{ReLU}_{w,b}(F)=\sum_i{p_iNR\big(\frac{w^{\top}m_i+b}{\sqrt{w^{\top}\Sigma_iw}}}\big)</math></center> <br />
<br />
where <math>NR(x)=\text{ReLU}[N(x,1)]</math> and <math>\text{ReLU}_{w,b}(x)=\text{max}(w^{\top}+b, 0)</math>, <math>w \in \mathbb{R}^D </math> and <math> b \in \mathbb{R}</math> is the bias.<br />
<br />
'''Theorem 3.2''' Let <math>F = \sum_i{p_iN(m_i, \Sigma_i)}</math> be the mixture of (possibly degenerate) Gaussians and let the RBF unit be parametrized by <math>N(c, \Gamma) </math>. We have: <br />
<br />
<center><math>\text{RBF}_{c, \Gamma}(F) = \sum_{i=1}^k{p_iN(m_i-c, \Gamma+\Sigma_i)}(0)</math>.</center> <br />
<br />
In the case where the data set contains no missing values, the generalized neurons reduce to classical ones, since the distribution <math>F</math> is only used to estimate possible values at missing attributes. However, if one wishes to use an incomplete data set in the testing stage, then an incomplete data set must be used to train the model.<br />
<br />
<math> </math><br />
<br />
== Theoretical Analysis ==<br />
<br />
The main theoretical results, which are summarized below, show that using generalized neuron's activation at the first layer does not lead to the loss of information. <br />
<br />
Let the generalized response of a neuron <math>n: \mathbb{R}^D \rightarrow \mathbb{R}</math> evaluated on a probability measure <math>\mu</math> which is given by <br />
<center><math>n(\mu) := \int n(x)d\mu(x)</math></center>.<br />
<br />
Theorem 4.1 shows that a neural network with generalized ReLU units is able to identify any two probability measures. The proof presented by the authors uses the Universal Approximation Property (UAP), and is summarized as follows. <br />
<br />
<br />
'''Theorem 4.1.''' Let <math>\mu</math>, <math>v</math> be probabilistic measures satisfying <math>\int ||x|| d \mu(x) < \infty</math>. If <br />
<center><math>ReLU_{w,b}(\mu) = ReLU_{w,b}(\nu) \text{ for } w \in \mathbb{R}^D, b \in \mathbb{R}</center><br />
then <math>\nu = \mu</math><br />
<br />
''Sketch of Proof'' Let <math>w \in \mathbb{R}^D</math> be fixed and define the set <math>F_w = \{p: \mathbb{R} \rightarrow \mathbb{R}: \int p(w^Tx)d\mu(x) = \int p(w^Tx)d\nu(x)\}</math>. The first step of the proof involves showing that <math>F_w</math> contains all continuous and bounded functions. The authors show this by showing that a piecewise continuous function that is affine linear on specific intervals, <math>Q</math>, is in the set <math>F_w</math>. This involves re-writing <math>Q</math> as a sum of tent-like piecewise linear functions, <math>T</math> and showing that <math>T \in F_w</math> (since it is sufficient to only show <math>T \in F_w</math>). <br />
<br />
Next, the authors show that an arbitrary bounded continuous function <math>G</math> is in <math>F_w</math> by the Lebesgue dominated convergence theorem. <br />
<br />
Then, as <math>cos(\cdot), sin(\cdot) \in F_w</math>, the function <math>exp(ir) = cos(r) + sin(r) \in F_w</math> and we have the equality <math>\int exp(iw^Tx)d\mu(x) = \int exp(iw^Tx)d\nu(x)</math>. Since <math>w</math> was arbitrarily chosen, we can conclude that <math>\mu = \nu</math> <br />
as the characteristic functions of the two measures coincide. <br />
<br />
<br />
More general results can be obtained making stronger assumptions on the probability measures, for example if a given family of neurons satisfies UAP, then their generalization can identify any probability measure with compact support.<br />
<br />
== Experimental Results ==<br />
<br />
<br />
== Conclusion ==<br />
<br />
<br />
== Critiques ==<br />
<br />
== References ==<br />
[1] Yoshua Bengio and Francois Gingras. Recurrent neural networks for missing or asynchronous<br />
data. In Advances in neural information processing systems, pages 395–401, 1996.<br />
<br />
[2] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.</div>Gtompkinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Gtompkin&diff=43155User:Gtompkin2020-11-02T18:03:28Z<p>Gtompkin: /* Theoretical Analysis */</p>
<hr />
<div>== Presented by == <br />
Grace Tompkins, Tatiana Krikella, Swaleh Hussain<br />
<br />
== Introduction ==<br />
<br />
One of the fundamental challenges in machine learning in data science is dealing with missing and incomplete data. This paper proposes theoretically justified methodology for using incomplete data in neural networks, eliminating the need for direct completion of the data by imputation or other commonly used methods in existing literature. The authors propose identifying missing data points with a parametric density and then training it together with the rest of the network's parameters. The neuron's response at the first hidden layer is generalized by taking its expected value to process this probabilistic representation. This process is essentially calculating the average activation of the neuron over imputations drawn from the missing data's density. The proposed approach is advantageous as it has the ability to train neural networks using incomplete observations from datasets, which are ubiquitous in practice. This approach also requires minimal adjustments and modifications to existing architectures. Theoretical results of this study show that this process does not lead to a loss of information, while experimental results showed the practical uses of this methodology on several different types of networks.<br />
<br />
== Related Work ==<br />
<br />
Currently, dealing with incomplete inputs in machine learning requires filling absent attributes based on complete, observed data. Two commonly used methods are mean imputation and k-NN imputation. Other methods for dealing with missing data involve training separate neural networks, extreme learning machines, and <math>k</math>-nearest neighbours. Probabilistic models of incomplete data can also be built depending on the mechanism missingness (i.e. whether the data is Missing At Random (MAR), Missing Completely At Random (MCAR), or Missing Not At Random (MNAR)), which can be fed into a particular learning model. Previous work using neural networks for missing data includes a paper by Bengio and Gringras [1] where the authors used recurrent neural networks with feedback into the input units to fill absent attributes solely to minimize the learning criterion. Goodfellow et. al. [2] also used neural networks by introducing a multi-prediction deep Boltzmann machine which could perform classification on data with missingness in the inputs.<br />
<br />
== Layer for Processing Missing Data ==<br />
<br />
In this approach, the adaptation of a given neural network to incomplete data relies on two steps: the estimation of the missing data and the generalization of the neuron's activation. <br />
<br />
Let <math>(x,J)</math> represent a missing data point, where <math>x \in \mathbb{R}^D </math>, and <math>J \subset {1,...,D} </math> is a set of attributes with missing data.<br />
<br />
For each missing point <math>(x,J)</math>, define an affine subspace consisting of all points which coincide with <math>x</math> on known coordinates <math>J'=\{1,…,N\}/J</math>: <br />
<br />
<center><math>S=Aff[x,J]=span(e_J) </math></center> <br />
where <math>e_J=[e_j]_{j\in J}</math> and <math>e_j</math> is the <math> j^{th}</math> canonical vector in <math>\mathbb{R}^D </math>.<br />
<br />
Assume that the missing data points come from the D-dimensional probability distribution, <math>F</math>. In their approach, the authors assume that the data points follow a mixture of Gaussians (GMM) with diagonal covariance matrices. By choosing diagonal covariance matrices, the number of model parameters is reduced. To model the missing points <math>(x,J)</math>, the density <math>F</math> is restricted to the affine subspace <math>S</math>. Thus, possible values of <math>(x,J)</math> are modelled using the conditional density <math>F_S: S \to \mathbb{R} </math>, <br />
<br />
<center><math>F_S(x) = \begin{cases}<br />
\frac{1}{\int_{S} F(s) \,ds}F(x) & \text{if $x \in S$,} \\<br />
0 & \text{otherwise.}<br />
\end{cases} </math></center><br />
<br />
To process the missing data by a neural network, the authors propose that only the first hidden layer needs modification. Specifically, they generalize the activation functions of all the neurons in the first hidden layer of the network to process the probability density functions representing the missing data points. For the conditional density function <math>F_S</math>, the authors define the generalized activation of a neuron <math>n: \mathbb{R}^D \to \mathbb{R}</math> on <math>F_S </math> as: <br />
<br />
<center><math>n(F_S)=E[n(x)|x \sim F_S]=\int n(x)F_S(x) \,dx</math>,</center> <br />
provided that the expectation exists. <br />
<br />
The following two theorems describe how to apply the above generalizations to both the ReLU and the RBF neurons, respectively. <br />
<br />
'''Theorem 3.1''' Let <math>F = \sum_i{p_iN(m_i, \Sigma_i)}</math> be the mixture of (possibly degenerate) Gaussians. Given weights <math>w=(w_1, ..., w_D) \in \mathbb{R}^D,</math><math> b \in \mathbb{R} </math>, we have<br />
<br />
<center><math>\text{ReLU}_{w,b}(F)=\sum_i{p_iNR\big(\frac{w^{\top}m_i+b}{\sqrt{w^{\top}\Sigma_iw}}}\big)</math></center> <br />
<br />
where <math>NR(x)=\text{ReLU}[N(x,1)]</math> and <math>\text{ReLU}_{w,b}(x)=\text{max}(w^{\top}+b, 0)</math>, <math>w \in \mathbb{R}^D </math> and <math> b \in \mathbb{R}</math> is the bias.<br />
<br />
'''Theorem 3.2''' Let <math>F = \sum_i{p_iN(m_i, \Sigma_i)}</math> be the mixture of (possibly degenerate) Gaussians and let the RBF unit be parametrized by <math>N(c, \Gamma) </math>. We have: <br />
<br />
<center><math>\text{RBF}_{c, \Gamma}(F) = \sum_{i=1}^k{p_iN(m_i-c, \Gamma+\Sigma_i)}(0)</math>.</center> <br />
<br />
In the case where the data set contains no missing values, the generalized neurons reduce to classical ones, since the distribution <math>F</math> is only used to estimate possible values at missing attributes. However, if one wishes to use an incomplete data set in the testing stage, then an incomplete data set must be used to train the model.<br />
<br />
<math> </math><br />
<br />
== Theoretical Analysis ==<br />
<br />
The main theoretical results, which are summarized below, show that using generalized neuron's activation at the first layer does not lead to the loss of information. <br />
<br />
Let the generalized response of a neuron <math>n: \mathbb{R}^D \rightarrow \mathbb{R}</math> evaluated on a probability measure <math>\mu</math> which is given by <math>n(\mu) := \int n(x)d\mu(x)</math>.<br />
<br />
Theorem 4.1 shows that a neural network with generalized ReLU units is able to identify any two probability measures. The proof presented by the authors uses the Universal Approximation Property (UAP), and is summarized as follows. <br />
<br />
<br />
'''Theorem 4.1.''' Let <math>\mu</math>, <math>v</math> be probabilistic measures satisfying <math>\int ||x|| d \mu(x) < \infty</math>. If <math>ReLU_{w,b}(\mu) = ReLU_{w,b}(\nu) \text{ for } w \in \mathbb{R}^D, b \in \mathbb{R} \text{ then } v = \mu</math><br />
<br />
''Sketch of Proof'' Let <math>w \in \mathbb{R}^D</math> be fixed and define the set <math>F_w = \{p: \mathbb{R} \rightarrow \mathbb{R}: \int p(w^Tx)d\mu(x) = \int p(w^Tx)d\nu(x)\}</math>. The first step of the proof involves showing that <math>F_w</math> contains all continuous and bounded functions. The authors show this by showing that a piecewise continuous function that is affine linear on specific intervals, <math>Q</math>, is in the set <math>F_w</math>. This involves re-writing <math>Q</math> as a sum of tent-like piecewise linear functions, <math>T</math> and showing that <math>T \in F_w</math> (since it is sufficient to only show <math>T \in F_w</math>). <br />
<br />
Next, the authors show that an arbitrary bounded continuous function <math>G</math> is in <math>F_w</math> by the Lebesgue dominated convergence theorem. <br />
<br />
Then, as <math>cos(\cdot), sin(\cdot) \in F_w</math>, the function <math>exp(ir) = cos(r) + sin(r) \in F_w</math> and we have the equality <math>\int exp(iw^Tx)d\mu(x) = \int exp(iw^Tx)d\nu(x)</math>. Since <math>w</math> was arbitrarily chosen, we can conclude that <math>\mu = \nu</math> <br />
as the characteristic functions of the two measures coincide. <br />
<br />
<br />
More general results can be obtained making stronger assumptions on the probability measures, for example if a given family of neurons satisfies UAP, then their generalization can identify any probability measure with compact support.<br />
<br />
== Experimental Results ==<br />
<br />
<br />
== Conclusion ==<br />
<br />
<br />
== Critiques ==<br />
<br />
== References ==<br />
[1] Yoshua Bengio and Francois Gingras. Recurrent neural networks for missing or asynchronous<br />
data. In Advances in neural information processing systems, pages 395–401, 1996.<br />
<br />
[2] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.</div>Gtompkinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Gtompkin&diff=43147User:Gtompkin2020-11-02T17:47:01Z<p>Gtompkin: /* Theoretical Analysis */</p>
<hr />
<div>== Presented by == <br />
Grace Tompkins, Tatiana Krikella, Swaleh Hussain<br />
<br />
== Introduction ==<br />
<br />
One of the fundamental challenges in machine learning in data science is dealing with missing and incomplete data. This paper proposes theoretically justified methodology for using incomplete data in neural networks, eliminating the need for direct completion of the data by imputation or other commonly used methods in existing literature. The authors propose identifying missing data points with a parametric density and then training it together with the rest of the network's parameters. The neuron's response at the first hidden layer is generalized by taking its expected value to process this probabilistic representation. This process is essentially calculating the average activation of the neuron over imputations drawn from the missing data's density. The proposed approach is advantageous as it has the ability to train neural networks using incomplete observations from datasets, which are ubiquitous in practice. This approach also requires minimal adjustments and modifications to existing architectures. Theoretical results of this study show that this process does not lead to a loss of information, while experimental results showed the practical uses of this methodology on several different types of networks.<br />
<br />
== Related Work ==<br />
<br />
Currently, dealing with incomplete inputs in machine learning requires filling absent attributes based on complete, observed data. Two commonly used methods are mean imputation and k-NN imputation. Other methods for dealing with missing data involve training separate neural networks, extreme learning machines, and <math>k</math>-nearest neighbours. Probabilistic models of incomplete data can also be built depending on the mechanism missingness (i.e. whether the data is Missing At Random (MAR), Missing Completely At Random (MCAR), or Missing Not At Random (MNAR)), which can be fed into a particular learning model. Previous work using neural networks for missing data includes a paper by Bengio and Gringras [1] where the authors used recurrent neural networks with feedback into the input units to fill absent attributes solely to minimize the learning criterion. Goodfellow et. al. [2] also used neural networks by introducing a multi-prediction deep Boltzmann machine which could perform classification on data with missingness in the inputs.<br />
<br />
== Layer for Processing Missing Data ==<br />
<br />
Let <math>(x,J)</math> represent a missing data point, where <math>x \in \mathbb{R}^D </math>, and <math>J \subset {1,...,D} </math> is a set of attributes with missing data.<br />
<br />
For each missing point <math>(x,J)</math>, define an affine subspace consisting of all points which coincide with <math>x</math> on known coordinates <math>J'=\{1,…,N\}/J</math>: <br />
<br />
<center><math>S=Aff[x,J]=span(e_J) </math></center> <br />
where <math>e_J=[e_j]_{j\in J}</math> and <math>e_j</math> is the <math> j^{th}</math> canonical vector in <math>\mathbb{R}^D </math>.<br />
<br />
Assume that the missing data points come from the D-dimensional probability distribution, <math>F</math>. In their approach, the authors assume that the data points follow a mixture of Gaussians (GMM) with diagonal covariance matrices. By choosing diagonal covariance matrices, the number of model parameters is reduced. To model the missing points <math>(x,J)</math>, the density <math>F</math> is restricted to the affine subspace <math>S</math>. Thus, possible values of <math>(x,J)</math> are modelled using the conditional density <math>F_S: S \to \mathbb{R} </math>, <br />
<br />
<center><math>F_S(x) = \begin{cases}<br />
\frac{1}{\int_{S} F(s) \,ds}F(x) & \text{if $x \in S$,} \\<br />
0 & \text{otherwise.}<br />
\end{cases} </math></center><br />
<br />
To process the missing data by a neural network, the authors propose that only the first hidden layer needs modification. Specifically, they generalize the activation functions of all the neurons in the first hidden layer of the network to process the probability density functions representing the missing data points. For the conditional density function <math>F_S</math>, the authors define the generalized activation of a neuron <math>n: \mathbb{R}^D \to \mathbb{R}</math> on <math>F_S </math> as: <br />
<br />
<center><math>n(F_S)=E[n(x)|x \sim F_S]=\int n(x)F_S(x) \,dx</math>,</center> <br />
provided that the expectation exists. <br />
<br />
This generalization can be applied to both the ReLU and RBF neurons, and two theorems are proposed that describe how to apply this generalization to both neurons. <br />
<br />
'''Theorem 3.1''' Let <math>F = \sum_i{p_iN(m_i, \Sigma_i)}</math> be the mixture of (possibly degenerate) Gaussians. Given weights <math>w=(w_1, ..., w_D) \in \mathbb{R}^D,</math><math> b \in \mathbb{R} </math>, we have<br />
<br />
<center><math>\text{ReLU}_{w,b}(F)=\sum_i{p_iNR(\frac{w^{\top}m_i+b}{\sqrt{w^{\top}\Sigma_iw}}})</math></center> <br />
<br />
where <math>NR(x)=\text{ReLU}[N(x,1)]</math> and <math>\text{ReLU}_w,b(x)=\text{max}(w^{\top}+b, 0)</math>, <math>w \in \mathbb{R}^D </math> and <math> b \in \mathbb{R}</math> is the bias.<br />
<math> </math><br />
<br />
== Theoretical Analysis ==<br />
<br />
The main theoretical results, which are summarized below, show that using generalized neuron's activation at the first layer does not lead to the loss of information. <br />
<br />
Let the generalized response of a neuron <math>n: \mathbb{R}^D \rightarrow \mathbb{R}</math> evaluated on a probability measure <math>\mu</math> which is given by <math>n(\mu) := \int n(x)d\mu(x)</math>.<br />
<br />
Theorem 4.1 shows that a neural network with generalized ReLU units is able to identify any two probability measures. The proof presented by the authors uses the Universal Approximation Property (UAP), and is summarized as follows. <br />
<br />
'''Theorem 4.1.''' Let <math>\mu, v<\math> be probabilistic measures satisfying <math>\int ||x|| d \mu(x) < \infty</math><br />
<br />
== Experimental Results ==<br />
<br />
<br />
== Conclusion ==<br />
<br />
<br />
== Critiques ==<br />
<br />
== References ==<br />
[1] Yoshua Bengio and Francois Gingras. Recurrent neural networks for missing or asynchronous<br />
data. In Advances in neural information processing systems, pages 395–401, 1996.<br />
<br />
[2] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.</div>Gtompkinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Gtompkin&diff=43115User:Gtompkin2020-11-02T16:33:31Z<p>Gtompkin: /* Related Work */</p>
<hr />
<div>== Presented by == <br />
Grace Tompkins, Tatiana Krikella, Swaleh Hussain<br />
<br />
== Introduction ==<br />
<br />
One of the fundamental challenges in machine learning in data science is dealing with missing and incomplete data. This paper proposes theoretically justified methodology for using incomplete data in neural networks, eliminating the need for direct completion of the data by imputation or other commonly used methods in existing literature. The authors propose identifying missing data points with a parametric density and then training it together with the rest of the network's parameters. The neuron's response at the first hidden layer is generalized by taking its expected value to process this probabilistic representation. This process is essentially calculating the average activation of the neuron over imputations drawn from the missing data's density. The proposed approach is advantageous as it has the ability to train neural networks using incomplete observations from datasets, which are ubiquitous in practice. This approach also requires minimal adjustments and modifications to existing architectures. Theoretical results of this study show that this process does not lead to a loss of information, while experimental results showed the practical uses of this methodology on several different types of networks.<br />
<br />
== Related Work ==<br />
<br />
Currently, dealing with incomplete inputs in machine learning requires filling absent attributes based on complete, observed data. Two commonly used methods are mean imputation and k-NN imputation. Other methods for dealing with missing data involve training separate neural networks, extreme learning machines, and <math>k</math>-nearest neighbours. Probabilistic models of incomplete data can also be built depending on the mechanism missingness (i.e. whether the data is Missing At Random (MAR), Missing Completely At Random (MCAR), or Missing Not At Random (MNAR)), which can be fed into a particular learning model. Previous work using neural networks for missing data includes a paper by Bengio and Gringras [1] where the authors used recurrent neural networks with feedback into the input units to fill absent attributes solely to minimize the learning criterion. Goodfellow et. al. [2] also used neural networks by introducing a multi-prediction deep Boltzmann machine which could perform classification on data with missingness in the inputs.<br />
<br />
== Layer for Processing Missing Data == <br />
<br />
<br />
== Theoretical Analysis ==<br />
<br />
== Experimental Results ==<br />
<br />
<br />
== Conclusion ==<br />
<br />
<br />
== Critiques ==<br />
<br />
== References ==<br />
[1] Yoshua Bengio and Francois Gingras. Recurrent neural networks for missing or asynchronous<br />
data. In Advances in neural information processing systems, pages 395–401, 1996.<br />
<br />
[2] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.</div>Gtompkinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Gtompkin&diff=43114User:Gtompkin2020-11-02T16:31:16Z<p>Gtompkin: /* References */</p>
<hr />
<div>== Presented by == <br />
Grace Tompkins, Tatiana Krikella, Swaleh Hussain<br />
<br />
== Introduction ==<br />
<br />
One of the fundamental challenges in machine learning in data science is dealing with missing and incomplete data. This paper proposes theoretically justified methodology for using incomplete data in neural networks, eliminating the need for direct completion of the data by imputation or other commonly used methods in existing literature. The authors propose identifying missing data points with a parametric density and then training it together with the rest of the network's parameters. The neuron's response at the first hidden layer is generalized by taking its expected value to process this probabilistic representation. This process is essentially calculating the average activation of the neuron over imputations drawn from the missing data's density. The proposed approach is advantageous as it has the ability to train neural networks using incomplete observations from datasets, which are ubiquitous in practice. This approach also requires minimal adjustments and modifications to existing architectures. Theoretical results of this study show that this process does not lead to a loss of information, while experimental results showed the practical uses of this methodology on several different types of networks.<br />
<br />
== Related Work ==<br />
<br />
Currently, dealing with incomplete inputs in machine learning requires filling absent attributes based on complete, observed data. Two commonly used methods are mean imputation and k-NN imputation. Other methods for dealing with missing data involve training separate neural networks, extreme learning machines, and <math>k</math>-nearest neighbours. Probabilistic models of incomplete data can also be built depending on the mechanism missingness (i.e. whether the data is Missing At Random (MAR), Missing Completely At Random (MCAR), or Missing Not At Random (MNAR)), which can be fed into a particular learning model. Previous work using neural networks for missing data include a paper by Bengio and Gringras [1] and Goodfellow et. al. [2].<br />
<br />
== Layer for Processing Missing Data == <br />
<br />
<br />
== Theoretical Analysis ==<br />
<br />
== Experimental Results ==<br />
<br />
<br />
== Conclusion ==<br />
<br />
<br />
== Critiques ==<br />
<br />
== References ==<br />
[1] Yoshua Bengio and Francois Gingras. Recurrent neural networks for missing or asynchronous<br />
data. In Advances in neural information processing systems, pages 395–401, 1996.<br />
<br />
[2] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.</div>Gtompkinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Gtompkin&diff=43113User:Gtompkin2020-11-02T16:31:08Z<p>Gtompkin: /* Related Work */</p>
<hr />
<div>== Presented by == <br />
Grace Tompkins, Tatiana Krikella, Swaleh Hussain<br />
<br />
== Introduction ==<br />
<br />
One of the fundamental challenges in machine learning in data science is dealing with missing and incomplete data. This paper proposes theoretically justified methodology for using incomplete data in neural networks, eliminating the need for direct completion of the data by imputation or other commonly used methods in existing literature. The authors propose identifying missing data points with a parametric density and then training it together with the rest of the network's parameters. The neuron's response at the first hidden layer is generalized by taking its expected value to process this probabilistic representation. This process is essentially calculating the average activation of the neuron over imputations drawn from the missing data's density. The proposed approach is advantageous as it has the ability to train neural networks using incomplete observations from datasets, which are ubiquitous in practice. This approach also requires minimal adjustments and modifications to existing architectures. Theoretical results of this study show that this process does not lead to a loss of information, while experimental results showed the practical uses of this methodology on several different types of networks.<br />
<br />
== Related Work ==<br />
<br />
Currently, dealing with incomplete inputs in machine learning typically requires filling in absent attributes based on complete, observed data. Two commonly used methods are mean imputation and <math>k</math>-NN imputation. Other methods for dealing with missing data involve training separate neural networks, extreme learning machines, and <math>k</math>-nearest neighbours. Probabilistic models of incomplete data can also be built, depending on the missingness mechanism (i.e. whether the data is Missing At Random (MAR), Missing Completely At Random (MCAR), or Missing Not At Random (MNAR)), and fed into a particular learning model. Previous work using neural networks for missing data includes a paper by Bengio and Gingras [1] and one by Goodfellow et al. [2].<br />
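For concreteness, the two baseline imputation methods mentioned above can be sketched in a few lines (a minimal NumPy illustration, not code from any of the cited works; <code>knn_impute</code> assumes at least <math>k</math> fully observed rows exist):<br />

```python
import numpy as np

def mean_impute(X):
    """Replace NaNs in each column with that column's observed mean."""
    X = X.astype(float).copy()
    col_means = np.nanmean(X, axis=0)
    nan_rows, nan_cols = np.where(np.isnan(X))
    X[nan_rows, nan_cols] = col_means[nan_cols]
    return X

def knn_impute(X, k=2):
    """Fill each incomplete row with the mean of its k nearest fully
    observed rows, with distance computed on the observed coordinates."""
    X = X.astype(float).copy()
    complete = X[~np.isnan(X).any(axis=1)]
    for i in range(X.shape[0]):
        miss = np.isnan(X[i])
        if not miss.any():
            continue
        d = np.linalg.norm(complete[:, ~miss] - X[i, ~miss], axis=1)
        nearest = complete[np.argsort(d)[:k]]
        X[i, miss] = nearest[:, miss].mean(axis=0)
    return X
```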
<br />
== Layer for Processing Missing Data == <br />
<br />
<br />
== Theoretical Analysis ==<br />
<br />
== Experimental Results ==<br />
<br />
<br />
== Conclusion ==<br />
<br />
<br />
== Critiques ==<br />
<br />
== References ==<br />
[1] Yoshua Bengio and Francois Gingras. Recurrent neural networks for missing or asynchronous<br />
data. In Advances in neural information processing systems, pages 395–401, 1996.<br />
<br />
[2] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT Press, 2016.</div>Gtompkinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441F21&diff=43105stat441F212020-11-02T15:38:48Z<p>Gtompkin: /* Paper presentation */</p>
<hr />
<div><br />
<br />
== [[F20-STAT 441/841 CM 763-Proposal| Project Proposal ]] ==<br />
<br />
<!--[https://goo.gl/forms/apurag4dr9kSR76X2 Your feedback on presentations]--><br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/10CHiJpAylR6kB9QLqN7lZHN79D9YEEW6CDTH27eAhbQ/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="250pt"|Name <br />
|width="15pt"|Paper number <br />
|width="700pt"|Title<br />
|width="15pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|width="30pt"|Link to the video<br />
|-<br />
|Sep 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary] || [https://youtu.be/JWozRg_X-Vg?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG&t=539]<br />
|-<br />
|Week of Nov 16 ||Sharman Bharat, Li Dylan,Lu Leonie, Li Mingdao || 1|| Risk prediction in life insurance industry using supervised learning algorithms || [https://rdcu.be/b780J Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Bsharman Summary] ||<br />
|-<br />
|Week of Nov 16 || Delaney Smith, Mohammad Assem Mahmoud || 2|| Influenza Forecasting Framework based on Gaussian Processes || [https://proceedings.icml.cc/static/paper_files/icml/2020/1239-Paper.pdf] paper || ||<br />
|-<br />
|Week of Nov 16 || Tatianna Krikella, Swaleh Hussain, Grace Tompkins || 3|| Processing of Missing Data by Neural Networks || [http://papers.nips.cc/paper/7537-processing-of-missing-data-by-neural-networks] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Gtompkin] ||<br />
|-<br />
|Week of Nov 16 ||Jonathan Chow, Nyle Dharani, Ildar Nasirov ||4 ||Streaming Bayesian Inference for Crowdsourced Classification ||[https://papers.nips.cc/paper/9439-streaming-bayesian-inference-for-crowdsourced-classification.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Matthew Hall, Johnathan Chalaturnyk || 5|| Neural Ordinary Differential Equations || [https://papers.nips.cc/paper/7892-neural-ordinary-differential-equations.pdf] || ||<br />
|-<br />
|Week of Nov 16 || Luwen Chang, Qingyang Yu, Tao Kong, Tianrong Sun || 6|| Adversarial Attacks on Copyright Detection Systems || Paper [https://proceedings.icml.cc/static/paper_files/icml/2020/1894-Paper.pdf] || ||<br />
|-<br />
|Week of Nov 16 || Casey De Vera, Solaiman Jawad, Jihoon Han || 7|| || || ||<br />
|-<br />
|Week of Nov 16 || Yuxin Wang, Evan Peters, Cynthia Mou, Sangeeth Kalaichanthiran || 8|| Uniform convergence may be unable to explain generalization in deep learning || [https://papers.nips.cc/paper/9336-uniform-convergence-may-be-unable-to-explain-generalization-in-deep-learning.pdf] || ||<br />
|-<br />
|Week of Nov 16 || Yuchuan Wu || 9|| || || ||<br />
|-<br />
|Week of Nov 16 || Zhou Zeping, Siqi Li, Yuqin Fang, Fu Rao || 10|| The Spectrum of the Fisher Information Matrix of a Single-Hidden-Layer Neural Network || [http://people.cs.uchicago.edu/~pworah/rmt2.pdf] || ||<br />
|-<br />
|Week of Nov 23 ||Jinjiang Lian, Jiawen Hou, Yisheng Zhu, Mingzhe Huang || 11|| DROCC: Deep Robust One-Class Classification || [https://proceedings.icml.cc/static/paper_files/icml/2020/6556-Paper.pdf paper] || ||<br />
|-<br />
|Week of Nov 23 || Bushra Haque, Hayden Jones, Michael Leung, Cristian Mustatea || 12|| Combine Convolution with Recurrent Networks for Text Classification || [https://arxiv.org/pdf/2006.15795.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Taohao Wang, Zeren Shen, Zihao Guo, Rui Chen || 13|| Deep multiple instance learning for image classification and auto-annotation || [https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Wu_Deep_Multiple_Instance_2015_CVPR_paper.pdf paper] || ||<br />
|-<br />
|Week of Nov 23 || Qianlin Song, William Loh, Junyue Bai, Phoebe Choi || 14|| Task Understanding from Confusing Multi-task Data || [https://proceedings.icml.cc/static/paper_files/icml/2020/578-Paper.pdf paper] || ||<br />
|-<br />
|Week of Nov 23 || Rui Gong, Xuetong Wang, Xinqi Ling, Di Ma || 15|| Semantic Relation Classification via Convolution Neural Network|| [https://www.aclweb.org/anthology/S18-1127.pdf paper] || ||<br />
|-<br />
|Week of Nov 23 || Xiaolan Xu, Robin Wen, Yue Weng, Beizhen Chang || 16|| Graph Structure of Neural Networks || [https://proceedings.icml.cc/paper/2020/file/757b505cfd34c64c85ca5b5690ee5293-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 ||Hansa Halim, Sanjana Rajendra Naik, Samka Marfua, Shawrupa Proshasty || 17|| Superhuman AI for multiplayer poker || [https://www.cs.cmu.edu/~noamb/papers/19-Science-Superhuman.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 ||Guanting Pan, Haocheng Chang, Zaiwei Zhang || 18|| Point-of-Interest Recommendation: Exploiting Self-Attentive Autoencoders with Neighbor-Aware Influence || [https://arxiv.org/pdf/1809.10770.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Jerry Huang, Daniel Jiang, Minyan Dai, Leyan Cheng || 19|| Neural Speed Reading Via Skim-RNN ||[https://arxiv.org/pdf/1711.02085.pdf?fbclid=IwAR3EeFsKM_b5p9Ox7X9mH-1oI3U3oOKPBy3xUOBN0XvJa7QW2ZeJJ9ypQVo Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Neural_Speed_Reading_via_Skim-RNN Summary]||<br />
|-<br />
|Week of Nov 23 ||Ruixian Chin, Yan Kai Tan, Jason Ong, Wen Cheen Chiew || 20|| DivideMix: Learning with Noisy Labels as Semi-supervised Learning || [https://openreview.net/pdf?id=HJgExaVtwr] || ||<br />
|-<br />
|Week of Nov 30 || Banno Dion, Battista Joseph, Kahn Solomon || 21|| Music Recommender System Based on Genre using Convolutional Recurrent Neural Networks || [https://www.sciencedirect.com/science/article/pii/S1877050919310646] || ||<br />
|-<br />
|Week of Nov 30 || Sai Arvind Budaraju, Isaac Ellmen, Dorsa Mohammadrezaei, Emilee Carson || 22|| A universal SNP and small-indel variant caller using deep neural networks||[https://www.nature.com/articles/nbt.4235.epdf?author_access_token=q4ZmzqvvcGBqTuKyKgYrQ9RgN0jAjWel9jnR3ZoTv0NuM3saQzpZk8yexjfPUhdFj4zyaA4Yvq0LWBoCYQ4B9vqPuv8e2HHy4vShDgEs8YxI_hLs9ov6Y1f_4fyS7kGZ Paper] || ||<br />
|-<br />
|Week of Nov 30 || Daniel Fagan, Cooper Brooke, Maya Perelman || 23|| Efficient kNN Classification With Different Number of Nearest Neighbors || [https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7898482 Paper] || ||<br />
|-<br />
|Week of Nov 30 || Karam Abuaisha, Evan Li, Jason Pu, Nicholas Vadivelu || 24|| Being Bayesian about Categorical Probability || [https://proceedings.icml.cc/static/paper_files/icml/2020/3560-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 || Anas Mahdi Will Thibault Jan Lau Jiwon Yang || 25|| Loss Function Search for Face Recognition<br />
|| [https://proceedings.icml.cc/static/paper_files/icml/2020/245-Paper.pdf] paper || ||<br />
|-<br />
|Week of Nov 30 ||Zihui (Betty) Qin, Wenqi (Maggie) Zhao, Muyuan Yang, Amartya (Marty) Mukherjee || 26|| Deep Learning for Cardiologist-level Myocardial Infarction Detection in Electrocardiograms || [https://arxiv.org/pdf/1912.07618.pdf?fbclid=IwAR0RwATSn4CiT3qD9LuywYAbJVw8YB3nbex8Kl19OCExIa4jzWaUut3oVB0 Paper] || ||<br />
|-<br />
|Week of Nov 30 || Stan Lee, Seokho Lim, Kyle Jung, Daehyun Kim || 27|| Bag of Tricks for Efficient Text Classification || [https://arxiv.org/pdf/1607.01759.pdf paper] || ||<br />
|-<br />
|Week of Nov 30 || Yawen Wang, Danmeng Cui, ZiJie Jiang, Mingkang Jiang, Haotian Ren, Haris Bin Zahid || 28|| A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques || [https://arxiv.org/pdf/1707.02919.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 || Qing Guo, XueGuang Ma, James Ni, Yuanxin Wang || 29|| Mask R-CNN || [https://arxiv.org/pdf/1703.06870.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 || Bertrand Sodjahin, Junyi Yang, Jill Yu Chieh Wang, Yu Min Wu, Calvin Li || 30|| Research paper classifcation systems based on TF‑IDF and LDA schemes || [https://hcis-journal.springeropen.com/articles/10.1186/s13673-019-0192-7?fbclid=IwAR3swO-eFrEbj1BUQfmomJazxxeFR6SPgr6gKayhs38Y7aBG-zX1G3XWYRM Paper] || ||<br />
|-<br />
|Week of Nov 30 || Daniel Zhang, Jacky Yao, Scholar Sun, Russell Parco, Ian Cheung || 31 || Speech2Face: Learning the Face Behind a Voice || [https://arxiv.org/pdf/1905.09773.pdf?utm_source=thenewstack&utm_medium=website&utm_campaign=platform Paper] || ||<br />
|-<br />
|Week of Nov 30 || Siyuan Xia, Jiaxiang Liu, Jiabao Dong, Yipeng Du || 32 || Evaluating Machine Accuracy on ImageNet || [https://proceedings.icml.cc/static/paper_files/icml/2020/6173-Paper.pdf] || ||<br />
|-<br />
|Week of Nov 30 || Msuhi Wang, Siyuan Qiu, Yan Yu || 33 || Surround Vehicle Motion Prediction Using LSTM-RNN for Motion Planning of Autonomous Vehicles at Multi-Lane Turn Intersections || [https://ieeexplore.ieee.org/abstract/document/8957421 paper] || ||</div>Gtompkinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Neural_Speed_Reading_via_Skim-RNN&diff=43102Neural Speed Reading via Skim-RNN2020-11-02T15:31:55Z<p>Gtompkin: /* Conclusion */</p>
<hr />
<div>== Group ==<br />
<br />
Leyan Cheng, Mingyan Dai, Jerry Huang, Daniel Jiang<br />
<br />
== Introduction ==<br />
<br />
In Natural Language Processing, recurrent neural networks (RNNs) are a common architecture used to sequentially 'read' input tokens and output a distributed representation for each token. Because an RNN recurrently updates its hidden state, it inherently incurs the same computational cost at every time step. However, some input tokens matter less to the overall representation of a piece of text or a query than others. In question answering in particular, the network often encounters parts of a passage that are irrelevant to the query being asked.<br />
<br />
== Model ==<br />
<br />
In this paper, the authors introduce a model called 'Skim-RNN', which 'skims' less important tokens or pieces of text rather than 'skipping' them entirely. This models the human ability to skim through passages, spending less time on parts that do not affect the reader's main objective. While skimming leads to a loss in the comprehension rate of the text <ref>Patricia Anderson Carpenter Marcel Adam Just. The Psychology of Reading and Language Comprehension. 1987.</ref>, it greatly reduces reading time without significantly hurting performance on the reader's objective.<br />
<br />
Skim-RNN works by rapidly determining the significance of each input and spending less time processing unimportant input tokens, using a smaller RNN to update only a fraction of the hidden state. When the decision is to 'fully read' (that is, not to skim), Skim-RNN updates the entire hidden state with the default RNN cell. Since the hard decision function ('skim' or 'read') is non-differentiable, the authors use the Gumbel-softmax [2] to estimate the gradient of the function, rather than traditional methods such as REINFORCE (policy gradient) [3]. The switching mechanism between the two RNN cells enables Skim-RNN to reduce the total number of float operations (Flop reduction, or Flop-R). A high skimming rate often leads to faster inference on CPUs, which makes the model attractive for large-scale products and small devices.<br />
<br />
The Skim-RNN has the same input and output interfaces as standard RNNs, so it can be conveniently used to speed up RNNs in existing models. In addition, the speed of Skim-RNN can be dynamically controlled at inference time by adjusting a parameter for the threshold for the ‘skim’ decision.<br />
<br />
=== Implementation ===<br />
<br />
A Skim-RNN consists of two RNN cells: a default (big) RNN cell of hidden state size <math>d</math> and a small RNN cell of hidden state size <math>d'</math>, where <math>d</math> and <math>d'</math> are parameters defined by the user and <math>d' \ll d</math>. This reflects the design: the small RNN cell is used when text is meant to be skimmed, and the larger one when the text should be processed as normal.<br />
<br />
Each RNN cell has its own set of weights and biases, and can be any variant of an RNN. There is no requirement on how the RNN itself is structured; the core idea is to allow the model to dynamically decide which cell to use when processing each input token. Note that skipping text can be incorporated by setting <math>d'</math> to 0, in which case nothing about an input token deemed irrelevant to the query or classification task is retained within the model.<br />
<br />
This model is faster than using a single large RNN to process all input tokens, as the smaller RNN requires fewer floating point operations to process the token.<br />
<br />
==== Inference ====<br />
<br />
At each time step <math>t</math>, the Skim-RNN unit takes in an input <math>{\bf x}_t \in \mathbb{R}^d</math> as well as the previous hidden state <math>{\bf h}_{t-1} \in \mathbb{R}^d</math> and outputs the new state <math>{\bf h}_t</math> (although the dimensions of the hidden state and input are taken to be equal here, the process holds for different sizes as well). At each step, the Skim-RNN makes a hard decision whether to read or skim the input, although there is potential to include options for multiple levels of skimming.<br />
<br />
The decision to read or skim is done using a multinomial random variable <math>Q_t</math> over the probability distribution of choices <math>{\bf p}_t</math>, where<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math>{\bf p}_t = \text{softmax}(\alpha({\bf x}_t, {\bf h}_{t-1})) = \text{softmax}({\bf W}[{\bf x}_t; {\bf h}_{t-1}]+{\bf b}) \in \mathbb{R}^k</math><br />
</div><br />
<br />
where <math>{\bf W} \in \mathbb{R}^{k \times 2d}</math>, <math>{\bf b} \in \mathbb{R}^{k}</math> are weights to be learned and <math>[{\bf x}_t; {\bf h}_{t-1}] \in \mathbb{R}^{2d}</math> indicates the row concatenation of the two vectors. In this case <math> \alpha </math> can have any form as long as the complexity of calculating it is less than <math> O(d^2)</math>. Letting <math>{\bf p}^1_t</math> indicate the probability for fully reading and <math>{\bf p}^2_t</math> indicate the probability for skimming the input at time <math> t</math>, it follows that the decision to read or skim can be modelled using a random variable <math> Q_t</math> by sampling from the distribution <math>{\bf p}_t</math> and<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math>Q_t \sim \text{Multinomial}({\bf p}_t)</math><br />
</div><br />
<br />
Without loss of generality, we can define <math> Q_t = 1</math> to indicate that the input will be read while <math> Q_t = 2</math> indicates that it will be skimmed. Reading requires applying the full RNN on the input as well as the previous hidden state to modify the entire hidden state, while skimming only modifies part of the prior hidden state.<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
{\bf h}_t = \begin{cases}<br />
f({\bf x}_t, {\bf h}_{t-1}) & Q_t = 1\\<br />
[f'({\bf x}_t, {\bf h}_{t-1});{\bf h}_{t-1}(d'+1:d)] & Q_t = 2<br />
\end{cases}<br />
</math><br />
</div><br />
<br />
where <math> f </math> is a full RNN with output of dimension <math>d</math> and <math>f'</math> is a smaller RNN with <math>d'</math>-dimensional output. This has advantage that when the model decides to skim, then the computational complexity of that step is only <math>O(d'd)</math>, which is much smaller than <math>O(d^2)</math> due to previously defining <math> d' \ll d</math>.<br />
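The decision step and the two-cell update can be sketched as follows (an illustrative NumPy version with toy <code>tanh</code> cells standing in for the learned big and small RNN cells; all names here are assumptions for this example, not the paper's code):<br />

```python
import numpy as np

def skim_decision_probs(x, h_prev, W, b):
    """p_t = softmax(W [x; h_{t-1}] + b): probabilities of (read, skim)."""
    z = W @ np.concatenate([x, h_prev]) + b
    z = z - z.max()               # numerical stability
    p = np.exp(z)
    return p / p.sum()

def skim_rnn_step(x, h_prev, big_cell, small_cell, d_small, read):
    """One Skim-RNN update. If read (Q_t = 1), the big cell rewrites the
    whole hidden state; otherwise (Q_t = 2) the small cell rewrites only
    the first d_small entries and the rest are carried over from h_{t-1}."""
    if read:
        return big_cell(x, h_prev)
    h = h_prev.copy()
    h[:d_small] = small_cell(x, h_prev)
    return h

# Toy stand-ins for the learned cells (d = 4, d' = 2).
big = lambda x, h: np.tanh(x + h)
small = lambda x, h: np.tanh(x[:2] + h[:2])
```

In practice <math>{\bf W}</math>, <math>{\bf b}</math>, and both cells are learned, and the read/skim choice is sampled from <math>{\bf p}_t</math> rather than passed in as a flag.<br />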
<br />
==== Training ====<br />
<br />
Since the loss/error of the model is a random variable that depends on the sequence of decision variables <math> \{Q_t\} </math>, the expected loss is minimized with respect to the distribution of these variables. Let the loss conditioned on a particular sequence of decisions be<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
L(\theta\vert Q)<br />
</math><br />
</div><br />
where <math>Q=Q_1\dots Q_T</math> is a sequence of decisions of length <math>T</math>. The expected loss over the distribution of decision sequences is then<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
\mathbb{E}[L(\theta)] = \sum_{Q} L(\theta\vert Q)P(Q) = \sum_Q L(\theta\vert Q) \prod_j {\bf p}_j^{Q_j}<br />
</math><br />
</div><br />
<br />
Since calculating the gradient of <math>\mathbb{E}[L(\theta)]</math> directly is infeasible, the gradients can instead be approximated with a Gumbel-softmax distribution [2]. Reparameterizing <math> {\bf p}_t</math> as <math> {\bf r}_t</math>, back-propagation can flow to <math> {\bf p}_t</math> without being blocked by <math> Q_t</math>, and the approximation can be made arbitrarily close to <math> Q_t</math> by controlling the temperature parameter. The reparameterized distribution is therefore<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
{\bf r}_t^i = \frac{\exp((\log {\bf p}_t^i + {g_t}^i)/\tau)}{\sum_j\exp((\log {\bf p}_t^j + {g_t}^j)/\tau)}<br />
</math><br />
</div><br />
<br />
where <math>{g_t}^i</math> is an independent sample from a <math>\text{Gumbel}(0, 1) = -\log(-\log(\text{Uniform}(0, 1)))</math> random variable and <math>\tau</math> is a temperature parameter. Then it can be rewritten that<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
{\bf h}_t = \sum_i {\bf r}_t^i {\bf \tilde{h}}_t^i<br />
</math><br />
</div><br />
<br />
where <math>{\bf \tilde{h}}_t^i</math> is the candidate hidden state produced by choice <math>i</math> in the earlier equation for <math>{\bf h}_t</math>. The temperature parameter gradually decreases over training, and <math>{\bf r}_t^i</math> becomes more discrete (closer to one-hot) as <math>\tau</math> approaches 0.<br />
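The sampling step of this reparameterization can be sketched as (an illustrative NumPy version of the standard Gumbel-softmax trick; the function name is an assumption):<br />

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(p, tau):
    """Draw a relaxed (differentiable) sample r from categorical
    probabilities p using Gumbel(0, 1) noise and temperature tau.
    As tau -> 0, r approaches a one-hot sample from p."""
    g = -np.log(-np.log(rng.uniform(size=p.shape)))  # Gumbel(0, 1) samples
    logits = (np.log(p) + g) / tau
    logits = logits - logits.max()                   # numerical stability
    r = np.exp(logits)
    return r / r.sum()
```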
<br />
A final addition to the model is a term that encourages skimming when possible: the negative log probability of skimming, averaged over the sequence length, is added to the loss. The final loss function used for the model is therefore<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
L'(\theta) = L(\theta) + \gamma \cdot \frac{1}{T} \sum_t -\log({\bf \tilde{p}}^2_t)<br />
</math><br />
</div><br />
where <math> \gamma </math> is a parameter used to control the ratio between the main loss function and the negative log probability of skimming.<br />
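As a small illustration of this objective (a sketch only; the task loss and skim probabilities below are placeholders, not values from the paper):<br />

```python
import numpy as np

def skim_regularized_loss(base_loss, p_skim, gamma):
    """Total loss L'(theta): the task loss plus gamma times the average
    negative log probability of skimming, which rewards high skim rates.
    p_skim[t] is the model's predicted probability of skimming token t."""
    return base_loss + gamma * float(np.mean(-np.log(p_skim)))
```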
<br />
== Experiment ==<br />
<br />
The effectiveness of Skim-RNN was measured in terms of accuracy and float-operation reduction on four classification tasks and a question answering task. These tasks were chosen because they do not require full attention to every detail of the text, but rather require capturing high-level information (classification) or focusing on a specific portion (QA) of the text, which is a common context for speed reading. The tasks themselves are listed in the table below.<br />
<br />
[[File:Table1SkimRNN.png|center|1000px]]<br />
<br />
=== Classification Tasks ===<br />
<br />
In a language classification task, the input is a sequence of words and the output is a vector of categorical probabilities. Each word is embedded into a <math>d</math>-dimensional vector, initialized with GloVe [4], and these representations are used as the inputs to a long short-term memory (LSTM) architecture. A linear transformation on the last hidden state of the LSTM followed by a softmax function gives the classification probabilities. Adam [5] was used for optimization, with an initial learning rate of 0.0001. For Skim-LSTM, <math>\tau = \max(0.5, \exp(-rn))</math> where <math>r = 10^{-4}</math> and <math>n</math> is the global training step, following [2]. Experiments were run with different sizes of the big LSTM (<math>d \in \{100, 200\}</math>) and the small LSTM (<math>d' \in \{5, 10, 20\}</math>), and with different ratios between the model loss and the skim loss (<math>\gamma \in \{0.01, 0.02\}</math>). The batch sizes were 32 for SST and Rotten Tomatoes, and 128 for the others. For all models, early stopping was applied when the validation accuracy did not increase for 3000 global steps.<br />
<br />
==== Results ====<br />
<br />
[[File:Table2SkimRNN.png|center|1000px]]<br />
<br />
[[File:Figure2SkimRNN.png|center|1000px]]<br />
<br />
Table 2 shows the accuracy and the computational cost of the Skim-RNN model compared with other standard models. It is evident that the Skim-RNN model produces a speed-up on the computational complexity of the task while maintaining a high degree of accuracy. Figure 2 meanwhile demonstrates the effect of varying the size of the small hidden state as well as the parameter <math>\gamma</math> on the accuracy and computational cost.<br />
<br />
[[File:Table3SkimRNN.png|center|1000px]]<br />
<br />
Table 3 shows an example of a classification task on the IMDb dataset, where Skim-RNN with <math>d = 200</math>, <math>d' = 10</math>, and <math>\gamma = 0.01</math> correctly classifies the input with a high skimming rate (92%). The goal was to classify the review as either positive or negative. Black words are skimmed, and blue words are fully read. The skimmed words are clearly irrelevant, and the model learns to carefully read only the important words, such as ‘liked’, ‘dreadful’, and ‘tiresome’.<br />
<br />
=== Question Answering Task ===<br />
<br />
In the Stanford Question Answering Dataset (SQuAD), the task is to locate the answer span for a given question in a context paragraph. The effectiveness of Skim-RNN on SQuAD was evaluated using two different models: LSTM+Attention and BiDAF [6]. The first model was inspired by most then-current QA systems, consisting of multiple LSTM layers and an attention mechanism; it is complex enough to reach reasonable accuracy on the dataset, yet simple enough to run well-controlled analyses of Skim-RNN. The second model was an open-source model designed for SQuAD, used primarily to show that Skim-RNN can replace the RNN in existing complex systems.<br />
<br />
==== Training ==== <br />
<br />
Adam was used with an initial learning rate of 0.0005. For stable training, the model was pretrained with a standard LSTM for the first 5k steps, and then fine-tuned with Skim-LSTM.<br />
<br />
==== Results ====<br />
<br />
[[File:Table4SkimRNN.png|center|1000px]]<br />
<br />
Table 4 shows the accuracy (F1 and EM) of the LSTM+Attention and Skim-LSTM+Attention models as well as VCRNN [7]. The skimming models achieve higher or similar accuracy compared to the non-skimming models while reducing the computational cost by more than 1.4 times. In contrast, decreasing the number of layers (to 1) or the hidden size (to <math>d=5</math>) improved the computational cost but significantly decreased the accuracy compared to skimming. The table also shows that replacing LSTM with Skim-LSTM in an existing complex model (BiDAF) stably reduces the computational cost without losing much accuracy (only a 0.2% drop, from 77.3% for BiDAF to 77.1% for Sk-BiDAF with <math>\gamma = 0.001</math>).<br />
<br />
One explanation given for this trend is that the model is more confident about which tokens are important at the second layer. In addition, higher <math>\gamma</math> values lead to higher skimming rates, which agrees with the intended functionality of the skim loss.<br />
<br />
Figure 4 shows the F1 score of the LSTM+Attention model using standard LSTM and Skim-LSTM, sorted in ascending order by Flop-R (computational cost). While models tend to perform better with larger computational cost, Skim-LSTM (red) outperforms standard LSTM (blue) at comparable computational cost. The accuracy of Skim-LSTM is also more stable across different configurations and computational costs. Moreover, increasing the value of <math>\gamma</math> for Skim-LSTM gradually increases the skimming rate and Flop-R, while also reducing accuracy.<br />
<br />
=== Runtime Benchmark ===<br />
<br />
[[File:Figure6SkimRNN.png|center|1000px]]<br />
<br />
The runtime benchmarks for LSTM and Skim-LSTM, which are used to estimate the speed-up of Skim-LSTM-based models in the experiments, are also discussed. A CPU-based benchmark was taken as the default, since CPU runtime correlates directly with the number of float operations performed per second. As mentioned previously, the speed-up results in Table 2 (as well as Figure 7) are benchmarked using Python (NumPy), rather than frameworks such as TensorFlow or PyTorch.<br />
<br />
Figure 7 shows the relative speed gain of Skim-LSTM compared to standard LSTM with varying hidden state size and skim rate, using NumPy with inference run on a single CPU thread. The ratio between the float-operation counts (Flop-R) of LSTM and Skim-LSTM is plotted as a theoretical upper bound on the speed gain on CPUs. There is a gap between the actual and the theoretical gain in speed, and the gap grows with the overhead of the framework and with parallelization. The gap also shrinks as the hidden state size increases, because the overhead becomes negligible relative to very large matrix operations. This indicates that Skim-RNN provides greater benefits for RNNs with larger hidden state sizes.<br />
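The theoretical upper bound on the speed gain can be estimated from float-operation counts. The sketch below is our own simplification: it assumes one LSTM step costs on the order of <math>4d(d+i)</math> multiply-accumulates (four gates, each a <math>d \times (d+i)</math> matrix-vector product) and that a skimmed step only pays the cost of the small <math>d'</math>-dimensional cell.<br />

```python
def lstm_flops(d, i):
    """Approximate float ops for one LSTM step: 4 gates, each a
    (d x (d + i)) matrix-vector product; constant factors omitted."""
    return 4 * d * (d + i)

def theoretical_speedup(d, d_small, i, skim_rate):
    """Upper bound on the speed gain of Skim-LSTM over standard LSTM.

    skim_rate -- fraction of tokens processed by the small RNN only.
    Skimmed steps cost lstm_flops(d_small, i); fully-read steps cost
    the same as the standard LSTM.
    """
    full = lstm_flops(d, i)
    skim = lstm_flops(d_small, i)
    avg = skim_rate * skim + (1 - skim_rate) * full
    return full / avg
```

For example, with <math>d = 100</math>, <math>d' = 10</math>, input size 100, and a 90% skim rate, the bound is roughly 6.7x; actual CPU gains are lower because of framework overhead, as the figure shows.<br />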
<br />
==== Latency ====<br />
<br />
A modern GPU has much higher throughput than a CPU due to parallel processing, but for small networks the CPU often has lower latency. Comparing NumPy on<br />
CPU against TensorFlow on GPU (Titan X), the former has 1.5 times lower latency (75 vs. 110 microseconds per token) for an LSTM with <math>d = 100</math>. This means that combining Skim-RNN with a CPU-based framework can lead to substantially lower latency than a GPU. For instance, Skim-RNN on CPU for IMDb has 4.5x lower latency than the GPU, requiring only 29 microseconds per token on average.<br />
<br />
== Results ==<br />
<br />
The results clearly indicate that the Skim-RNN model is well suited to general reading tasks, including classification and question answering. While the tables show that small losses in accuracy occasionally resulted at specific parameter settings, these losses were minor and acceptable given the improvement in runtime.<br />
<br />
An important advantage of Skim-RNN is that the skim rate (and thus computational cost) can be dynamically controlled at inference time by adjusting the threshold for<br />
‘skim’ decision probability <math>{\bf p}^1_t</math>. Figure 5 shows the trade-off between the accuracy and computational cost for two settings, confirming the importance of skimming (<math>d' > 0</math>) compared to skipping (<math>d' = 0</math>).<br />
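This inference-time control can be sketched as follows (an illustrative helper of our own, not the authors' code): the model skims token <math>t</math> whenever the skim decision probability exceeds a threshold, so raising the threshold lowers the skim rate and raises the computational cost.<br />

```python
def skim_decisions(skim_probs, threshold=0.5):
    """Decide per-token skim/read from skim probabilities p^1_t.

    Returns a list of booleans (True = skim) and the resulting skim
    rate. Adjusting `threshold` at inference time trades computational
    cost against accuracy without retraining.
    """
    decisions = [p > threshold for p in skim_probs]
    rate = sum(decisions) / len(decisions)
    return decisions, rate
```

For instance, with per-token probabilities [0.9, 0.2, 0.6, 0.8], a 0.5 threshold gives a 75% skim rate, while a 0.85 threshold drops it to 25%.<br />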
<br />
Figure 6 shows that the model does not skim when the input seems to be relevant to answering the question, which was as expected by the design of the model. In addition, the LSTM in second layer skims more than that in the first layer mainly because the second layer is more confident about the importance of each token.<br />
<br />
== Conclusion ==<br />
<br />
A Skim-RNN can offer better latency results on a CPU compared to a standard RNN on a GPU, with lower computational cost, as demonstrated through the results of this study. Future work (as stated by the authors) involves using Skim-RNN for applications that require much higher hidden state size, such as video understanding, and using multiple small RNN cells for varying degrees of skimming.<br />
<br />
== References ==<br />
<br />
[1] Marcel Adam Just and Patricia A. Carpenter. The Psychology of Reading and Language Comprehension. 1987.<br />
<br />
[2] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. In ICLR, 2017.<br />
<br />
[3] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.<br />
<br />
[4] Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In EMNLP, 2014.<br />
<br />
[5] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.<br />
<br />
[6] Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention flow for machine comprehension. In ICLR, 2017a.<br />
<br />
[7] Yacine Jernite, Edouard Grave, Armand Joulin, and Tomas Mikolov. Variable computation in recurrent neural networks. In ICLR, 2017.<br />
<br />
<br />
== Critiques ==<br />
<br />
1. Skim-RNN avoids running the full RNN on unimportant words, which increases speed only in fairly particular circumstances (i.e., small networks). The extra model complexity adds overhead while "optimizing" efficiency, and some accuracy is sacrificed in the process. The approach targets quite specific situations (classification and question answering) and is compared only against baseline LSTM models; the case would be more persuasive if the model were also compared with state-of-the-art neural network models.</div>Gtompkinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Bsharman&diff=43101User:Bsharman2020-11-02T15:26:52Z<p>Gtompkin: </p>
<hr />
<div>'''Risk prediction in life insurance industry using supervised learning algorithms'''<br />
<br />
'''Presented By'''<br />
<br />
Bharat Sharman, Dylan Li, Leonie Lu, Mingdao Li<br />
<br />
'''Introduction'''<br />
<br />
----<br />
<br />
Risk assessment lies at the core of the life insurance industry. It is extremely important for a life insurance company to assess the risk of an application accurately, so that applications with genuinely low risk are accepted and those with genuinely high risk are rejected. Otherwise, individuals with unacceptably high risk profiles will be issued policies, and when they pass away the company will face large losses due to high insurance payouts. Such a situation is called ‘adverse selection’: individuals who are most likely to suffer losses buy insurance, those who are unlikely to suffer losses do not, and the company suffers losses as a result.<br />
<br />
Traditionally, the process of underwriting (deciding whether or not to insure the life of an individual) has been done using actuarial calculations: actuaries group customers according to their estimated levels of risk determined from historical data (Cummins J, 2013). However, these conventional techniques are time consuming, and it is not uncommon for policy issuance to take a month. They are also expensive, as many manual processes need to be executed. <br />
<br />
Predictive Analysis has emerged as a useful technique to streamline the underwriting process to reduce the time of Policy issuance and to improve the accuracy of risk prediction. In this paper, the authors use data from Prudential Life Insurance company and investigate the most appropriate data extraction method and the most appropriate algorithm to assess risk. <br />
<br />
'''Literature Review'''<br />
<br />
----<br />
<br />
<br />
Before a life insurance company issues a policy, it must execute a series of underwriting-related tasks (Mishr, 2016). These tasks involve gathering extensive information about the applicant: the insurer has to analyze the employment, medical, family, and insurance histories of the applicant and factor all of them into a complicated series of calculations to determine the applicant's risk rating. On the basis of this risk rating, premiums are calculated (Prince, 2016).<br />
<br />
In a competitive marketplace, customers need policies to be issued quickly, and long wait times can lead them to switch to other providers (Chen, 2016). In addition, data gathering and analysis can be expensive: the insurance company bears the expense of medical examinations, and if a policy lapses, the insurer has to absorb all of these costs (J Carson, 2017). If the underwriting process uses predictive analytics, the costs and time associated with many of these processes can be reduced via streamlining. <br />
<br />
'''Methods and Techniques'''<br />
<br />
----<br />
<br />
<br />
In Figure 1, the process flow of the analytics approach has been depicted. These stages will now be described in the following sections.<br />
<br />
[[File:Data_Analytics_Process_Flow.PNG]]<br />
<br />
'''Description of the Dataset'''<br />
<br />
----<br />
<br />
<br />
The data is obtained from the Kaggle competition hosted by the Prudential Life Insurance company. It has 59,381 applications with 128 attributes, which include continuous, discrete, and categorical variables. <br />
The data attributes, their types, and their descriptions are shown in Table 1 below:<br />
<br />
[[File:Data Attributes Types and Description.png]]<br />
<br />
'''Data Pre-Processing'''<br />
<br />
----<br />
<br />
<br />
In the data pre-processing step, missing values are either imputed or the corresponding entries are dropped, and some attributes are transformed into a different form to make subsequent processing easier. This decision is made after determining the mechanism of missingness: whether the data is Missing Completely at Random (MCAR), Missing at Random (MAR), or Missing Not at Random (MNAR). <br />
<br />
'''Dimensionality Reduction''' <br />
<br />
In this paper, there are two methods that have been used for dimensionality reduction – <br />
<br />
1. Correlation-based Feature Selection (CFS): This is a feature selection method in which a subset of the original features is selected. The algorithm selects features from the dataset that are highly correlated with the output but not correlated with each other. The user does not need to specify the number of features to be selected. The correlation values are calculated based on measures such as Pearson’s coefficient, minimum description length, symmetrical uncertainty, and relief. <br />
<br />
2. Principal Component Analysis (PCA): PCA is a feature extraction method that transforms the existing features into a new set of features such that the correlation between them is zero and the transformed features explain the maximum variability in the data. <br />
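The CFS criterion (features correlated with the target but not with each other) can be sketched with a greedy Pearson-correlation filter. This is an illustrative simplification of ours, not WEKA's exact CfsSubsetEval algorithm, and the two thresholds are hypothetical defaults.<br />

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def greedy_cfs(features, target, min_target_corr=0.3, max_mutual_corr=0.8):
    """Keep features strongly correlated with the target and weakly
    correlated with already-selected features (greedy approximation)."""
    selected = []
    for name, col in features.items():
        if abs(pearson(col, target)) < min_target_corr:
            continue  # too weakly related to the output
        if all(abs(pearson(col, features[s])) < max_mutual_corr for s in selected):
            selected.append(name)  # not redundant with the chosen subset
    return selected
```

A redundant copy of an already-selected feature is rejected by the mutual-correlation check, while a feature unrelated to the target is rejected by the first filter.<br />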
<br />
<br />
'''Supervised Learning Algorithms'''<br />
<br />
----<br />
<br />
<br />
The four Algorithms that have been used in this paper are the following:<br />
<br />
1. Multiple Linear Regression: In MLR, the relationship between the dependent variable and two or more independent variables is modeled by fitting a linear equation. The model parameters are calculated by minimizing the sum of squared errors. The significance of the variables is determined by tests such as the F-test and p-values. <br />
<br />
2.REPTree: REPTree stands for reduced error pruning tree. It can build both classification and regression trees, depending on the type of the response variable. In this case, it uses regression tree logic and creates many trees across several iterations. This algorithm develops these trees based on the principles of information gain and variance reduction. At the time of pruning the tree, the algorithm uses the lowest mean square error to select the best tree. <br />
<br />
3. Random Tree: A random tree selects a subset of the attributes at each node in the decision tree and builds the tree based on a random selection of data as well as attributes (a random forest is an ensemble of such trees). Random Tree does not prune; instead, it estimates class probabilities based on a hold-out set.<br />
<br />
4.Artificial Neural Network: In a neural network, the inputs are transformed into outputs via a series of layered units where each of these units transforms the input received by it via a function into an output that gets further transmitted to units down the line. The weights that are used to weigh the inputs are improved after each iteration via a method called backpropagation in which errors propagate backward in the network and are used to update the weights to make the computed output closer to the actual output.<br />
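Of the four algorithms, MLR is the simplest to make concrete. The sketch below is our own pure-Python illustration of fitting MLR by solving the normal equations, not the tooling used in the paper.<br />

```python
def fit_mlr(X, y):
    """Fit y = b0 + b1*x1 + ... by solving the normal equations
    (X'X) b = X'y with Gaussian elimination (no regularization)."""
    A = [[1.0] + list(row) for row in X]  # prepend an intercept column
    k = len(A[0])
    xtx = [[sum(r[i] * r[j] for r in A) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(A, y)) for i in range(k)]
    # Gaussian elimination with partial pivoting
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(xtx[r][col]))
        xtx[col], xtx[piv] = xtx[piv], xtx[col]
        xty[col], xty[piv] = xty[piv], xty[col]
        for r in range(col + 1, k):
            f = xtx[r][col] / xtx[col][col]
            for c in range(col, k):
                xtx[r][c] -= f * xtx[col][c]
            xty[r] -= f * xty[col]
    # Back substitution
    beta = [0.0] * k
    for r in range(k - 1, -1, -1):
        beta[r] = (xty[r] - sum(xtx[r][c] * beta[c] for c in range(r + 1, k))) / xtx[r][r]
    return beta  # [intercept, coefficient for x1, coefficient for x2, ...]
```

On data generated exactly by a linear model, the fit recovers the true intercept and coefficients.<br />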
<br />
'''Experiments and Results'''<br />
<br />
----<br />
<br />
<br />
'''Missing Data Mechanism'''<br />
<br />
Attributes where more than 30% of the data was missing were dropped from the analysis. The data was tested for Missing Completely at Random (MCAR) using Little's test. The null hypothesis that the missing data was completely random had a p-value of 0, so MCAR was rejected. Then all the variables were plotted to check how many missing values they had, and the results are shown in the figure below:<br />
<br />
[[File:Missing Value Plot of Training Data.png]]<br />
<br />
In the figure above, the variables with the most missing values are plotted at the top of the y-axis and those with the fewest at the bottom. There does not seem to be a pattern to the missing values, and therefore they are assumed to be Missing at Random (MAR). <br />
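The 30% rule applied above can be sketched as a simple column filter (an illustration of ours, with `None` marking a missing value):<br />

```python
def drop_sparse_attributes(dataset, max_missing=0.3):
    """Remove attributes (columns) whose fraction of missing values
    (None) exceeds max_missing, mirroring the 30% rule used here."""
    kept = {}
    for name, column in dataset.items():
        missing = sum(v is None for v in column) / len(column)
        if missing <= max_missing:
            kept[name] = column
    return kept
```

A column that is 75% missing is dropped, while one that is 25% missing is kept for imputation.<br />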
<br />
'''Missing Data Imputation'''<br />
<br />
Assuming that missing data follows an MAR pattern, multiple imputation is used as a technique to fill in the values of missing data. The steps involved in Multiple Imputation are the following: <br />
<br />
Imputation: Imputation of the missing values is done over several steps and this results in a number of complete data sets. Imputation is done via a predictive model like linear regression to predict these missing values based on other variables in the data set.<br />
<br />
Analysis: The complete data sets that are formed are analyzed and parameter estimates and standard errors are calculated.<br />
<br />
Pooling: The analysis results are then integrated to form a final data set that is then used for further analysis.<br />
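The impute-analyze-pool cycle can be sketched in miniature. This is a deliberately simplified illustration of ours: it draws imputed values from the observed values (hot-deck style) rather than fitting a regression model, and it pools only the point estimate.<br />

```python
import random
import statistics

def multiple_imputation_mean(values, m=5, seed=0):
    """Simplified multiple imputation of the mean of one variable:
    fill each missing value (None) with a randomly drawn observed
    value, estimate the mean of each completed dataset, then pool
    the m estimates by averaging."""
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    estimates = []
    for _ in range(m):
        completed = [v if v is not None else rng.choice(observed) for v in values]
        estimates.append(statistics.mean(completed))  # analysis step
    return statistics.mean(estimates)                 # pooling step
```

A full implementation would also pool the standard errors (Rubin's rules) and use a predictive model, as described above, for the imputation step.<br />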
<br />
'''Comparison of Feature Selection and Feature Extraction'''<br />
<br />
The Correlation-based Feature Selection (CFS) method was performed using the Waikato Environment for Knowledge Analysis (WEKA), implemented with a BestFirst search method on a CfsSubsetEval attribute evaluator; 33 variables were selected out of the total of 117 features. <br />
PCA was implemented via a RankerSearch method using a Principal Components attribute evaluator. Out of the 117 features, those with a standard deviation of more than 0.5 times that of the first principal component were selected, resulting in 20 features for further analysis. <br />
After dimensionality reduction, the reduced data sets were exported and used for building prediction models with the four machine learning algorithms discussed before: REPTree, Multiple Linear Regression, Random Tree, and ANNs. The results are shown in the table below: <br />
<br />
[[File:Comparison of Results between CFS and PCA.png]]<br />
<br />
For CFS, the REPTree model had the lowest MAE and RMSE; for PCA, the Multiple Linear Regression model had the lowest MAE and RMSE. Overall, Multiple Linear Regression and REPTree are the two best models for this dataset in terms of error rates. For dimensionality reduction, CFS appears to be a better method than PCA on this dataset, as the MAE and RMSE values are lower for all ML methods except ANNs.<br />
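The two error measures used in this comparison can be stated precisely; a minimal sketch:<br />

```python
import math

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error; penalizes large errors more than MAE."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))
```

RMSE is always at least as large as MAE on the same predictions, which is why both are reported when comparing models.<br />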
<br />
'''Conclusion and Further Work'''<br />
<br />
----<br />
<br />
<br />
Predictive analytics in the life insurance industry is enabling faster customer service and lower costs by helping automate the process of underwriting. <br />
In this study, the authors analyzed data obtained from Prudential Life Insurance to predict risk scores via supervised machine learning algorithms. The data was first pre-processed to replace missing values, and attributes with more than 30% missing data were eliminated from the analysis. <br />
Two methods of dimensionality reduction, CFS and PCA, were used, reducing the number of attributes used for further analysis to 33 and 20 respectively. The machine learning algorithms implemented were REPTree, Random Tree, Multiple Linear Regression, and Artificial Neural Networks. Model validation was performed via ten-fold cross validation, and model performance was evaluated using the MAE and RMSE measures. <br />
Using the PCA method, Multiple Linear Regression showed the best results, with MAE and RMSE values of 1.64 and 2.06 respectively. With CFS, REPTree had the highest accuracy, with MAE and RMSE values of 1.52 and 2.02 respectively. <br />
Further work can be directed towards handling all the variables rather than deleting those where more than 30% of the values are missing. Customer segmentation, i.e. grouping customers based on their profiles, can help companies come up with customized policies for each group; this can be done via unsupervised algorithms such as clustering. Work can also be done to make the models more explainable, especially when PCA and ANNs are used to analyze the data. Indirect data about the prospective applicant, such as driving behavior and education record, could also be gathered to see whether these attributes contribute to better risk profiling than the already available data.<br />
<br />
<br />
'''Critiques'''<br />
<br />
----<br />
Since the project built multiple models and used various methods to evaluate them, the predictions could potentially be ensembled, for example by averaging the results of the different models, to achieve better accuracy. Another method is model stacking, in which the output of one model is fed as input into another model for better results. These approaches have some major drawbacks, however: sometimes the result is affected negatively (i.e., the RMSE increases), and if the improvement is not prominent, they make the process much more complex and cost time and effort. In a research setting, stacking and ensembling are definitely worth a try; in a real-life business case, it is more of a trade-off between accuracy and effort/cost. <br />
<br />
<br />
'''References'''<br />
<br />
----<br />
<br />
<br />
Chen, T. (2016). Corporate reputation and financial performance of Life Insurers. Geneva Papers Risk Insur Issues Pract, 378-397.<br />
<br />
Cummins J, S. B. (2013). Risk classification in Life Insurance. Springer 1st Edition.<br />
<br />
J Carson, C. E. (2017). Sunk costs and screening: two-part tariffs in life insurance. SSRN Electron J, 1-26.<br />
<br />
Jayabalan, N. B. (2018). Risk prediction in life insurance industry using supervised learning algorithms. Complex & Intelligent Systems, 145-154.<br />
<br />
Mishr, K. (2016). Fundamentals of life insurance theories and applications. PHI Learning Pvt Ltd.<br />
<br />
Prince, A. (2016). Tantamount to fraud? Exploring non-disclosure of genetic information in life insurance applications as grounds for policy recession. Health Matrix, 255-307.</div>Gtompkinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=F21-STAT_441/841_CM_763-Proposal&diff=42779F21-STAT 441/841 CM 763-Proposal2020-10-14T00:17:33Z<p>Gtompkin: </p>
<hr />
<div>Use this format (Don’t remove Project 0)<br />
<br />
Project # 0 Group members:<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Title: Making a String Telephone<br />
<br />
Description: We use paper cups to make a string phone and talk with friends while learning about sound waves with this science project. (Explain your project in one or two paragraphs).<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 1 Group members:'''<br />
<br />
Song, Quinn<br />
<br />
Loh, William<br />
<br />
Bai, Junyue<br />
<br />
Choi, Phoebe<br />
<br />
'''Title:''' APTOS 2019 Blindness Detection<br />
<br />
'''Description:'''<br />
<br />
Our team chose the APTOS 2019 Blindness Detection Challenge from Kaggle. The goal of this challenge is to build a machine learning model that detects diabetic retinopathy by screening retina images.<br />
<br />
Millions of people suffer from diabetic retinopathy, the leading cause of blindness among working-aged adults. It is caused by damage to the blood vessels of the light-sensitive tissue at the back of the eye (retina). In rural areas where medical screening is difficult to conduct, it is challenging to detect the disease efficiently. Aravind Eye Hospital hopes to utilize machine learning techniques to gain the ability to automatically screen images for disease and provide information on how severe the condition may be.<br />
<br />
Our team plans to solve this problem by applying our knowledge in image processing and classification.<br />
<br />
<br />
----<br />
<br />
'''Project # 2 Group members:'''<br />
<br />
Li, Dylan<br />
<br />
Li, Mingdao<br />
<br />
Lu, Leonie<br />
<br />
Sharman, Bharat<br />
<br />
'''Title:''' Risk prediction in life insurance industry using supervised learning algorithms<br />
<br />
'''Description:'''<br />
<br />
In this project, we aim to replicate and possibly improve upon the work of Jayabalan et al. in their paper “Risk prediction in life insurance industry using supervised learning algorithms”. We will be using the Prudential Life Insurance dataset that the authors have used and have shared with us. We will pre-process the data to replace missing values, perform feature selection using CFS and feature reduction using PCA, and use the processed data to perform classification via four algorithms: Neural Networks, Random Tree, REPTree, and Multiple Linear Regression. We will compare the performance of these algorithms using the MAE and RMSE metrics and come up with visualizations that can explain the results easily, even to a non-quantitative audience. <br />
<br />
Our goal in this project is to apply the algorithms that we learned in class to an industry dataset and come up with results that can aid better, data-driven decision making.<br />
<br />
----<br />
<br />
'''Project # 3 Group members:'''<br />
<br />
Parco, Russel<br />
<br />
Sun, Scholar<br />
<br />
Yao, Jacky<br />
<br />
Zhang, Daniel<br />
<br />
'''Title:''' Lyft Motion Prediction for Autonomous Vehicles<br />
<br />
'''Description:''' <br />
<br />
Our team has decided to participate in the Lyft Motion Prediction for Autonomous Vehicles Kaggle competition. The aim of this competition is to build a model which given a set of objects on the road (pedestrians, other cars, etc), predict the future movement of these objects.<br />
<br />
Autonomous vehicles (AVs) are expected to dramatically redefine the future of transportation. However, there are still significant engineering challenges to be solved before one can fully realize the benefits of self-driving cars. One such challenge is building models that reliably predict the movement of traffic agents around the AV, such as cars, cyclists, and pedestrians.<br />
<br />
Our aim is to apply classification techniques learned in class to optimally predict how these objects move.<br />
<br />
----<br />
<br />
'''Project # 4 Group members:'''<br />
<br />
Chow, Jonathan<br />
<br />
Dharani, Nyle<br />
<br />
Nasirov, Ildar<br />
<br />
'''Title:''' Classification with Abstinence<br />
<br />
'''Description:''' <br />
<br />
We seek to implement the algorithm described in [https://papers.nips.cc/paper/9247-deep-gamblers-learning-to-abstain-with-portfolio-theory.pdf Deep Gamblers: Learning to Abstain with Portfolio Theory]. The paper describes augmenting classification problems to include the option of abstaining from making a prediction when confidence is low.<br />
<br />
Medical imaging diagnostics is a field in which classification could assist professionals and improve life expectancy for patients through increased accuracy. However, there are also severe consequences to incorrect predictions. As such, we also hope to apply the algorithm implemented to the classification of medical images, specifically instances of normal and pneumonia [https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia? chest x-rays]. <br />
<br />
----<br />
<br />
'''Project # 5 Group members:'''<br />
<br />
Jones, Hayden<br />
<br />
Leung, Michael<br />
<br />
Haque, Bushra<br />
<br />
Mustatea, Cristian<br />
<br />
'''Title:''' Combine Convolution with Recurrent Networks for Text Classification<br />
<br />
'''Description:''' <br />
<br />
Our team chose to reproduce the paper [https://arxiv.org/pdf/2006.15795.pdf Combine Convolution with Recurrent Networks for Text Classification] on Arxiv. The goal of this paper is to combine CNN and RNN architectures in a way that more flexibly combines the output of both architectures other than simple concatenation through the use of a “neural tensor layer” for the purpose of improving at the task of text classification. In particular, the paper claims that their novel architecture excels at the following types of text classification: sentiment analysis, news categorization, and topical classification. Our team plans to recreate this paper by working in pairs of 2, one pair to implement the CNN pipeline and the other pair to implement the RNN pipeline. We will be working with Tensorflow 2, Google Collab, and reproducing the paper’s experimental results with training on the same 6 publicly available datasets found in the paper.<br />
<br />
----<br />
<br />
'''Project # 6 Group members:'''<br />
<br />
Chin, Ruixian<br />
<br />
Ong, Jason<br />
<br />
Chiew, Wen Cheen<br />
<br />
Tan, Yan Kai<br />
<br />
'''Title:''' Mechanisms of Action (MoA) Prediction<br />
<br />
'''Description:''' <br />
<br />
Our team chose to participate in the Kaggle research challenge "Mechanisms of Action (MoA) Prediction". The competition is a project within the Broad Institute of MIT and Harvard; together with the Laboratory for Innovation Science at Harvard (LISH) and the NIH Common Fund's Library of Integrated Network-Based Cellular Signatures (LINCS), it presents this challenge with the goal of advancing drug development through improvements to MoA prediction algorithms.<br />
----<br />
<br />
'''Project # 7 Group members:'''<br />
<br />
Ren, Haotian <br />
<br />
Cheung, Ian Long Yat<br />
<br />
Hussain, Swaleh <br />
<br />
Zahid, Bin, Haris <br />
<br />
'''Title:''' Transaction Fraud Detection <br />
<br />
'''Description:''' <br />
<br />
Protecting people from fraudulent transactions is an important topic for all banks and internet security companies. This Kaggle project is based on the dataset from IEEE Computational Intelligence Society (IEEE-CIS). Our objective is to build a more efficient model in order to recognize each fraud transaction with a higher accuracy and higher speed.<br />
----<br />
<br />
'''Project # 8 Group members:'''<br />
<br />
ZiJie, Jiang<br />
<br />
Yawen, Wang<br />
<br />
DanMeng, Cui<br />
<br />
MingKang, Jiang<br />
<br />
'''Title:''' Lyft Motion Prediction for Autonomous Vehicles <br />
<br />
'''Description:'''<br />
<br />
Our team chose to participate in the Kaggle challenge "Lyft Motion Prediction for Autonomous Vehicles". We will apply our data science skills to build motion prediction models for self-driving vehicles. The model will predict the movement of traffic agents around the AV, such as cars, cyclists, and pedestrians. The goal of this competition is to predict the trajectories of these traffic participants.<br />
<br />
----------------------------------------------------------------------<br />
<br />
<br />
'''Project # 9 Group members:'''<br />
<br />
Banno, Dion <br />
<br />
Battista, Joseph<br />
<br />
Kahn, Solomon <br />
<br />
'''Title:''' Increasing Spotify user engagement through predictive personalization<br />
<br />
'''Description:''' <br />
<br />
Our project is an application of classification to the domain of predictive personalization. The goal of the project is to increase Spotify user engagement through data-driven methods. Given a set of users’ demographic data, listening preferences and behaviour, our goal is to build a recommendation system that suggests new songs to users. From a potential pool of songs to suggest, the final song recommendations will be driven by a classification algorithm that measures a given user’s propensity to like a song. We plan on leveraging the Spotify Web API to gather data about songs and collecting user data from consenting peers.<br />
<br />
<br />
-----------------------------------------------------------------------<br />
<br />
'''Project # 10 Group members:'''<br />
<br />
Qing, Guo <br />
<br />
Wang, Yuanxin<br />
<br />
James, Ni<br />
<br />
Xueguang, Ma<br />
<br />
'''Title:''' Mechanisms of Action (MoA) Prediction<br />
<br />
'''Description:''' <br />
<br />
Our team has decided to participate in the Mechanisms of Action (MoA) Prediction Kaggle competition. This is a challenge with the goal of advancing drug development through improvements to MoA prediction algorithms.<br />
Our team plans to develop an algorithm that predicts a compound’s MoA given its cellular signature; along the way, our goal is to apply the various algorithms taught in this course.<br />
<br />
<br />
-----------------------------------------------------------------------<br />
<br />
'''Project # 11 Group members:'''<br />
<br />
Yang, Jiwon <br />
<br />
Mahdi, Anas<br />
<br />
Thibault, Will<br />
<br />
Lau, Jan<br />
<br />
'''Title:''' Application of classification in human fatigue analysis<br />
<br />
'''Description:''' <br />
<br />
The goal of this project is to classify different levels of fatigue based on motion capture (Vicon) and force plate data. First, we plan to have 4 to 6 participants perform squats (with or without weights) and rate themselves on a fatigue scale, with each participant doing at least 50 to 100 reps. We will collect data with EMG, IMU, force plates, and Vicon. While the participants are squatting, we will ask them about their fatigue level and compare their feedback against the fatigue level recorded by EMG. The fatigue level will be on a scale of 1 to 10 (1 being not fatigued at all and 10 being unable to continue). Once the data is collected, we will classify the motion capture and force plate data into the different levels of fatigue.<br />
<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 12 Group members:'''<br />
<br />
Xiaolan Xu, <br />
<br />
Robin Wen, <br />
<br />
Yue Weng, <br />
<br />
Beizhen Chang<br />
<br />
'''Title:''' Identification (Classification) of Submillimetre Galaxies Based on Multiwavelength Data in Astronomy<br />
<br />
'''Description:''' <br />
<br />
Identifying the counterparts of submillimetre galaxies (SMGs) in multiwavelength images is important to the study of galaxy evolution in astronomy. However, obtaining a statistically significant sample of robust associations is very challenging because of the poor angular resolution of single-dish submillimetre facilities: we cannot tell which galaxy among a group of possible candidates is actually responsible for the submillimetre emission. Recently, a labelled dataset was obtained from ALMA, a millimetre/submillimetre telescope array with sufficient resolution to pin down the exact source of the emission. However, applying such an array to a large fraction of the sky is not feasible, so it is of practical interest to develop algorithms that identify SMGs from the other available data. With this newly labelled dataset from ALMA, it is possible to test and develop new algorithms and apply them to unlabelled data to detect submillimetre galaxies.<br />
<br />
In our work, we primarily build on Liu et al. (https://arxiv.org/abs/1901.09594), who tested a set of standard classification algorithms on the dataset. We aim to first reproduce their work and test other classification algorithms from a more statistics-centered perspective. Next, we hope to extend their work in one or more of the following directions: (1) incorporating other relevant features to augment the dimensions of the available dataset for a better classification rate; (2) taking measurement error into account in the classification algorithms, possibly via a Bayesian approach (all features in astronomy datasets come from actual physical measurements, which carry error bars, but it is not clear how to incorporate this error into the classification task); (3) combining traditional astronomy approaches with algorithms from ML.<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 13 Group members:'''<br />
<br />
<br />
Zihui (Betty) Qin,<br />
<br />
Wenqi (Maggie) Zhao,<br />
<br />
Muyuan Yang,<br />
<br />
Amartya (Marty) Mukherjee,<br />
<br />
'''Title:''' Insider Trading Roles Classification Prediction on United States conventional stock or non-derivative transaction<br />
<br />
'''Description:'''<br />
<br />
Background (why we were interested in classifying based on insiders): <br />
The United States is home to some of the most heavily traded financial markets in the world. The dataset captures all insider activities as reported on SEC (U.S. Securities and Exchange Commission) forms 3, 4, 5, and 144. We believe that using variables such as transaction date, security type, and transaction amount, we can predict the role code for a new transaction. We chose this prediction target because the role of the insider gives investors signals of potential internal activity and private information. This is crucial for investors trying to detect important market signals in insider trading activity so that they can benefit from the market. <br />
<br />
Goal: To classify the role of an insider in a company based on the data of their trades.<br />
<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 14 Group members:'''<br />
<br />
Jung, Kyle<br />
<br />
Kim, Dae Hyun<br />
<br />
Lee, Stan<br />
<br />
Lim, Seokho<br />
<br />
'''Title:''' Mechanisms of Action (MoA) Prediction Competition<br />
<br />
'''Description:''' The main objective of this Kaggle competition is to develop an algorithm that predicts a compound's MoA given its cellular signature, helping scientists advance the drug discovery process. Our plan is to apply concepts and algorithms learned in STAT 441 to this multi-label classification task. Along the way, our team will acquire the biological background needed to inform our classification approach. https://www.kaggle.com/c/lish-moa<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 15 Group Members:'''<br />
<br />
Li, Evan<br />
<br />
Abuaisha, Karam<br />
<br />
Vadivelu, Nicholas<br />
<br />
Pu, Jason<br />
<br />
'''Title:''' Predict Students Answering Ability Kaggle Competition<br />
<br />
'''Description:'''<br />
<br />
https://www.kaggle.com/c/riiid-test-answer-prediction<br />
We plan on tackling this Kaggle competition, which revolves around classifying whether students will answer their next questions correctly. The data provided consists of each student’s historical performance, the performance of other students on the same question, metadata about the question itself, and more. The broader theme of the competition is tailoring education to a student’s ability via an AI tutor.<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 16 Group members:'''<br />
<br />
Hall, Matthew<br />
<br />
Chalaturnyk, Johnathan<br />
<br />
'''Title:''' Predicting CO and NOx emissions from gas turbines: novel data and a benchmark PEMS<br />
<br />
'''Description:'''<br />
<br />
Predictive emission monitoring systems (PEMS) are used in conjunction with measurement instruments to predict the amount of emissions produced by gas turbine engines. The implementation of such a system relies on the availability of proper measurements and ecological data points. We will attempt to adjust the novel PEMS implementation from this paper in the hope of improving the prediction of CO and NOx emission levels from the turbines. Using data points collected over the previous five years, we will apply a number of machine learning algorithms and discuss possible future research areas. Finally, we will compare our methods against the benchmark presented in the paper to measure the effectiveness of our solutions.<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 17 Group members:'''<br />
<br />
Yang, Junyi<br />
<br />
Wang, Jill Yu Chieh<br />
<br />
Wu, Yu Min<br />
<br />
Li, Calvin<br />
<br />
'''Title:''' Humpback Whale Identification<br />
<br />
'''Description:'''<br />
<br />
Our team will participate in the Kaggle challenge Humpback Whale Identification. The main objective is to build a multi-class classification model that identifies individual whales from images of their tails. There are over 3000 classes in total and 25361 training images, so the challenge is that each class has, on average, only about 8 training images. <br />
<br />
------------------------------------------------------------------------<br />
'''Project # 18 Group members:''' <br />
<br />
Lian, Jinjiang <br />
<br />
Zhu, Yisheng <br />
<br />
Huang, Mingzhe <br />
<br />
Hou, Jiawen <br />
<br />
'''Title:''' Mechanisms of Action (MoA) Prediction <br />
<br />
'''Description:''' <br />
<br />
The final project of our team is the ongoing Kaggle competition Mechanism of Action (MoA) Prediction. The goal is to improve MoA prediction algorithms to assist and advance drug development. MoA prediction helps scientists design more targeted drug molecules based on the biological mechanism of a disease, which could substantially shorten the drug development cycle. Here, MoA is studied by applying different drugs to human cells and analyzing the corresponding reactions; the dataset provides simultaneous measurements on 100 types of human cells across 5000 drugs. <br />
<br />
To tackle this competition, after data cleaning and feature engineering, we are going to try a selection of ML algorithms such as logistic regression, tree-based methods, and SVMs, and find the one that best completes the task. Depending on how we perform, we might utilize other techniques such as model ensembling or stacking.<br />
<br />
------------------------------------------------------------------------<br />
'''Project # 19 Group members:''' <br />
<br />
Fagan, Daniel <br />
<br />
Brooke, Cooper <br />
<br />
Perelman, Maya <br />
<br />
'''Title:''' Mechanisms of Action (MoA) Prediction (https://www.kaggle.com/c/lish-moa/overview/description)<br />
<br />
'''Description:''' <br />
<br />
For our final project, we will be competing in the Mechanisms of Action (MoA) Prediction Research Challenge on Kaggle. MoA refers to the description of the biological activity of a given molecule, and scientists have a specific interest in the MoA of molecules as it pertains to the advancement of drugs: under new frameworks, scientists are looking to develop molecules that can modulate protein targets associated with given diseases. Our task will be to analyze a dataset containing human cellular responses to more than 5,000 drugs and to label each response with one or more MoAs.<br />
<br />
For this competition, we plan to use various classification algorithms taught in STAT 441 followed by model validation techniques to ultimately select the most accurate model based on the logarithmic loss function which was specified by Kaggle.<br />
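On toy numbers, the logarithmic loss used for model selection can be sketched as follows (the targets and predictions below are made up; the actual Kaggle metric averages this loss column-wise over many MoA targets):<br />

```python
# Minimal sketch of binary log loss applied in a multi-label setting.
# All numbers are illustrative, not from the competition data.
import numpy as np

def log_loss(y_true, y_pred, eps=1e-15):
    p = np.clip(y_pred, eps, 1 - eps)        # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = np.array([[1, 0], [0, 1], [0, 0]])  # three samples, two MoA targets
y_pred = np.array([[0.9, 0.1], [0.2, 0.8], [0.1, 0.1]])
score = log_loss(y_true, y_pred)             # lower is better
```

Model selection then amounts to keeping the candidate with the smallest validation score.<br />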
<br />
------------------------------------------------------------------------<br />
'''Project # 20 Group members:''' <br />
Cheng, Leyan<br />
<br />
Dai, Mingyan<br />
<br />
Jiang, Daniel <br />
<br />
Huang, Jerry<br />
<br />
'''Title:''' Riiid! Answer Correctness Prediction<br />
<br />
'''Description:'''<br />
<br />
We will be competing in the Riiid! Kaggle Challenge. The challenge asks for algorithms for "Knowledge Tracing," the modeling of student knowledge over time, with the goal of accurately predicting how students will perform on future interactions.<br />
<br />
We plan on using the classification techniques and model validation techniques learned in the course in order to design an algorithm that can accurately predict the actions of students.<br />
<br />
------------------------------------------------------------------------<br />
'''Project # 21 Group members:''' <br />
<br />
Carson, Emilee<br />
<br />
Ellmen, Isaac<br />
<br />
Mohammadrezaei, Dorsa<br />
<br />
<br />
'''Title:''' Classifying SARS-CoV-2 region of origin based on DNA/RNA sequence<br />
<br />
'''Description:'''<br />
<br />
Determining the location of origin for a viral sequence is an important tool for epidemiological tracking. Knowing where a virus comes from allows epidemiologists to track how a virus is spreading. There are significant efforts to track the spread of SARS-CoV-2. As an RNA virus, SARS-CoV-2 mutates frequently. Most of these mutations carry negligible changes to the function of the virus but act as “barcodes” for specific strains. As the virus spreads in a region, it picks up mutations which allow researchers to identify which sequences correspond to which regions.<br />
<br />
The standard method for classifying viruses based on location is to:<br />
<br />
- Perform a multiple sequence alignment (MSA)<br />
<br />
- Build a phylogenetic tree of the MSA<br />
<br />
- Empirically determine which regions have which sections of the tree<br />
<br />
Phylogenetic trees are an excellent tool for tracking evolutionary changes over time but we wonder if there are better methods for classifying the region of origin for a virus using machine learning techniques.<br />
<br />
Our plan is to perform PCA on the MSA which is available through GISAID. We will determine an appropriate encoding for sequence alignments to vectors and map the aligned sequences onto a much lower dimensional space. We will then use LDA or QDA to classify points based on region (continent). We will also examine if the same technique works well for classifying sequences based on state of origin for samples from the United States. We may try other classification techniques such as logistic regression or neural nets. Finally, we know that projecting data to a small number of principal components and then projecting back to the original space can reduce noise in certain datasets. In the case of mutations, this might correspond to removing insignificant mutations. It is possible that there are certain mutations which induce functional changes in the virus which would be of greater medical interest. Our hope is that we could detect these using PCA.<br />
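The pipeline above can be sketched on toy data as follows (one-hot encoding is one common choice for aligned sequences and an assumption here; the real GISAID alignment is vastly larger, and the labels below are synthetic):<br />

```python
# Toy sketch: one-hot encode aligned sequences, reduce with PCA, classify
# region with LDA. Sequences and region labels are randomly generated.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
bases = "ACGT-"                                   # '-' marks an alignment gap
seqs = ["".join(rng.choice(list(bases), 30)) for _ in range(40)]
regions = np.array([0] * 20 + [1] * 20)           # two toy "continents"

def one_hot(seq):
    # Each aligned position becomes a length-5 indicator vector.
    return np.concatenate([(np.array(list(bases)) == c).astype(float) for c in seq])

X = np.stack([one_hot(s) for s in seqs])          # shape (40, 30 * 5)
X_low = PCA(n_components=10).fit_transform(X)     # much lower-dimensional space
lda = LinearDiscriminantAnalysis().fit(X_low, regions)
train_acc = lda.score(X_low, regions)
```
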
<br />
------------------------------------------------------------------------<br />
'''Project # 22 Group members:''' <br />
<br />
Chang, Luwen<br />
<br />
Yu, Qingyang<br />
<br />
Kong, Tao <br />
<br />
Sun, Tianrong<br />
<br />
'''Title:''' Riiid! Answer Correctness Prediction<br />
<br />
'''Description:'''<br />
<br />
For the final project, we chose the featured Kaggle Competition named Riiid! Answer Correctness Prediction. The purpose of this challenge is to build a machine learning model to predict the students' interaction performance. (https://www.kaggle.com/c/riiid-test-answer-prediction)<br />
<br />
We plan to use classification and regression techniques learned in this course to build the model and use area under ROC curve to evaluate our model.<br />
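To make the evaluation concrete, here is a minimal sketch of the area-under-ROC computation on made-up predictions (the numbers are illustrative only):<br />

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels (1 = answered correctly) and predicted probabilities.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.3, 0.7, 0.35, 0.2, 0.4, 0.8, 0.1])

# AUC is the probability that a randomly chosen correct answer is ranked
# above a randomly chosen incorrect one: 15 of the 16 positive/negative
# pairs are ordered correctly here, so the score is 15/16.
auc = roc_auc_score(y_true, y_prob)   # 0.9375
```
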
<br />
------------------------------------------------------------------------<br />
'''Project # 23 Group members:''' <br />
<br />
Han, Jihoon<br />
<br />
Vera De Casey<br />
<br />
Jawad Solaiman<br />
<br />
'''Title:''' Lyft Motion Prediction for Autonomous Vehicles<br />
<br />
'''Description:'''<br />
<br />
We are planning to compete in the Lyft Motion Prediction for Autonomous Vehicles Challenge on Kaggle. Our goal is to build a motion prediction model for a self-driving car using our machine learning knowledge and the provided training and testing data sets. The model will predict the motion of traffic agents around the car, such as cars, cyclists, and pedestrians. We are not sure whether we have to classify the agents into the three categories (cars, cyclists, pedestrians) ourselves; if so, we will start with the single-shot detector algorithm and improve from there.<br />
<br />
------------------------------------------------------------------------<br />
'''Project # 24 Group members:''' <br />
<br />
Guanting Pan<br />
<br />
Haocheng Chang <br />
<br />
Zaiwei Zhang<br />
<br />
'''Title:''' Reproducing result in Accelerated Stochastic Power Iteration<br />
<br />
'''Description:'''<br />
<br />
As our final project, we will reproduce the stochastic PCA algorithm designed by De Sa, He, Mitliagkas, Ré, and Xu, which accelerates the iteration complexity of power iteration. In doing so, we aim to achieve the final rate of 𝒪(1/sqrt(Δ)) reported in the paper. We also hope to explore the potential of applying such an acceleration method to other non-convex optimization problems, as mentioned in the original paper, if there is additional time to do so. Link to the paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6557638/pdf/nihms-993807.pdf<br />
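As a rough illustration of the acceleration idea (this is a deterministic sketch with a heavy-ball momentum term, not the authors' stochastic implementation; the momentum value is the textbook choice based on the second eigenvalue):<br />

```python
# Sketch: power iteration with an optional momentum term, the building block
# the accelerated stochastic method is based on. Not the paper's algorithm.
import numpy as np

rng = np.random.default_rng(0)

def power_iteration(A, beta=0.0, iters=200):
    """Power iteration x_{t+1} = A x_t - beta * x_{t-1}; beta = 0 is classical."""
    x_prev = np.zeros(A.shape[0])
    x = rng.standard_normal(A.shape[0])
    x /= np.linalg.norm(x)
    for _ in range(iters):
        x_new = A @ x - beta * x_prev
        n = np.linalg.norm(x_new)
        x, x_prev = x_new / n, x / n         # rescale both iterates identically
    return x

M = rng.standard_normal((20, 20))
A = M @ M.T                                  # symmetric PSD test matrix
w, V = np.linalg.eigh(A)                     # eigenvalues, ascending order
x_plain = power_iteration(A)                 # converges to the top eigenvector
x_mom = power_iteration(A, beta=w[-2] ** 2 / 4)  # momentum suggested by theory
```
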
<br />
------------------------------------------------------------------------<br />
'''Project # 25 Group members:''' <br />
<br />
Haoran Dong<br />
<br />
Mushi Wang<br />
<br />
Siyuan Qiu<br />
<br />
Yan Yu<br />
<br />
'''Title:''' Lyft Motion Prediction for Autonomous Vehicles<br />
<br />
'''Description:'''<br />
<br />
We want to be involved in the Kaggle Challenge "Lyft Motion Prediction for Autonomous Vehicles". The goal is to build a motion prediction model for the self-driving car by machine learning with the datasets they provided.<br />
<br />
------------------------------------------------------------------------<br />
'''Project # 26 Group members:''' <br />
<br />
Sangeeth Kalaichanthiran<br />
<br />
Evan Peters<br />
<br />
Cynthia Mou<br />
<br />
Yuxin Wang<br />
<br />
'''Title:''' Mechanisms of Action (MoA) Prediction<br />
<br />
'''Description:'''<br />
<br />
Our team chose the "Mechanisms of Action (MoA) Prediction" challenge on Kaggle. A mechanism of action (MoA) describes the biological response of human cells to a particular molecule (the drug). The goal is to develop an algorithm that can predict the biological response to a drug based on its similarities to other known drugs. <br />
<br />
Our team hopes to develop a superior algorithm by using our knowledge of supervised learning methods.<br />
<br />
------------------------------------------------------------------------<br />
'''Project # 27 Group members:''' <br />
<br />
Delaney Smith<br />
<br />
Mohammad Assem Mahmoud<br />
<br />
'''Title:''' Replicating "Electrocardiogram heartbeat classification based on a deep convolutional neural network and focal loss"<br />
<br />
'''Description:'''<br />
<br />
For our project, we intend to replicate and, hopefully, extend the work of Romdhane et al.'s 2020 paper "Electrocardiogram heartbeat classification based on a deep convolutional neural network and focal loss". In this paper, the authors develop a deep convolutional neural network that exploits a novel loss function, focal loss, to classify heartbeats into five arrhythmia categories (N, S, V, Q and F) based on the AAMI standard. The network was trained and tested on two ECG datasets, MIT-BIH and INCART, and achieved a 98.41% overall accuracy, a 98.38% overall F1-score, a 98.37% overall precision, and a 98.41% overall recall, which we intend to replicate. <br />
Interestingly, focal loss was implemented to prevent bias towards the larger class (normal heartbeats) without needing to augment the smaller class (diseased heartbeats); however, the authors did not establish which of the two approaches actually performs better. We therefore hope to extend their work by answering this question in our project.<br />
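For reference, the binary form of the focal loss can be sketched as below (the paper uses a multi-class variant; the labels and probabilities here are made up):<br />

```python
# Sketch of binary focal loss: down-weights easy, well-classified examples
# so rare classes are not drowned out by abundant normal beats.
import numpy as np

def focal_loss(y_true, p, gamma=2.0, alpha=0.25, eps=1e-12):
    """-alpha_t * (1 - p_t)^gamma * log(p_t), averaged over samples."""
    p_t = np.where(y_true == 1, p, 1 - p)        # probability of the true class
    a_t = np.where(y_true == 1, alpha, 1 - alpha)
    return float(-np.mean(a_t * (1 - p_t) ** gamma * np.log(p_t + eps)))

y = np.array([1, 1, 0, 0])
p = np.array([0.95, 0.6, 0.05, 0.4])             # predicted P(class = 1)
fl = focal_loss(y, p)
# An easy example (p_t = 0.95) is damped by (1 - 0.95)^2 = 0.0025, while a
# hard one (p_t = 0.6) keeps most of its weight.
```

With gamma = 0 the modulating factor disappears and the loss reduces to class-weighted cross-entropy.<br />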
------------------------------------------------------------------------<br />
'''Project # 28 Group members:''' <br />
<br />
Fang Yuqin<br />
<br />
Fu Rao<br />
<br />
Li Siqi<br />
<br />
Zhou Zeping<br />
<br />
'''Title:''' The Spectrum of the Fisher Information Matrix of a Single-Hidden-Layer Neural Network<br />
<br />
'''Description:'''<br />
Our group aims to dig deeper into single-hidden-layer neural networks, building on what we have learned in class. We will focus on Gaussian-distributed data and weights, for which expressions for the spectrum of the Fisher information matrix can be given in the limit of infinite width. We believe that applying this spectral analysis can improve the efficiency of first-order optimization methods. <br />
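As a simplified numerical illustration of the object of study (an assumption-laden sketch, not the paper's full Fisher matrix: only the output-layer block is shown, which for a linear-Gaussian output model reduces to the second moment of the hidden activations):<br />

```python
# Empirical spectrum of the output-layer Fisher block of a one-hidden-layer
# ReLU network with Gaussian inputs. Dimensions are arbitrary toy choices.
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 10, 50, 5000                       # input dim, hidden width, samples
W1 = rng.standard_normal((m, d)) / np.sqrt(d)

X = rng.standard_normal((n, d))              # Gaussian inputs, as assumed above
H = np.maximum(X @ W1.T, 0.0)                # hidden activations h = relu(W1 x)
F = H.T @ H / n                              # empirical Fisher block, shape (m, m)
spectrum = np.linalg.eigvalsh(F)[::-1]       # eigenvalues, largest first
```
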
------------------------------------------------------------------------<br />
'''Project # 29 Group members:''' <br />
<br />
Rui Gong<br />
<br />
Xuetong Wang<br />
<br />
Xinqi Ling<br />
<br />
Di Ma<br />
<br />
'''Title:''' Convolution Neural Network for Rainy day Prediction<br />
<br />
'''Description:'''<br />
<br />
Our project is an application of convolutional neural networks to rainy day prediction. The goal of our project is to predict whether tomorrow will be a rainy day using historical data from the past week together with indicators such as temperature. We are planning to get the past weather data via the Yahoo web API.<br />
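To illustrate the building block a convolutional network would learn many of, here is a single 1-D filter sliding over seven days of one indicator (the temperatures and filter weights below are made up):<br />

```python
# Toy illustration: one 1-D filter over a week of temperatures. Deep-learning
# "convolution" layers compute exactly this sliding cross-correlation.
import numpy as np

week = np.array([21.0, 19.5, 18.0, 17.0, 16.5, 15.0, 14.0])  # past 7 days (°C)
kernel = np.array([1.0, 0.0, -1.0])     # "temperature drop over three days" filter

# One output per window of 3 consecutive days (a "valid" sliding window).
feature_map = np.array([week[i:i + 3] @ kernel for i in range(len(week) - 2)])
cooling = bool(feature_map.mean() > 0)  # overall downward temperature trend
```

A trained network would learn many such filters over several indicators and feed their outputs to a final rain/no-rain classifier.<br />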
------------------------------------------------------------------------<br />
'''Project # 30 Group members:''' <br />
<br />
Jiabao Dong<br />
<br />
Jiaxiang Liu<br />
<br />
Siyuan Xia<br />
<br />
Yipeng Du<br />
<br />
'''Title:''' Privacy-Preserving Classification of Personal Text Messages with Secure Multi-Party Computation<br />
<br />
'''Description:'''<br />
We aim to replicate the work demonstrated in [https://papers.nips.cc/paper/8632-privacy-preserving-classification-of-personal-text-messages-with-secure-multi-party-computation.pdf Privacy-Preserving Classification of Personal Text Messages with Secure Multi-Party Computation]. <br />
<br />
Personal text classification has many useful applications such as mental health care and security surveillance, but also raises concerns about personal privacy. The method proposed in this paper is based on Secure Multiparty Computation (SMC) and avoids (un)intentional privacy violations. The method then extracts features from texts and classifies with logistic regression and tree ensembles. This paper claims to have proposed the first privacy-preserving (PP) solution for text classification that is provably secure.<br />
<br />
------------------------------------------------------------------------<br />
'''Project # 31 Group members:''' <br />
<br />
Tompkins, Grace<br />
<br />
Krikella, Tatiana<br />
<br />
'''Title:''' An application of Adapting Neural Networks for the Estimation of Treatment Effects (Shi, Blei, and Veitch 2019)<br />
<br />
'''Description:'''<br />
We will be using the methodology presented in "Adapting Neural Networks for the Estimation of Treatment Effects" by Claudia Shi, David M. Blei, and Victor Veitch and applying it to a new dataset and simulated data. The method estimates treatment effects from observational data via an architecture called "Dragonnet", which uses propensity scoring for estimation adjustment and targeted regularization. The method has been shown to outperform existing methods on benchmark datasets, and we will apply it to a new dataset (TBD) and simulated data to evaluate its performance for classification and prediction.<br />
<br />
We will use R for analysis.<br />
<br />
Link to paper: [http://papers.nips.cc/paper/8520-adapting-neural-networks-for-the-estimation-of-treatment-effects]</div>
<hr />
<div>Use this format (Don’t remove Project 0)<br />
<br />
Project # 0 Group members:<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Title: Making a String Telephone<br />
<br />
Description: We use paper cups to make a string phone and talk with friends while learning about sound waves with this science project. (Explain your project in one or two paragraphs).<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 1 Group members:'''<br />
<br />
Song, Quinn<br />
<br />
Loh, William<br />
<br />
Bai, Junyue<br />
<br />
Choi, Phoebe<br />
<br />
'''Title:''' APTOS 2019 Blindness Detection<br />
<br />
'''Description:'''<br />
<br />
Our team chose the APTOS 2019 Blindness Detection Challenge from Kaggle. The goal of this challenge is to build a machine learning model that detects diabetic retinopathy by screening retina images.<br />
<br />
Millions of people suffer from diabetic retinopathy, the leading cause of blindness among working-aged adults. It is caused by damage to the blood vessels of the light-sensitive tissue at the back of the eye (retina). In rural areas where medical screening is difficult to conduct, it is challenging to detect the disease efficiently. Aravind Eye Hospital hopes to utilize machine learning techniques to gain the ability to automatically screen images for disease and provide information on how severe the condition may be.<br />
<br />
Our team plans to solve this problem by applying our knowledge in image processing and classification.<br />
<br />
<br />
----<br />
<br />
'''Project # 2 Group members:'''<br />
<br />
Li, Dylan<br />
<br />
Li, Mingdao<br />
<br />
Lu, Leonie<br />
<br />
Sharman,Bharat<br />
<br />
'''Title:''' Risk prediction in life insurance industry using supervised learning algorithms<br />
<br />
'''Description:'''<br />
<br />
In this project, we aim to replicate and possibly improve upon the work of Jayabalan et al. in their paper “Risk prediction in life insurance industry using supervised learning algorithms”. We will be using the Prudential Life Insurance Data Set that the authors have used and have shared with us. We will be pre-processing the data to replace missing values, using feature selection using CFS and feature reduction using PCA use this processed data to perform Classification via four algorithms – Neural Networks, Random Tree, REPTree and Multiple Linear Regression. We will compare the performance of these Algorithms using MAE and RMSE metrics and come up with visualizations that can explain the results easily even to a non-quantitative audience. <br />
<br />
Our goal behind this project is to learn applying the algorithms that we learned in our class to an industry dataset and come up with results that we can aid better, data-driven decision making.<br />
<br />
----<br />
<br />
'''Project # 3 Group members:'''<br />
<br />
Parco, Russel<br />
<br />
Sun, Scholar<br />
<br />
Yao, Jacky<br />
<br />
Zhang, Daniel<br />
<br />
'''Title:''' Lyft Motion Prediction for Autonomous Vehicles<br />
<br />
'''Description:''' <br />
<br />
Our team has decided to participate in the Lyft Motion Prediction for Autonomous Vehicles Kaggle competition. The aim of this competition is to build a model which given a set of objects on the road (pedestrians, other cars, etc), predict the future movement of these objects.<br />
<br />
Autonomous vehicles (AVs) are expected to dramatically redefine the future of transportation. However, there are still significant engineering challenges to be solved before one can fully realize the benefits of self-driving cars. One such challenge is building models that reliably predict the movement of traffic agents around the AV, such as cars, cyclists, and pedestrians.<br />
<br />
Our aim is to apply classification techniques learned in class to optimally predict how these objects move.<br />
<br />
----<br />
<br />
'''Project # 4 Group members:'''<br />
<br />
Chow, Jonathan<br />
<br />
Dharani, Nyle<br />
<br />
Nasirov, Ildar<br />
<br />
'''Title:''' Classification with Abstinence<br />
<br />
'''Description:''' <br />
<br />
We seek to implement the algorithm described in [https://papers.nips.cc/paper/9247-deep-gamblers-learning-to-abstain-with-portfolio-theory.pdf Deep Gamblers: Learning to Abstain with Portfolio Theory]. The paper describes augmenting classification problems to include the option of abstaining from making a prediction when confidence is low.<br />
<br />
Medical imaging diagnostics is a field in which classification could assist professionals and improve life expectancy for patients through increased accuracy. However, there are also severe consequences to incorrect predictions. As such, we also hope to apply the algorithm implemented to the classification of medical images, specifically instances of normal and pneumonia [https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia? chest x-rays]. <br />
<br />
----<br />
<br />
'''Project # 5 Group members:'''<br />
<br />
Jones, Hayden<br />
<br />
Leung, Michael<br />
<br />
Haque, Bushra<br />
<br />
Mustatea, Cristian<br />
<br />
'''Title:''' Combine Convolution with Recurrent Networks for Text Classification<br />
<br />
'''Description:''' <br />
<br />
Our team chose to reproduce the paper [https://arxiv.org/pdf/2006.15795.pdf Combine Convolution with Recurrent Networks for Text Classification] on Arxiv. The goal of this paper is to combine CNN and RNN architectures in a way that more flexibly combines the output of both architectures other than simple concatenation through the use of a “neural tensor layer” for the purpose of improving at the task of text classification. In particular, the paper claims that their novel architecture excels at the following types of text classification: sentiment analysis, news categorization, and topical classification. Our team plans to recreate this paper by working in pairs of 2, one pair to implement the CNN pipeline and the other pair to implement the RNN pipeline. We will be working with Tensorflow 2, Google Collab, and reproducing the paper’s experimental results with training on the same 6 publicly available datasets found in the paper.<br />
<br />
----<br />
<br />
'''Project # 6 Group members:'''<br />
<br />
Chin, Ruixian<br />
<br />
Ong, Jason<br />
<br />
Chiew, Wen Cheen<br />
<br />
Tan, Yan Kai<br />
<br />
'''Title:''' Mechanisms of Action (MoA) Prediction<br />
<br />
'''Description:''' <br />
<br />
Our team chose to participate in a Kaggle research challenge "Mechanisms of Action (MoA) Prediction". This competition is a project within the Broad Institute of MIT and Harvard, the Laboratory for Innovation Science at Harvard (LISH), and the NIH Common Funds Library of Integrated Network-Based Cellular Signatures (LINCS), present this challenge with the goal of advancing drug development through improvements to MoA prediction algorithms.<br />
----<br />
<br />
'''Project # 7 Group members:'''<br />
<br />
Ren, Haotian <br />
<br />
Cheung, Ian Long Yat<br />
<br />
Hussain, Swaleh <br />
<br />
Zahid, Bin, Haris <br />
<br />
'''Title:''' Transaction Fraud Detection <br />
<br />
'''Description:''' <br />
<br />
Protecting people from fraudulent transactions is an important topic for all banks and internet security companies. This Kaggle project is based on the dataset from IEEE Computational Intelligence Society (IEEE-CIS). Our objective is to build a more efficient model in order to recognize each fraud transaction with a higher accuracy and higher speed.<br />
----<br />
<br />
'''Project # 8 Group members:'''<br />
<br />
ZiJie, Jiang<br />
<br />
Yawen, Wang<br />
<br />
DanMeng, Cui<br />
<br />
MingKang, Jiang<br />
<br />
'''Title:''' Lyft Motion Prediction for Autonomous Vehicles <br />
<br />
'''Description:'''<br />
<br />
Our team chose to participate in the Kaggle Challenge "Lyft Motion Prediction for Autonomous Vehicles". We will apply our science skills to build motion prediction models for self-driving vehicles. The model will be able to predict the movement of traffic agents around the AV, such as cars, cyclists, and pedestrians. The goal of this competition is to predict the trajectories of other traffic participants.<br />
<br />
----------------------------------------------------------------------<br />
<br />
<br />
'''Project # 9 Group members:'''<br />
<br />
Banno, Dion <br />
<br />
Battista, Joseph<br />
<br />
Kahn, Solomon <br />
<br />
'''Title:''' Increasing Spotify user engagement through predictive personalization<br />
<br />
'''Description:''' <br />
<br />
Our project is an application of classification to the domain of predictive personalization. The goal of the project is to increase Spotify user engagement through data-driven methods. Given a set of users’ demographic data, listening preferences and behaviour, our goal is to build a recommendation system that suggests new songs to users. From a potential pool of songs to suggest, the final song recommendations will be driven by a classification algorithm that measures a given user’s propensity to like a song. We plan on leveraging the Spotify Web API to gather data about songs and collecting user data from consenting peers.<br />
<br />
<br />
-----------------------------------------------------------------------<br />
<br />
'''Project # 10 Group members:'''<br />
<br />
Qing, Guo <br />
<br />
Wang, Yuanxin<br />
<br />
James, Ni<br />
<br />
Xueguang, Ma<br />
<br />
'''Title:''' Mechanisms of Action (MoA) Prediction<br />
<br />
'''Description:''' <br />
<br />
Our team has decided to participate in the Mechanisms of Action (MoA) Prediction Kaggle competition. This is a challenge with the goal of advancing drug development through improvements to MoA prediction algorithms.<br />
Our team plans to develop an algorithm that predicts a compound's MoA given its cellular signature; along the way, our goal is to apply the various algorithms taught in this course.<br />
<br />
<br />
-----------------------------------------------------------------------<br />
<br />
'''Project # 11 Group members:'''<br />
<br />
Yang, Jiwon <br />
<br />
Mahdi, Anas<br />
<br />
Thibault, Will<br />
<br />
Lau, Jan<br />
<br />
'''Title:''' Application of classification in human fatigue analysis<br />
<br />
'''Description:''' <br />
<br />
The goal of this project is to classify different levels of fatigue based on motion capture (Vicon) and force plate data. First, we plan to have 4 to 6 participants perform squats (with or without weights), each doing at least 50 to 100 reps, and rate them on a fatigue scale. We will collect data with EMG, IMU, force plates, and Vicon. While the participants are squatting, we will ask them about their fatigue level and compare their feedback against the fatigue level recorded by EMG. The fatigue level will be on a scale of 1 to 10 (1 being not fatigued at all and 10 being unable to continue). Once the data is collected, we will classify the motion capture and force plate data into the different levels of fatigue.<br />
<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 12 Group members:'''<br />
<br />
Xiaolan Xu, <br />
<br />
Robin Wen, <br />
<br />
Yue Weng, <br />
<br />
Beizhen Chang<br />
<br />
'''Title:''' Identification (Classification) of Submillimetre Galaxies Based on Multiwavelength Data in Astronomy<br />
<br />
'''Description:''' <br />
<br />
Identifying the counterparts of submillimetre galaxies (SMGs) in multiwavelength images is important to the study of galaxy evolution in astronomy. However, obtaining a statistically significant sample of robust associations is very challenging because of the poor angular resolution of single-dish submillimetre facilities: we cannot tell which galaxy is actually responsible for the submillimetre emission among a group of possible candidates. Recently, a labelled dataset was obtained from ALMA, a millimetre/submillimetre telescope array with sufficient resolution to pin down the exact source of submillimetre emission. However, applying such an array to a large fraction of the sky is not feasible, so it is of practical interest to develop algorithms that identify SMGs from the other available data. With this newly labelled dataset from ALMA, it is possible to test and develop new algorithms and apply them to unlabelled data to detect submillimetre galaxies.<br />
<br />
In our work, we primarily build on the work of Liu et al. (https://arxiv.org/abs/1901.09594), who tested a set of standard classification algorithms on this dataset. We aim to first reproduce their results and test other classification algorithms from a more statistics-centred perspective. Next, we hope to extend their work in one or more of the following directions: (1) incorporating other relevant features to augment the dimensions of the available dataset for a better classification rate; (2) taking measurement error into the classification algorithms, possibly through a Bayesian approach (all features in astronomy datasets come from physical measurements with error bars, but it is not clear how to incorporate this error into the classification task); (3) combining traditional astronomy approaches with ML algorithms.<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 13 Group members:'''<br />
<br />
<br />
Zihui (Betty) Qin,<br />
<br />
Wenqi (Maggie) Zhao,<br />
<br />
Muyuan Yang,<br />
<br />
Amartya (Marty) Mukherjee,<br />
<br />
'''Title:''' Insider Trading Roles Classification Prediction on United States conventional stock or non-derivative transaction<br />
<br />
'''Description:'''<br />
<br />
Background (why we were interested in classifying based on insiders): <br />
The United States has one of the most actively traded financial markets in the world. The dataset captures all insider activities as reported on SEC (U.S. Securities and Exchange Commission) forms 3, 4, 5, and 144. We believe that using variables such as transaction date, security type, and transaction amount, we can predict the role code for a new transaction. The reason for this choice of prediction target is that the role of the insider gives investors signals of potential internal activities and private information. This is crucial for investors who want to detect important market signals from insider trading activities, so that they can benefit from the market. <br />
<br />
Goal: To classify the role of an insider in a company based on the data of their trades.<br />
<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 14 Group members:'''<br />
<br />
Jung, Kyle<br />
<br />
Kim, Dae Hyun<br />
<br />
Lee, Stan<br />
<br />
Lim, Seokho<br />
<br />
'''Title:''' Mechanisms of Action (MoA) Prediction Competition<br />
<br />
'''Description:''' The main objective of this Kaggle competition is to help develop an algorithm that predicts a compound's MoA given its cellular signature, helping scientists advance the drug discovery process. Our plan is to apply concepts and algorithms learned in STAT 441, framing the task as multi-label classification. Through the process, our team will learn the biological knowledge necessary to complete and enhance our classification thought process. https://www.kaggle.com/c/lish-moa<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 15 Group Members:'''<br />
<br />
Li, Evan<br />
<br />
Abuaisha, Karam<br />
<br />
Vadivelu, Nicholas<br />
<br />
Pu, Jason<br />
<br />
'''Title:''' Predict Students Answering Ability Kaggle Competition<br />
<br />
'''Description:'''<br />
<br />
https://www.kaggle.com/c/riiid-test-answer-prediction<br />
We plan on tackling this Kaggle competition that revolves around classifying whether students are able to answer their next questions correctly. The data provided consists of the student’s historic performance, the performance of other students on the same question, metadata about the question itself, and more. The theme of the competition is to tailor education to a student’s ability as an AI tutor.<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 16 Group members:'''<br />
<br />
Hall, Matthew<br />
<br />
Chalaturnyk, Johnathan<br />
<br />
'''Title:''' Predicting CO and NOx emissions from gas turbines: novel data and a benchmark PEMS<br />
<br />
'''Description:'''<br />
<br />
Predictive emission monitoring systems (PEMS) are used in conjunction with measurement instruments to predict the amount of emissions emitted from gas turbine engines. The implementation of such a system relies on the availability of proper measurements and ecological data points. We will attempt to adjust the novel PEMS implementation from this paper in the hopes of improving the prediction of CO and NOx emission levels from the turbines. Using data points collected over the previous five years, we will apply a number of machine learning algorithms and discuss possible future research areas. Finally, we will compare our methods against the benchmark presented in the paper in order to measure the effectiveness of our solutions.<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 17 Group members:'''<br />
<br />
Yang, Junyi<br />
<br />
Wang, Jill Yu Chieh<br />
<br />
Wu, Yu Min<br />
<br />
Li, Calvin<br />
<br />
'''Title:''' Humpback Whale Identification<br />
<br />
'''Description:'''<br />
<br />
Our team will participate in the Kaggle challenge, Humpback Whale Identification. The main objective is to build a multi-class classification model that identifies a whale's class based on its tail. There are over 3,000 classes and 25,361 training images, so the challenge is that each class has only about 8 training images on average. <br />
<br />
------------------------------------------------------------------------<br />
'''Project # 18 Group members:''' <br />
<br />
Lian, Jinjiang <br />
<br />
Zhu, Yisheng <br />
<br />
Huang, Mingzhe <br />
<br />
Hou, Jiawen <br />
<br />
'''Title:''' Mechanisms of Action (MoA) Prediction <br />
<br />
'''Description:''' <br />
<br />
The final project of our team is the ongoing Kaggle competition, Mechanism of Action (MoA) Prediction. The goal is to improve MoA prediction algorithms to assist and advance drug development. MoA algorithms help scientists find more targeted drug molecules based on the biological mechanism of a disease, which could strongly shorten the drug development cycle. Here, MoA means applying different drugs to human cells and analyzing the corresponding reactions; the dataset provides simultaneous measurements of 100 types of human cells across 5,000 drugs. <br />
<br />
To tackle this competition, after data cleaning and feature engineering, we are going to try a selection of ML algorithms such as logistic regression, tree-based methods, and SVM, and find the one that best completes the task. Depending on how we perform, we might utilize other techniques such as model ensembling or stacking.<br />
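The model-selection loop described above might look like the following minimal sketch, with synthetic data standing in for the real MoA features (the model list, parameters, and data are placeholders of ours, not competition code):

```python
# Compare a few candidate classifiers by cross-validated accuracy
# on synthetic stand-in data, then pick the best performer.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
    "svm": SVC(),
}
scores = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}
best = max(scores, key=scores.get)
print(scores, "-> best:", best)
```

In the real task this loop would sit after the cleaning and feature-engineering steps, and ensembling or stacking would combine the strongest candidates rather than pick just one.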
<br />
------------------------------------------------------------------------<br />
'''Project # 19 Group members:''' <br />
<br />
Fagan, Daniel <br />
<br />
Brooke, Cooper <br />
<br />
Perelman, Maya <br />
<br />
'''Title:''' Mechanisms of Action (MoA) Prediction (https://www.kaggle.com/c/lish-moa/overview/description)<br />
<br />
'''Description:''' <br />
<br />
For our final project, we will be competing in the Mechanisms of Action (MoA) Prediction research challenge on Kaggle. MoA refers to a description of the biological activity of a given molecule, and scientists have a particular interest in the MoA of molecules as it pertains to drug development: under newer frameworks, scientists seek to develop molecules that can modulate protein targets associated with given diseases. Our task will be to analyze a dataset containing human cellular responses to more than 5,000 drugs and to classify these responses with one or more MoA labels.<br />
<br />
For this competition, we plan to use various classification algorithms taught in STAT 441, followed by model validation techniques, to ultimately select the most accurate model based on the logarithmic loss function specified by Kaggle.<br />
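Our understanding of the competition metric is a column-wise mean of the binary log loss over all MoA targets; a minimal NumPy sketch with toy arrays (this is our own illustration, not Kaggle's official scoring code):

```python
# Column-wise mean binary log loss over multi-label targets.
import numpy as np

def moa_log_loss(y_true, y_pred, eps=1e-15):
    """Mean over MoA targets of the binary log loss.
    y_true: (n, m) 0/1 labels; y_pred: (n, m) predicted probabilities."""
    p = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    per_target = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)).mean(axis=0)
    return float(per_target.mean())

y_true = np.array([[1, 0], [0, 1]])
y_pred = np.array([[0.8, 0.1], [0.2, 0.7]])
print(round(moa_log_loss(y_true, y_pred), 4))  # → 0.2271
```

Lower values are better; a model that predicts the true labels exactly scores essentially zero.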
<br />
------------------------------------------------------------------------<br />
'''Project # 20 Group members:''' <br />
Cheng, Leyan<br />
<br />
Dai, Mingyan<br />
<br />
Jiang, Daniel <br />
<br />
Huang, Jerry<br />
<br />
'''Title:''' Riiid! Answer Correctness Prediction<br />
<br />
'''Description:'''<br />
<br />
We will be competing in the Riiid! Kaggle Challenge, whose goal is to create algorithms for "Knowledge Tracing," the modeling of student knowledge over time, and to accurately predict how students will perform on future interactions.<br />
<br />
We plan on using the classification techniques and model validation techniques learned in the course in order to design an algorithm that can accurately predict the actions of students.<br />
<br />
------------------------------------------------------------------------<br />
'''Project # 21 Group members:''' <br />
<br />
Carson, Emilee<br />
<br />
Ellmen, Isaac<br />
<br />
Mohammadrezaei, Dorsa<br />
<br />
<br />
'''Title:''' Classifying SARS-CoV-2 region of origin based on DNA/RNA sequence<br />
<br />
'''Description:'''<br />
<br />
Determining the location of origin for a viral sequence is an important tool for epidemiological tracking. Knowing where a virus comes from allows epidemiologists to track how a virus is spreading. There are significant efforts to track the spread of SARS-CoV-2. As an RNA virus, SARS-CoV-2 mutates frequently. Most of these mutations carry negligible changes to the function of the virus but act as “barcodes” for specific strains. As the virus spreads in a region, it picks up mutations which allow researchers to identify which sequences correspond to which regions.<br />
<br />
The standard method for classifying viruses based on location is to:<br />
<br />
- Perform a multiple sequence alignment (MSA)<br />
<br />
- Build a phylogenetic tree of the MSA<br />
<br />
- Empirically determine which regions have which sections of the tree<br />
<br />
Phylogenetic trees are an excellent tool for tracking evolutionary changes over time but we wonder if there are better methods for classifying the region of origin for a virus using machine learning techniques.<br />
<br />
Our plan is to perform PCA on the MSA which is available through GISAID. We will determine an appropriate encoding for sequence alignments to vectors and map the aligned sequences onto a much lower dimensional space. We will then use LDA or QDA to classify points based on region (continent). We will also examine if the same technique works well for classifying sequences based on state of origin for samples from the United States. We may try other classification techniques such as logistic regression or neural nets. Finally, we know that projecting data to a small number of principal components and then projecting back to the original space can reduce noise in certain datasets. In the case of mutations, this might correspond to removing insignificant mutations. It is possible that there are certain mutations which induce functional changes in the virus which would be of greater medical interest. Our hope is that we could detect these using PCA.<br />
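The encoding-plus-classification pipeline above can be sketched as follows, with toy aligned sequences and region labels standing in for the GISAID alignment (the alphabet, encoding, and data here are illustrative assumptions of ours):

```python
# One-hot encode aligned sequences, reduce with PCA,
# then classify region of origin with LDA.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

ALPHABET = "ACGT-"  # aligned nucleotide symbols, including the gap character

def one_hot(seqs):
    """Encode equal-length aligned sequences as flat 0/1 vectors."""
    idx = {c: i for i, c in enumerate(ALPHABET)}
    X = np.zeros((len(seqs), len(seqs[0]) * len(ALPHABET)))
    for r, s in enumerate(seqs):
        for p, c in enumerate(s):
            X[r, p * len(ALPHABET) + idx.get(c, 4)] = 1.0  # unknowns -> gap
    return X

# Toy aligned sequences and region labels (illustrative only)
seqs = ["ACGTAC", "ACGTAA", "ACCTAC", "TCGTAC", "TCGTAA", "TCCTAC"]
regions = ["EU", "EU", "EU", "NA", "NA", "NA"]

X = one_hot(seqs)
Z = PCA(n_components=2).fit_transform(X)            # low-dimensional embedding
clf = LinearDiscriminantAnalysis().fit(Z, regions)  # classify by region
print(clf.predict(Z))
```

The denoising idea at the end of the plan corresponds to `PCA(...).inverse_transform(Z)`: projecting back from a few components discards variation, which in this setting may correspond to discarding incidental mutations.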
<br />
------------------------------------------------------------------------<br />
'''Project # 22 Group members:''' <br />
<br />
Chang, Luwen<br />
<br />
Yu, Qingyang<br />
<br />
Kong, Tao <br />
<br />
Sun, Tianrong<br />
<br />
'''Title:''' Riiid! Answer Correctness Prediction<br />
<br />
'''Description:'''<br />
<br />
For the final project, we chose the featured Kaggle Competition named Riiid! Answer Correctness Prediction. The purpose of this challenge is to build a machine learning model to predict the students' interaction performance. (https://www.kaggle.com/c/riiid-test-answer-prediction)<br />
<br />
We plan to use classification and regression techniques learned in this course to build the model, and the area under the ROC curve to evaluate it.<br />
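For the evaluation step, a minimal sketch using scikit-learn's `roc_auc_score` on held-out predictions (the labels and scores below are made-up placeholders):

```python
# AUC = probability a random positive is scored above a random negative.
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0])               # answered correctly?
y_score = np.array([0.9, 0.2, 0.7, 0.4, 0.3, 0.6])  # model probabilities
print(roc_auc_score(y_true, y_score))
```

Here 8 of the 9 positive/negative pairs are ranked correctly, giving an AUC of 8/9 ≈ 0.89.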
<br />
------------------------------------------------------------------------<br />
'''Project # 23 Group members:''' <br />
<br />
Han, Jihoon<br />
<br />
Vera De Casey<br />
<br />
Jawad Solaiman<br />
<br />
'''Title:''' Lyft Motion Prediction for Autonomous Vehicles<br />
<br />
'''Description:'''<br />
<br />
We are planning to compete in the Lyft Motion Prediction for Autonomous Vehicles Challenge on Kaggle. Our goal is to build a motion prediction model for the self-driving car, using our machine learning knowledge and the provided training and testing datasets. The motion prediction model will predict the motion of traffic agents around the car, such as cars, cyclists, and pedestrians. We are not sure if we have to classify the agents into the three categories (cars, cyclists, pedestrians) ourselves; if so, we will start with a single-shot detector algorithm and improve on it from there.<br />
<br />
------------------------------------------------------------------------<br />
'''Project # 24 Group members:''' <br />
<br />
Guanting Pan<br />
<br />
Haocheng Chang <br />
<br />
Zaiwei Zhang<br />
<br />
'''Title:''' Reproducing result in Accelerated Stochastic Power Iteration<br />
<br />
'''Description:'''<br />
<br />
As our final project, we will reproduce the stochastic PCA algorithm designed by De Sa, He, Mitliagkas, Ré, and Xu to accelerate the iteration complexity of power iteration. By doing so, we aim to achieve the final rate of 𝒪(1/sqrt(Δ)) reported for their method. We also hope to explore and discuss the potential of applying such acceleration methods to other non-convex optimization problems, as mentioned in the original paper, if there is additional time. Link to the paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6557638/pdf/nihms-993807.pdf<br />
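As a starting point for the reproduction, here is a deterministic sketch of power iteration with momentum, the building block the paper accelerates stochastically (the test matrix, momentum value, and iteration count are illustrative choices of ours, not the paper's experiments):

```python
# Power iteration with a momentum term:
#   x_{t+1} = A x_t - beta * x_{t-1},
# with joint rescaling of (x_{t-1}, x_t) for numerical stability.
import numpy as np

def power_iteration_momentum(A, beta, iters=200, seed=0):
    """Estimate the top eigenvector of a symmetric matrix A."""
    rng = np.random.default_rng(seed)
    x_prev = rng.standard_normal(A.shape[0])
    x = A @ x_prev
    for _ in range(iters):
        x_next = A @ x - beta * x_prev
        scale = np.linalg.norm(x_next)
        # Divide both iterates by the same factor: the linear recursion
        # is scale-invariant, so the direction is preserved.
        x_prev, x = x / scale, x_next / scale
    return x / np.linalg.norm(x)

A = np.diag([3.0, 2.0, 1.0])  # top eigenvector is e1
v = power_iteration_momentum(A, beta=1.0)  # beta = (lambda_2 / 2)^2 = 1
print(np.abs(v))
```

With the momentum parameter set near (λ₂/2)², the dominant component grows geometrically while the rest stay bounded, so the iterate aligns with e₁ much faster than plain power iteration.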
<br />
------------------------------------------------------------------------<br />
'''Project # 25 Group members:''' <br />
<br />
Haoran Dong<br />
<br />
Mushi Wang<br />
<br />
Siyuan Qiu<br />
<br />
Yan Yu<br />
<br />
'''Title:''' Lyft Motion Prediction for Autonomous Vehicles<br />
<br />
'''Description:'''<br />
<br />
We want to take part in the Kaggle Challenge "Lyft Motion Prediction for Autonomous Vehicles". The goal is to build a motion prediction model for self-driving cars using machine learning on the datasets provided.<br />
<br />
------------------------------------------------------------------------<br />
'''Project # 26 Group members:''' <br />
<br />
Sangeeth Kalaichanthiran<br />
<br />
Evan Peters<br />
<br />
Cynthia Mou<br />
<br />
Yuxin Wang<br />
<br />
'''Title:''' Mechanisms of Action (MoA) Prediction<br />
<br />
'''Description:'''<br />
<br />
Our team chose the "Mechanisms of Action (MoA) Prediction" challenge on Kaggle. Mechanism of action, MoA for short, describes the biological response of human cells to a particular molecule (the drug). The goal is to develop an algorithm that can predict the biological response to a drug based on its similarities to other known drugs. <br />
<br />
Our team hopes to develop a superior algorithm by using our knowledge of supervised learning methods.<br />
<br />
------------------------------------------------------------------------<br />
'''Project # 27 Group members:''' <br />
<br />
Delaney Smith<br />
<br />
Mohammad Assem Mahmoud<br />
<br />
'''Title:''' Replicating "Electrocardiogram heartbeat classification based on a deep convolutional<br />
neural network and focal loss"<br />
<br />
'''Description:'''<br />
<br />
For our project, we intend to replicate and, hopefully, extend the work of Romdhane et al.'s 2020 paper "Electrocardiogram heartbeat classification based on a deep convolutional neural network and focal loss". In this paper, the authors develop a deep convolutional neural network that exploits a novel loss function, focal loss, to classify heartbeats into five arrhythmia categories (N, S, V, Q and F) based on the AAMI standard. The network was trained and tested against two ECG datasets, MIT-BIH and INCART, and returned a 98.41% overall accuracy, a 98.38% overall F1-score, a 98.37% overall precision and a 98.41% overall recall, which we intend to replicate. <br />
Interestingly, focal loss was implemented to prevent bias towards the larger classes (normal heartbeats) without needing to augment the smaller-class data (diseased heartbeats); however, the authors did not outline which of these two approaches actually performs better. Therefore, we hope to extend their work by answering this question in this project.<br />
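As a concrete reference point, the focal loss originally proposed by Lin et al. (2017), which the paper adapts, is FL(p_t) = −α(1−p_t)^γ log(p_t); here is our own illustrative NumPy version with toy numbers, not the authors' code:

```python
# Focal loss: gamma > 0 down-weights easy (well-classified) examples,
# so rare classes are not swamped by the majority class.
import numpy as np

def focal_loss(probs, labels, gamma=2.0, alpha=1.0, eps=1e-12):
    """Mean focal loss. probs: (n, k) softmax outputs; labels: (n,) ints."""
    p_t = np.clip(probs[np.arange(len(labels)), labels], eps, 1.0)
    return float(np.mean(-alpha * (1.0 - p_t) ** gamma * np.log(p_t)))

probs = np.array([[0.9, 0.05, 0.05],   # easy example (confidently correct)
                  [0.4, 0.3, 0.3]])    # hard example (barely correct)
labels = np.array([0, 0])

# With gamma=2 the easy example contributes (0.1)^2 * -log(0.9) ≈ 0.001,
# the hard one (0.6)^2 * -log(0.4) ≈ 0.33: the hard case dominates.
print(focal_loss(probs, labels))
```

Setting `gamma=0` recovers ordinary cross-entropy, which treats both examples far more equally; that contrast is exactly the bias-toward-large-classes question we want to compare against data augmentation.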
------------------------------------------------------------------------<br />
'''Project # 28 Group members:''' <br />
<br />
Fang Yuqin<br />
<br />
Fu Rao<br />
<br />
Li Siqi<br />
<br />
Zhou Zeping<br />
<br />
'''Title:''' The Spectrum of the Fisher Information Matrix of a Single-Hidden-Layer Neural Network<br />
<br />
'''Description:'''<br />
Our group aims to dig deeper into single-hidden-layer neural networks, building on what we have learned in class. We will focus on Gaussian-distributed data and weights, so that we can derive expressions for the spectrum of the Fisher information matrix in the limit of infinite width. We believe the spectrum can be used to improve the efficiency of first-order optimization methods. <br />
------------------------------------------------------------------------<br />
'''Project # 29 Group members:''' <br />
<br />
Rui Gong<br />
<br />
Xuetong Wang<br />
<br />
Xinqi Ling<br />
<br />
Di Ma<br />
<br />
'''Title:''' Convolutional Neural Network for Rainy Day Prediction<br />
<br />
'''Description:'''<br />
<br />
Our project is an application of convolutional neural networks to rain prediction. The goal is to predict whether tomorrow will be a rainy day using the past week's historical data and indicators such as temperature. We plan to obtain the past weather data through the Yahoo web API.<br />
------------------------------------------------------------------------<br />
'''Project # 30 Group members:''' <br />
<br />
Jiabao Dong<br />
<br />
Jiaxiang Liu<br />
<br />
Siyuan Xia<br />
<br />
Yipeng Du<br />
<br />
'''Title:''' Privacy-Preserving Classification of Personal Text Messages with Secure Multi-Party Computation<br />
<br />
'''Description:'''<br />
We aim to replicate the work demonstrated in [https://papers.nips.cc/paper/8632-privacy-preserving-classification-of-personal-text-messages-with-secure-multi-party-computation.pdf Privacy-Preserving Classification of Personal Text Messages with Secure Multi-Party Computation]. <br />
<br />
Personal text classification has many useful applications such as mental health care and security surveillance, but also raises concerns about personal privacy. The method proposed in this paper is based on Secure Multiparty Computation (SMC) and avoids (un)intentional privacy violations. The method then extracts features from texts and classifies with logistic regression and tree ensembles. This paper claims to have proposed the first privacy-preserving (PP) solution for text classification that is provably secure.<br />
<br />
'''Project # 31 Group members:''' <br />
<br />
Grace Tompkins<br />
<br />
Tatianna Krikella<br />
<br />
'''Title:''' An application of Adapting Neural Networks for the Estimation of Treatment Effects (Shi, Blei, and Veitch 2019)<br />
'''Description:'''<br />
We will be using the methodology presented in "Adapting Neural Networks for the Estimation of Treatment Effects" by Claudia Shi, David M. Blei, and Victor Veitch and applying it to a new dataset and simulated data. This method estimates treatment effects from observational data via an architecture called "Dragonnet", which uses propensity scoring for estimation adjustment and targeted regularization. The method has been shown to outperform existing methods on benchmark datasets, and we will apply it to a new dataset (TBD) and simulated data to evaluate its performance for classification and prediction.<br />
<br />
We will use R for analysis.<br />
<br />
Link to paper: [http://papers.nips.cc/paper/8520-adapting-neural-networks-for-the-estimation-of-treatment-effects]</div>Gtompkinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=f11Stat841proposal&diff=42772f11Stat841proposal2020-10-11T18:56:28Z<p>Gtompkin: /* By: Grace Tompkins and Tatiana Krikella */</p>
<hr />
<div></noinclude><br />
==Project 1 : Classification of Disease Status== <br />
===By: Lai, ChunWei and Greg Pitt===<br />
For our classification project, we are proposing an application in the<br />
medical diagnosis field: For each patient or lab animal, there will<br />
be results from a large number of genetic and/or chemical tests. We<br />
should be able to predict the disease state of the patient/animal,<br />
based on the presence or absence of certain biomarkers and/or chemical<br />
markers.<br />
<br />
Our project work will include the reduction of dimensionality, and the<br />
development of one or more classification functions, with the<br />
objectives of minimizing the error rate and also reducing the number<br />
of markers required in order to make good predictions. Our results<br />
could be used at the patient level, to help make accurate diagnoses,<br />
and at the population health level, to make epidemiological surveys of<br />
the prevalence of certain medical conditions. In both cases, the<br />
results should enable the healthcare system to make better decisions<br />
regarding the deployment of scarce healthcare resources.<br />
<br />
Our methodology will be chosen soon, after we have seen a few more<br />
examples in class. If time permits, we will also attempt a novel<br />
classification procedure of our own design.<br />
<br />
Currently we have access to a dataset from the SSC data mining<br />
section, and we hope to be able to get access to some similar, but<br />
larger, datasets before the end of the term.<br />
<br />
The software tools that we use will probably include Matlab, Python, and R.<br />
<br />
We would like to obtain publishable results if possible, but this is<br />
not a primary objective.<br />
<br />
<noinclude><br />
<br />
<br />
<br />
==Proposal 2: The Golden Retrieber==<br />
<br />
===By Cameron Davidson-Pilon and Jenn Smith===<br />
<br />
Our goal for this project is to determine statistical results from the population of Twitter users that have a specific celebrity in their display picture. Our algorithm will scan through Twitter display pictures and attempt to determine whether a picture features Canada's most famous icon: '''Justin Bieber'''. We hope that most images contain his trademark swoosh hairstyle, as much of our classification will rely on such handsome features. <br />
<br />
After we determine, with some probability of error, that a user has a Bieber Display Picture (BDP), we can then do a statistical analysis on the sample population's tweets, hashtags, etc. <br />
<br />
Applications of this algorithm include analyzing the Twitter behaviour of Bieber fans; it could be used in an app for companies that want to target this demographic. <br />
<br />
We will be using Matlab and Python.<br />
<br />
==Project 3 : Classifying Melanoma Images ==<br />
</noinclude><br />
===By: Robert Amelard===<br />
Currently, the method of manually diagnosing melanoma is very subjective. Dermatologists essentially look at a skin lesion and determine from their experience whether it looks malignant or benign. Some popular methods for diagnosis are the 7-point checklist and the ABCD rubric, both of which are based on very subjective criteria, such as the "irregularity" of a skin lesion. My project will attempt to classify an input image containing a skin lesion into the class "melanoma" or "not melanoma" based on features that these rubrics regard as high risk. This will help doctors come to a more quantitative, objectively justifiable diagnosis of patients.<br />
<br />
<br />
==Project 4 : application of ANN on handwritten character recognition ==<br />
</noinclude><br />
===By: Chen Wang; YuanHong Yu; Jia Zhou===<br />
<br />
Hand-printed names are considered one of the major approaches to authenticating a person's identity. Handwriting differs from person to person, which makes handwritten characters very difficult to recognize. Handwritten character recognition is an area of pattern recognition that has become a subject of active research. We apply an ANN to collected images of handwritten English characters to match them to the 26 letters of the alphabet. Finally, performance is evaluated on test samples.<br />
<br />
The possible software and tools we would like to use include: R, Matlab.<br />
<br />
==Project 5 : Distributed Classification and Data Fusion in Wireless Sensor Networks==<br />
</noinclude><br />
===By : Mahmoud Faraj===<br />
Wireless sensor networks (WSNs) are a recently emerging technology consisting of a large number of battery-powered sensor nodes interconnected wirelessly and capable of monitoring environments, tracking targets, and performing many other critical applications. The design and deployment of such networks are challenging tasks due to the imperfect nature of the communicating nodes (i.e., sensors). The dramatic depletion of a sensor's energy while performing regular tasks (e.g., sensing, processing, receiving and transmitting information) constitutes a major threat, shortening the lifetime of the network, because the amount of energy in each sensor is constrained by its dimensions. The lack of energy shortens the lifetime of the network, and the death of some nodes partitions it. As a result, some nodes become unable to communicate with others to accomplish the ultimate goal of the remotely deployed network.<br />
<br />
In our research work, we propose to use one of the techniques learned in the course for performing distributed classification of a moving target (e.g., a vehicle, animal, or person). Each sensor node will classify the moving target and then track it in the WSN field. To conserve power and extend the lifetime of the network, we also propose distributed (in-network) data fusion using a distributed Kalman filter, where the data are fused in the network instead of all being transmitted to the fusion centre (sink). Each node processes the data from its own set of sensors, communicating with neighbouring nodes to improve the classification and location estimates of the moving target. Simulation results will be provided to demonstrate the significant advantages of distributed classification and data fusion, and to show the improvement of the WSN as a whole.<br />
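The building block behind the proposed fusion is the standard Kalman predict-update step; a minimal one-dimensional sketch (noise variances, the random-walk motion model, and the measurements below are illustrative assumptions; the distributed version would fuse such updates across neighbouring nodes):

```python
# Scalar Kalman filter: predict under a random-walk model, then
# correct the estimate toward the new measurement by the Kalman gain.
def kalman_step(x, P, z, q=0.01, r=0.1):
    """x: state estimate, P: its variance, z: new measurement.
    q: process noise variance, r: measurement noise variance."""
    x_pred, P_pred = x, P + q          # predict (random-walk motion model)
    K = P_pred / (P_pred + r)          # Kalman gain
    x_new = x_pred + K * (z - x_pred)  # update toward measurement
    P_new = (1 - K) * P_pred           # variance shrinks after update
    return x_new, P_new

x, P = 0.0, 1.0  # uninformative prior
for z in [1.0, 1.1, 0.9, 1.05]:  # noisy position readings near 1.0
    x, P = kalman_step(x, P, z)
print(round(x, 3), round(P, 3))
```

After a few measurements the estimate settles near the true position and its variance shrinks; in the distributed setting, each node would exchange such estimates with its neighbours instead of forwarding raw data to the sink.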
<br />
==Project 6 : Skin Classification==<br />
</noinclude><br />
===By : Jeffrey Glaister===<br />
My goal for this project is to classify segments of an oversegmented image as skin or skin lesion (a two-class problem). The overall goal and application is to automatically segment the skin lesion in pictures of patients at risk of melanoma, unsupervised. A standard segmentation algorithm will be applied to the image to oversegment it; the resulting segments will then be classified as normal skin or not and merged. Of particular interest is texture and colour classification, since skin and lesions differ slightly in texture and colour. Other possible features include spatial location and initial segment size. Time permitting, novel texture and colour classifiers will be investigated. <br />
<br />
I have access to skin lesion images from a public database, some of which have been manually contoured to test the algorithm. <br />
<br />
I will be using Matlab.<br />
<br />
==Project 7 : Sentiment Analysis Controversial Topic==<br />
</noinclude><br />
===By : Samson Hu, Blair Rose, Mikhail Targonski===<br />
Classify documents according to whether they are for or against a specific topic. <br />
<br />
The possible software and tools include, but are not limited to: Matlab and R.<br />
<br />
==Project 8 : Reproducing the Results of a 1985 Paper ==<br />
</noinclude><br />
===By : Mohamed El Massad===<br />
For the CM 763 final project, I intend to reproduce the results of Keller's 1985 IEEE paper in which the Fuzzy K-Nearest Neighbour algorithm was first introduced. To develop the algorithm, the authors of the paper introduced the theory of fuzzy sets into the well-known K-nearest neighbour decision rule. Their motivation was to address a limitation of that rule: it gives the training samples equal weight in deciding the class memberships of the patterns to be classified, regardless of the degree to which these samples are representative of their own classes. The authors proposed three methods for assigning initial fuzzy memberships to the samples in the training data set, and presented experimental results and comparisons showing that their proposed algorithm outperforms the crisp version in terms of error rate. They also show that their algorithm compares well against other, more sophisticated pattern classification procedures, including the Bayes classifier and the perceptron. Finally, they develop a fuzzy analog of the nearest-prototype algorithm.<br />
<br />
The authors of the paper used FORTRAN to implement their proposed algorithms, but I will probably use MATLAB to do that and maybe some C as well.<br />
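A minimal sketch of the fuzzy K-NN decision rule described above, using crisp one-hot initial memberships and the standard inverse-distance weighting (the data, variable names, and parameter choices here are my own, for illustration):<br />

```python
import numpy as np

def fuzzy_knn(X_train, U_train, x, k=3, m=2.0):
    """Fuzzy K-NN in the style of Keller et al.: the class memberships of x
    are a distance-weighted combination of the memberships of its k nearest
    neighbours (U_train[i, c] = membership of training sample i in class c)."""
    d = np.linalg.norm(X_train - x, axis=1)
    nn = np.argsort(d)[:k]                                  # k nearest neighbours
    w = 1.0 / np.maximum(d[nn], 1e-12) ** (2.0 / (m - 1.0)) # inverse-distance weights
    u = (w[:, None] * U_train[nn]).sum(axis=0) / w.sum()
    return u  # fuzzy memberships; argmax gives the crisp label

# Toy two-class data with crisp (one-hot) initial memberships.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
U = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]], float)

u = fuzzy_knn(X, U, np.array([0.9, 0.9]))
print(u, u.argmax())  # higher membership in class 1
```

Because the neighbour memberships are convex-combined, the output memberships still sum to one, which is what gives the rule its graded "confidence" interpretation.<br />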
<br />
==Project 9 : Learning from Weak Teachers ==<br />
</noinclude><br />
===By : Nika Haghtalab===<br />
<br />
This work addresses the problem of learning when labeled data varies in quality depending on the quality of the teacher. In many scenarios we have access to "weak" but readily available teachers, but only restricted access to "stronger" teachers.<br />
<br />
We attempt to extend the current framework for such learning scenarios by introducing a hierarchy of teachers with different label qualities. We will focus on how an output classifier should be computed from the labels of such a group of teachers. We are also interested in deciding when it is economical to refer an instance to a higher-quality teacher in order to increase the performance of the classifier.<br />
<br />
==Project 10 : Classification of 3-dimensional objects ==<br />
</noinclude><br />
===By : Kenneth Webster, Soo Min Kang, and Hang Su===<br />
<br />
Solvability of 3-dimensional physical systems:<br />
Can computers determine whether a jumbled Rubik's cube can be solved?<br />
<br />
<br />
== Project 11: Learning in Robots ==<br />
</noinclude><br />
=== By: Guoting (Jane) Chang ===<br />
<br />
<br />
'''Background'''<br />
<br />
One of the long term goals in robotics is for robots (such as humanoid robots) to become useful in human environments. In order for robots to be able to perform different services within society, they will need the ability to carry out new tasks and to adapt to changing environments. This in turn requires robots to have a capacity for learning. However, existing implementations of learning in robots tend to focus on specific tasks and are not easily extended to different tasks and environments [1].<br />
<br />
'''Proposed Project'''<br />
<br />
The purpose of the proposed work is to continue developing an initial framework for a learning system that is not task or environment specific. Such a generalized learning strategy should be achievable through hierarchical knowledge abstraction and appropriate knowledge representation. At the lowest level of the hierarchy, vision techniques will be used to extract features (such as colors, contours and position information) from raw input video data. On the next level of the hierarchy,<br />
the extracted features will be combined using clustering techniques such as self-organizing maps to perform object recognition. Furthermore, in order to learn to recognize motions shown in the videos, techniques such as incremental decision trees should be investigated for performing guided clustering (i.e., clustering based on some metric). At the higher levels of the hierarchy, the sequence of motions and objects involved in the video should be represented using connectionist models such as directed graphs. <br />
<br />
The main focus of the proposed work for this project will be on the clustering of observed motions, as it is most closely related to the classification techniques that will be taught in class. An incremental decision tree is tentatively being considered for this, as the goal is to determine whether a newly observed motion belongs to a group of motions that has been seen before or whether it is a new motion and the knowledge representation should be updated to include it. Matlab or C/C++ code will most likely be used for this project.<br />
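The "seen-before vs. novel motion" decision can be sketched as a simple incremental clustering rule (a toy stand-in for the incremental decision tree mentioned above; the feature vectors and threshold are invented for illustration):<br />

```python
import numpy as np

def incremental_cluster(motions, threshold=1.0):
    """Assign each observed motion (a feature vector) to the nearest
    existing cluster centre, or start a new cluster if none is close
    enough -- i.e., update the knowledge representation on novelty."""
    centres, counts, labels = [], [], []
    for m in motions:
        if centres:
            d = np.linalg.norm(np.array(centres) - m, axis=1)
            j = int(d.argmin())
            if d[j] < threshold:
                counts[j] += 1                                  # running-mean update
                centres[j] = centres[j] + (m - centres[j]) / counts[j]
                labels.append(j)
                continue
        centres.append(m.astype(float))                         # novel motion: new cluster
        counts.append(1)
        labels.append(len(centres) - 1)
    return labels, centres

rng = np.random.default_rng(6)
# Two repeated motions plus one novel outlier.
motions = np.vstack([rng.normal([0, 0], 0.1, (5, 2)),
                     rng.normal([3, 3], 0.1, (5, 2)),
                     [[10.0, 10.0]]])
labels, centres = incremental_cluster(motions)
print(labels, len(centres))
```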
<br />
'''Reference'''<br />
<br />
[1] A. Barto, S. Singh, and N. Chentanez, "Intrinsically motivated learning of hierarchical collections of skills," in ''Third IEEE International Conference on Development and Learning.'' San Diego, California, USA: IEEE, 2004.<br />
<br />
<br />
== Project 12: Stock price forecasting ==<br />
</noinclude><br />
=== By: Zhe (Gigi) Wang, Chi-Yin (Johnny) Chow ===<br />
<br />
'''Proposal'''<br />
<br />
Under the Efficient Market Hypothesis, stock prices are completely unpredictable: any publicly available information (the semi-strong form of the EMH) is already reflected in the stock prices. However, an experienced trader can have a "feel" for, or prediction of, future prices, based on the history of prices or other factors.<br />
<br />
In this project, we will apply component analysis techniques to identify significant patterns and then employ the support vector machine as the prediction model. After applying the model to the data, we plan to evaluate the accuracy of the predictions and compare it with other state-of-the-art techniques.<br />
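As a rough sketch of the kernel-PCA-then-predict pipeline of reference [1], the following implements RBF kernel PCA in plain NumPy on a synthetic return series, with ordinary least squares standing in for the support vector machine (the data, kernel width, and component count are all invented for illustration):<br />

```python
import numpy as np

rng = np.random.default_rng(1)

def rbf_kernel_pca(X, n_components=2, gamma=0.5):
    """Project X onto the top principal components in an RBF feature space."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * sq)
    n = len(X)
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J                      # centre the kernel matrix
    vals, vecs = np.linalg.eigh(Kc)     # eigenvalues in ascending order
    vals, vecs = vals[::-1], vecs[:, ::-1]
    alphas = vecs[:, :n_components] / np.sqrt(np.maximum(vals[:n_components], 1e-12))
    return Kc @ alphas                  # projected training points

# Synthetic "return series": windows of 5 past values predict the next one.
returns = np.sin(np.arange(300) / 5.0) + rng.normal(0, 0.1, 300)
X = np.stack([returns[i:i + 5] for i in range(294)])
y = returns[5:299]

Z = rbf_kernel_pca(X, n_components=4)
# Least squares on the extracted components (the project would use an SVM/SVR here).
design = np.c_[Z, np.ones(len(Z))]
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
pred = design @ coef
print(np.corrcoef(pred, y)[0, 1])
```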
<br />
'''References'''<br />
<br />
[1] Ince, H., Trafalis, T.B., "Kernel principal component analysis and support vector machines for stock price prediction", Neural Networks, 2004. Proceedings. 2004 IEEE International Joint Conference.<br />
<br />
[2] Chi-Jie Lu, Jui-Yu Wu, Cheng-Ruei Fan, Chih-Chou Chiu, "Forecasting stock price using nonlinear independent component analysis and support vector regression", Industrial Engineering and Engineering Management, 2009. IEEM 2009. IEEE International Conference.<br />
<br />
==Project 13: UFO Sightings ==<br />
</noinclude><br />
===By Vishnu Pothugunta===<br />
<br />
There have been many UFO sightings in the past decade. The goal is to use classification methods to predict where and when UFO sightings could happen. From past data about UFO sightings, we can also try to predict the shape of the UFO and the duration of the sighting.<br />
<br />
==Project 14: Identifying Accounting Fraud Using Statistical Learning ==<br />
===By Daniel Severn===<br />
<br />
'''Proposal'''<br />
<br />
By constructing a data set of key financial ratios drawn from financial statements, I hope to create a statistical classifier that can accurately identify companies that engage in accounting fraud. The following paper has a similar goal: http://www.waset.org/journals/ijims/v3/v3-2-13.pdf. I would like to use methods from class, but perhaps also the C4.5 method, since in the linked paper it produced a greatly superior classifier.<br />
<br />
'''Relevance'''<br />
<br />
This is of obvious relevance to securities commissions, but it is also useful to any investor. By adding such a classifier to their usual methods, investors can identify suspect companies and profit from the high probability of a stock decline when the fraud is uncovered. This in turn would reduce the stock price of companies that engage in heavy management and manipulation of their financial statements, reducing or removing the incentive to manipulate financial statements, which benefits the entire financial system and thus the economy.<br />
<br />
'''Challenges'''<br />
<br />
This analysis requires a quality data set. Finding/creating such a data set may be challenging.<br />
<br />
<br />
==Project 15 : A survey on artificial neural networks (ANN)==<br />
</noinclude><br />
===By: Hojat Abdolanezhad & Carolyn Augusta===<br />
<br />
We will give a brief history of ANNs and describe how they function, explain common terms used in the ANN literature (perceptron, backpropagation, semantic net, etc.), and discuss the general philosophy of neural networks. We will survey the types of ANNs researched in the past, leading into the present, and mention new trends in this area. An application of neural networks to classification will also be discussed.<br />
<br />
'''A note'''<br />
Artificial neural networks (ANNs) have been very useful for solving real-world problems. In economics, ANNs can be applied to predict profits, market trends, and price levels from historical market data. In industry, engineers can apply ANNs to many nonlinear engineering problems, such as classification, prediction, and pattern recognition, where the tasks are very difficult to solve using conventional mathematical tools.<br />
<br />
'''Useful papers:'''<br />
<br />
H. White, "Learning in artificial neural networks: A statistical perspective," Neural Computation, vol. 1, pp. 425–464, 1989.<br />
<br />
E. Wan, "Neural network classification: A Bayesian interpretation," IEEE Transactions on Neural Networks, vol. 1, no. 4, pp. 303–305, 1990.<br />
<br />
G.P. Zhang, "Neural networks for classification: A survey," IEEE Transactions on Systems, Man, and Cybernetics, vol. 30, no. 4, pp. 451–462, 2000.<br />
<br />
C. Bishop, Neural Networks for Pattern Recognition. Oxford University Press, London, UK, 1995.<br />
<br />
B.D. Ripley, "Neural networks and related methods for classification (with discussion)," Journal of the Royal Statistical Society, Series B, vol. 56, pp. 409–456, 1994.<br />
<br />
==Project 16 : Feature extraction in news article clustering==<br />
===By Gobaan Raveendran & Daniel Nicoara ===<br />
Our project will focus on crawling the internet for news articles from many different sources and then classifying these into sets of either left- or right-wing blogs. The project's focus will be on feature extraction and determining which features are important for classification. <br />
<br />
For supervised data, we will either automatically assign classes based on the domain and check whether each article fits its predicted domain, or we will use an external system such as a topic model.<br />
<br />
==Project 17: Classification of harp seals==<br />
===By Zhikang Huang, Haoyang Fu, Mengfei Yang===<br />
Seals possess varied repertoires of underwater vocalisations. Geographic variation in call types has been reported for the Weddell and bearded seal species, and the variation has been attributed to the isolation of breeding populations within these species.<br />
<br />
Our project will focus on harp seals (Phoca groenlandica), and in particular the herds from Jan Mayen Island, the Gulf of St. Lawrence, and the Front. Our goal is to classify harp seals using data obtained from underwater recordings of harp seals in these three herds. Nine hundred calls from each of the three herds will serve as our training set, and we will use three hundred calls as our test set to evaluate our predictive model. <br />
<br />
We will use tree models or logistic regression models as our predictive model for the classification, and we will select the model with the smallest error rate. Our model will include seven predictor variables, which are as follows:<br />
<br />
'''ELEMDUR''' - this is the duration of a single element of a harp seal underwater vocalisation. It is measured in milliseconds.<br />
<br />
'''INTERDUR''' - this is the time between elements in multiple element calls. It is measured in milliseconds. Note that not all calls have multiple elements so this variable is absent in single element calls. Where absent, a value of NA is recorded in the data.<br />
<br />
'''NO_ELEM''' - this is the number of elements of the call. In harp seals all of the elements within a single call are similar and the spacing between them is constant.<br />
<br />
'''STARTFREQ''' - this is the pitch at the start of the call or the highest pitch if the call has an extremely short duration (call shape 0 below).<br />
<br />
'''ENDFRE''' - this is the pitch at the end of the call or the lowest pitch if the call has an extremely short duration (call shape 0).<br />
<br />
'''WAVEFORM''' - this codes a series of waveform shapes (a plot of amplitude vs. time) which lie more or less along a continuum. The waveform shapes are:<br />
frequency modulated sinusoidal 9<br />
slightly frequency modulated and complex 8<br />
sinusoidal (pure tone) 7<br />
complex (irregular waveform) 5<br />
amplitude pulses 4<br />
burst pulses 3<br />
knock (short burst pulse) 2<br />
click (very short duration) 1 <br />
<br />
'''CALLSHAP''' - this codes a series of call shapes as they would appear in a sonogram spectral analysis (a plot of frequency vs time). <br />
<br />
'''HERD''' - this is the herd from which the recordings were obtained. The herds are coded as follows:<br />
Jan Mayen Island 1 <br />
Gulf of St. Lawrence 2 <br />
Front 3<br />
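Since the recordings themselves are not reproduced here, the sketch below fits a multinomial logistic regression (one of the two model families mentioned above) to simulated stand-in features by plain gradient descent; all numbers are invented for illustration:<br />

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated stand-ins for the call features described above
# (ELEMDUR, NO_ELEM, STARTFREQ, ENDFRE, ...); the real project
# would use the SSC case-study recordings.
n_per_herd, n_feat, n_herds = 100, 7, 3
X = np.vstack([rng.normal(loc=h, scale=1.0, size=(n_per_herd, n_feat))
               for h in range(n_herds)])
y = np.repeat(np.arange(n_herds), n_per_herd)  # herd labels 0, 1, 2

# Multinomial logistic regression fitted by plain gradient descent.
Xb = np.c_[X, np.ones(len(X))]                  # add intercept column
W = np.zeros((Xb.shape[1], n_herds))
Y = np.eye(n_herds)[y]                          # one-hot herd labels
for _ in range(500):
    Z = Xb @ W
    P = np.exp(Z - Z.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)           # softmax probabilities
    W -= 0.1 * Xb.T @ (P - Y) / len(X)          # gradient step

pred = (Xb @ W).argmax(axis=1)
error_rate = float((pred != y).mean())
print(error_rate)
```

On the real data, the same fit-and-compare loop would be run for both the logistic model and a tree model, keeping whichever has the smaller test error rate.<br />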
<br />
'''Reference'''<br />
<br />
Terhune, J.M. (1994) Geographical variation of harp seal underwater vocalisations, Can. J. Zoology 72(5) 892-897.<br />
<br />
Statistical Society of Canada, http://www.ssc.ca/en/education/archived-case-studies/seal-vocalisations.<br />
<br />
==Project 18 : Classifying Vehicle License Plates ==<br />
===By : Jun Kai Shan, Su Rong You===<br />
Our group will focus on using statistical methods to classify Canadian vehicle license plates. We will use MATLAB to classify the letters, digits, and province of the license plates. We will take pictures of the license plates of different cars and use these pictures as our data. We will use R for further statistical analysis.<br />
<br />
<br />
==Project 19 : Ice/No-Ice Classification==<br />
===By : Steven Leigh===<br />
Modern satellites collect massive amounts of Earth imagery, far more than human analysts can interpret manually. This project will attempt to tackle the problem of automatically identifying ice and open water in satellite imagery. Multimodal data will be considered, such as multi-polarization SAR data, optical data, and thematic data, to name a few.<br />
<br />
==Project 20: A survey on Support Vector Machine==<br />
===By Monsef Tahir===<br />
The support vector machine is a training algorithm for learning classification and regression rules from data; for example, an SVM can be used to learn polynomial, radial basis function (RBF), and multi-layer perceptron (MLP) classifiers. SVMs are based on the structural risk minimisation principle, which is closely related to regularisation theory. This principle incorporates capacity control to prevent over-fitting and is thus a partial solution to the bias-variance trade-off dilemma.<br />
<br />
In this project, a survey of SVMs will be conducted, and a comparison with other tools in terms of classification and prediction performance will be made as well.<br />
<br />
<br />
==Project 21: Good/Bad-day Classification==<br />
===By Carl J. Wensater===<br />
The proposed project is to gather and parameterize, over the course of two weeks, data about daily activities such as workouts, food intake, hours of sleep, amount of spare time, etc. This data will be used as training data for a two-class classifier that will try to distinguish good days from bad days. If the classification is successful, statistical analysis can be used to identify the crucial components of a good day.<br />
<br />
<br />
==Project 22: SVM-Based Classification of Peer-to-Peer Internet Traffic==<br />
===By: Talieh Seyed Tabatabaei===<br />
<br />
In recent years, Peer-to-Peer (P2P) file-exchange applications have overtaken Web applications as the major contributor of traffic on the Internet. Recent estimates put the volume of P2P traffic at 70% of the total broadband traffic. P2P is often used for illegally sharing copyrighted music, video, games, and software. The legal ramifications of this traffic, combined with its aggressive use of network resources, have created a strong need for identifying network traffic by application type. This task, referred to as traffic classification, is a prerequisite to many network management and traffic engineering problems.<br />
<br />
In this project, least-squares support vector machines (LS-SVM) will be adopted to identify P2P traffic using flow-based statistical features.<br />
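A minimal sketch of the LS-SVM idea, with invented stand-ins for the flow features: unlike the standard SVM's quadratic program, the LS-SVM dual reduces to a single linear system, which the code below solves directly.<br />

```python
import numpy as np

rng = np.random.default_rng(3)

def rbf(A, B, gamma=0.5):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def lssvm_train(X, y, gamma=0.5, C=10.0):
    """LS-SVM classifier (y in {-1, +1}): solve the bordered linear
    system  [0  y^T; y  Omega + I/C][b; alpha] = [0; 1]."""
    n = len(X)
    Omega = np.outer(y, y) * rbf(X, X, gamma)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = Omega + np.eye(n) / C
    sol = np.linalg.solve(A, np.r_[0.0, np.ones(n)])
    return sol[0], sol[1:]          # bias b, dual weights alpha

def lssvm_predict(X_train, y, b, alpha, X_new, gamma=0.5):
    return np.sign(rbf(X_new, X_train, gamma) @ (alpha * y) + b)

# Toy stand-ins for flow features (e.g. mean packet size, duration);
# class +1 = "P2P", class -1 = "other".
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(4, 1, (40, 2))])
y = np.r_[np.ones(40), -np.ones(40)]
b, alpha = lssvm_train(X, y)
acc = float((lssvm_predict(X, y, b, alpha, X) == y).mean())
print(acc)
```

The price of this convenience is that every training point becomes a support vector (the equality constraints replace the SVM's inequality constraints), which matters for large flow datasets.<br />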
<br />
==Project 23: A survey on a fast learning algorithm on deep belief nets==<br />
===By: Seyed Seifi===<br />
<br />
Learning is difficult in densely connected, directed belief nets that have many hidden layers because it is difficult to infer the conditional distribution<br />
of the hidden activities when given a data vector. Variational methods use simple approximations to the true conditional distribution, but the approximations may be poor, especially at the deepest hidden layer, where the prior assumes independence. Also, variational learning still requires all<br />
of the parameters to be learned together and this makes the learning time scale poorly as the number of parameters increases.<br />
<br />
I will survey the fast learning algorithm for deep belief nets proposed by Prof. Geoffrey E. Hinton.<br />
<br />
<br />
http://www.cs.toronto.edu/~hinton/<br />
<br />
http://www.cs.toronto.edu/~hinton/absps/ncfast.pdf<br />
<br />
==Project 24: Feature Extraction using SVM==<br />
===By: Ad Tayal===<br />
<br />
Explore the idea of using support vector machines for joint feature extraction and classification.<br />
<br />
==Project 25: Applying distribution matching to a reinforcement learning problem==<br />
===By: Beomjoon Kim===<br />
<br />
In this project, I will try to correct the sample selection bias in the well-known reinforcement learning problem called the "Tiger Problem", using distribution matching.<br />
<br />
<br />
==Project 26: Identifying a person in pictures==<br />
</noinclude><br />
===By: Yan(Serena) Sun, Jorge Munoz, Cyrus Wu, Baozhe Chen ===<br />
<br />
<br />
For our classification project, we are trying to use classification methods to identify a person in different pictures. The information given is: his/her hair colour, shirt colour, pants colour, and whether the person is standing up or sitting down. Based on this information, we divide the original picture into small blocks (perhaps 100×100 pixels each) and divide the given picture of the person into blocks of the same size. We then try to identify blocks with the same pattern and colours in the picture, which would locate the person we are looking for. We can compare what the algorithm finds with the person's true location to calculate the misclassification rate. <br />
The possible software and tools include: R, Matlab.<br />
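The block-matching idea above can be sketched as follows (a toy version on synthetic data; the block size, colours, and squared-distance measure are invented for illustration):<br />

```python
import numpy as np

def block_means(img, bs):
    """Mean colour of each bs x bs block of an H x W x 3 image."""
    h, w, _ = img.shape
    return img[:h - h % bs, :w - w % bs].reshape(
        h // bs, bs, w // bs, bs, 3).mean(axis=(1, 3))

def find_person(scene, template, bs=10):
    """Slide the template's block-colour grid over the scene's grid and
    return the top-left block offset with the smallest colour distance."""
    S, T = block_means(scene, bs), block_means(template, bs)
    th, tw, _ = T.shape
    best, best_pos = np.inf, (0, 0)
    for i in range(S.shape[0] - th + 1):
        for j in range(S.shape[1] - tw + 1):
            d = ((S[i:i + th, j:j + tw] - T) ** 2).sum()
            if d < best:
                best, best_pos = d, (i, j)
    return best_pos

# Synthetic scene: grey background with a "person" patch of distinctive colours.
rng = np.random.default_rng(4)
scene = np.full((100, 100, 3), 0.5) + rng.normal(0, 0.01, (100, 100, 3))
person = np.zeros((30, 20, 3))
person[:10] = [0.6, 0.3, 0.1]    # hair colour
person[10:20] = [1.0, 0.0, 0.0]  # shirt colour
person[20:] = [0.0, 0.0, 1.0]    # pants colour
scene[40:70, 50:70] = person
print(find_person(scene, person))  # block offset of the best match
```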
<br />
==Project 27: Artificial Neural Networks to predict extreme losses in Financial Time Series==<br />
===By: Justin Francis ===<br />
<br />
A common issue in financial modelling is predicting when an extreme loss is about to occur. Most time series exhibit some level of autocorrelation, meaning there should be signals in the data before an extreme loss happens. I will experiment with a few artificial neural networks and train them on historical market index returns to see how well these phenomena can be predicted.<br />
<br />
<br />
==Project 28: An application of Adapting Neural Networks for the Estimation of Treatment Effects (Shi, Blei, and Veitch 2019)==<br />
===By: Grace Tompkins and Tatiana Krikella ===<br />
<br />
We will be using the methodology presented in "Adapting Neural Networks for the Estimation of Treatment Effects" by Claudia Shi, David M. Blei, and Victor Veitch and applying it to a new dataset and simulated data. This method estimates treatment effects from observational data via an architecture called "Dragonnet", which uses propensity scoring for estimation adjustment together with targeted regularization. The method has been shown to outperform existing methods on benchmark datasets, and we will apply it to a new dataset (TBD) and simulated data to evaluate its performance for classification and prediction. <br />
<br />
We will use R for analysis.<br />
<br />
Reference: [http://papers.nips.cc/paper/8520-adapting-neural-networks-for-the-estimation-of-treatment-effects]</div>
<hr />
<div></noinclude><br />
==Project 1 : Classification of Disease Status== <br />
===By: Lai,ChunWei and Greg Pitt===<br />
For our classification project, we are proposing an application in the<br />
medical diagnosis field: For each patient or lab animal, there will<br />
be results from a large number of genetic and/or chemical tests. We<br />
should be able to predict the disease state of the patient/animal,<br />
based on the presence or absence of certain biomarkers and/or chemical<br />
markers.<br />
<br />
Our project work will include the reduction of dimensionality, and the<br />
development of one or more classification functions, with the<br />
objectives of minimizing the error rate and also reducing the number<br />
of markers required in order to make good predictions. Our results<br />
could be used at the patient level, to help make accurate diagnoses,<br />
and at the population health level, to make epidemiological surveys of<br />
the prevalence of certain medical conditions. In both cases, the<br />
results should enable the healthcare system to make better decisions<br />
regarding the deployment of scarce healthcare resources.<br />
<br />
Our methodology will be chosen soon, after we have seen a few more<br />
examples in class. If time permits, we will also attempt a novel<br />
classification procedure of our own design.<br />
<br />
Currently we have access to a dataset from the SSC data mining<br />
section, and we hope to be able to get access to some similar, but<br />
larger, datasets before the end of the term.<br />
<br />
The software tools that we use will probably include Matlab, Python, and R.<br />
<br />
We would like to obtain publishable results if possible, but this is<br />
not a primary objective.<br />
<br />
<noinclude><br />
<br />
<br />
<br />
==Proposal 2: The Golden Retrieber==<br />
<br />
===By Cameron Davidson-Pilon and Jenn Smith===<br />
<br />
The goal of this project is to determine statistical results about the population of Twitter users that have a specific celebrity in their display picture. Our algorithm will scan through Twitter display pictures and attempt to determine whether a display picture features Canada's most famous icon: '''Justin Bieber'''. We hope that most images contain his trademark swoosh hairstyle, as much of our classification will rely on such handsome features. <br />
<br />
After we determine, with some probability of error, that a user has a Bieber Display Picture (BDP), we can then do a statistical analysis of the sample population's tweets, hashtags, etc. <br />
<br />
Applications of this algorithm include analyzing the Twitter behaviour of Bieber fans; it could be used in an app for companies that want to target this demographic. <br />
<br />
We will be using Matlab and Python.<br />
<br />
==Project 3 : Classifying Melanoma Images ==<br />
</noinclude><br />
===By: Robert Amelard===<br />
Currently, the method of manually diagnosing melanoma is very subjective. Dermatologists essentially look at a skin lesion and determine from experience whether it looks malignant or benign. Popular diagnostic methods include the 7-point checklist and the ABCD rubric, both of which are based on very subjective criteria, such as the "irregularity" of a skin lesion. My project will attempt to classify an input image containing a skin lesion as "melanoma" or "not melanoma" based on features that these rubrics regard as high risk. This will help doctors reach a more quantitative, objectively justifiable diagnosis for their patients.<br />
<br />
<br />
==Project 4 : application of ANN on handwritten character recognition ==<br />
</noinclude><br />
===By: Chen Wang; YuanHong Yu; Jia Zhou===<br />
<br />
Handwritten names are one of the major means of authenticating a person's identity. Every person's handwriting is different, which makes handwritten characters difficult to recognize. Handwritten character recognition is an area of pattern recognition that has become an active subject of research. We will apply an ANN to collected images of handwritten English characters to match them against the 26 letters of the alphabet. Finally, the performance will be evaluated by running the network on test samples.<br />
<br />
The possible software and tools we would like to use include: R, Matlab.<br />
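A minimal backpropagation sketch of the approach, using synthetic stand-ins for the character images (the real project would use scanned letters; the image size, network width, and learning rate here are invented):<br />

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic stand-ins for character images: each of the 26 classes gets a
# random prototype "image" plus per-sample noise.
n_classes, n_pixels, n_per_class = 26, 64, 30
protos = rng.normal(0, 1, (n_classes, n_pixels))
X = np.vstack([p + rng.normal(0, 0.3, (n_per_class, n_pixels)) for p in protos])
y = np.repeat(np.arange(n_classes), n_per_class)
Y = np.eye(n_classes)[y]                       # one-hot letter labels

# One-hidden-layer network trained with backpropagation.
W1 = rng.normal(0, 0.1, (n_pixels, 32)); b1 = np.zeros(32)
W2 = rng.normal(0, 0.1, (32, n_classes)); b2 = np.zeros(n_classes)
lr = 0.5
for _ in range(500):
    H = np.tanh(X @ W1 + b1)                   # hidden layer
    Z = H @ W2 + b2
    P = np.exp(Z - Z.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)          # softmax output
    dZ = (P - Y) / len(X)                      # softmax + cross-entropy gradient
    dH = dZ @ W2.T * (1 - H ** 2)              # backprop through tanh
    W2 -= lr * H.T @ dZ; b2 -= lr * dZ.sum(0)
    W1 -= lr * X.T @ dH; b1 -= lr * dH.sum(0)

acc = float((P.argmax(axis=1) == y).mean())
print(acc)  # training accuracy on the synthetic letters
```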
<br />
==Project 5 : Distributed Classification and Data Fusion in Wireless Sensor Networks==<br />
</noinclude><br />
===By : Mahmoud Faraj===<br />
Wireless sensor networks (WSNs) are a recently emerging technology consisting of a large number of battery powered sensor nodes interconnected wirelessly and capable of monitoring environments, tracking targets, and performing many other critical applications. The design and deployment of such type of network are challenging tasks due to the imperfect nature of the communicated nodes (i.e., sensors) in the WSNs. The dramatic depletion of the sensor’s energy while performing the regular tasks (e.g. sensing, processing, receiving and transmitting information) constitutes a major threat of shortening the lifetime of the network. That is due to the limited amount of energy in the sensor which is constrained by the dimensions of these sensors. The lack of energy makes the lifetime of the network shorter. Also, the death of some nodes causes partitioning the network. As a result, some nodes become not able to communicate with others to accomplish the ultimate goal of the remotely deployed network.<br />
<br />
In our research work, we propose one of the techniques learned in the course to be used for performing distributed classification of a moving target (e.g. vehicle, animal, or person). Each sensor node will be able to classify the moving target and then track it in the WSN field. In order to conserve power and extend the lifetime of the network, we also propose distributed (in-network) data fusion by using Distributed Kalman Filter where the data are fused in the network instead of having all the data transmitted to the fusion center (sink). Each node processes the data from its own set of sensor, communicating with neighbouring nodes to improve the classification and location estimates of the moving target. Simulation results will be provided to demonstrate the significant advantages of using distributed classification and data fusion and also to show the improvement of the WSN as a whole.<br />
<br />
==Project 6 : Skin Classification==<br />
</noinclude><br />
===By : Jeffrey Glaister===<br />
My goal for this project is to classify segments of an oversegmented image as skin or skin lesion (a two-class problem). The overall goal and application is to automatically segment the skin lesion in pictures of patients at risk of melanoma, unsupervised. A standard segmentation algorithm will be applied to the image to oversegment it. Then, the resulting segments will have classified as normal skin or not normal skin and be merged. Of particular interest is texture and colour classification, since skin and lesions differ slightly in texture and colour. Other possible features include spatial location and initial segment size. Time permitting, novel texture and colour classifiers will be investigated. <br />
<br />
I have access to skin lesion images from a public database, some of which have been manually contoured to test the algorithm. <br />
<br />
I will be using Matlab.<br />
<br />
==Project 7 : Sentiment Analysis Controversial Topic==<br />
</noinclude><br />
===By : Samson Hu, Blair Rose, Mikhail Targonski===<br />
Classify documents whether they are for against a specific topic. <br />
<br />
The possible software and tools include but not limited to: Matlab, R.<br />
<br />
==Project 8 : Reproducing the Results of a 1985 Paper ==<br />
</noinclude><br />
===By : Mohamed El Massad===<br />
For the CM 763 final project, I intend to reproduce the results of Keller‘s 1985 IEEE paper in which the Fuzzy K-Nearest Neighbour algorithm was first introduced. To develop the algorithm, the authors of the paper introduced the theory of fuzzy-sets into the well-known K-nearest neighbour decision rule. Their motivation was to address a limitation with the said rule that it gives the training samples equal weight in deciding the class memberships of the patterns to be classified, regardless of the degree to which these samples are representative of their own classes. The authors proposed three methods for assigning initial fuzzy memberships to the samples in the training data set, and presented experimental results and comparisons to the crisp version of the algorithm showing how their proposed one outperforms it in terms of the error rate. They also show that their algorithm compares well against other more sophisticated pattern classification procedures, including the Bayes classifier and the perceptron. Finally, they develop a fuzzy analog to the nearest prototype algorithm.<br />
<br />
The authors of the paper used FORTRAN to implement their proposed algorithms, but I will probably use MATLAB to do that and maybe some C as well.<br />
<br />
==Project 9 : Learning from Weak Teachers ==<br />
</noinclude><br />
===By : Nika Haghtalab===<br />
<br />
This work addresses the problem of learning when labeled data varies in quality depending on the quality of the teacher. In many scenarios we have access to "weak" but readily available teachers, but restricted access "stronger" teachers.<br />
<br />
We attempt to extend the current framework of such learning scenarios, we attempt to introduce a hierarchy of teachers with different label qualities. We will focus on how an output classifier should be computed from labels of such group of teachers. We are also interested in deciding when it is economical to refer to an instance to a higher quality teachers in order to increase the performance of the classifier.<br />
<br />
==Project 10 : Classification of 3-dimensional objects ==<br />
</noinclude><br />
===By : Kenneth Webster, Soo Min Kang, and Hang Su===<br />
<br />
Solvability of 3-dimensional physical systems.<br />
Can computers determine whether a jumbled rubix cube can be solved?<br />
<br />
<br />
== Project 11: Learning in Robots ==<br />
</noinclude><br />
=== By: Guoting (Jane) Chang ===<br />
<br />
<br />
'''Background'''<br />
<br />
One of the long term goals in robotics is for robots (such as humanoid robots) to become useful in human environments. In order for robots to be able to perform different services within society, they will need the ability to carry out new tasks and to adapt to changing environments. This in turn requires robots to have a capacity for learning. However, existing implementations of learning in robots tend to focus on specific tasks and are not easily extended to different tasks and environments [1].<br />
<br />
'''Proposed Project'''<br />
<br />
The purpose of the proposed work is to continue developing an initial framework for a learning system that is not task or environment specific. Such a generalized learning strategy should be achievable through hierarchical knowledge abstraction and appropriate knowledge representation. At the lowest level of the hierarchy, vision techniques will be used to extract features (such as colors, contours and position information) from raw input video data. On the next level of the hierarchy,<br />
the extracted features will be combined using clustering techniques such as self-organizing maps to perform object recognition. Furthermore, in order to learn to recognize motions shown in the videos, techniques such as incremental decision trees should be investigated for performing guided clustering (i.e., clustering based on some metric). At the higher levels of the hierarchy, the sequence of motions and objects involved in the video should be represented using connectionist models such as directed graphs. <br />
<br />
The main focus of the proposed work for this project will be on the clustering of observed motions, as it is most closely related to the classification techniques that will be taught in class. An incremental decision tree is tentatively being considered for this, as the goal is to determine whether a newly observed motion belongs to a group of motions that has been seen before or whether it is a new motion and the knowledge representation should be updated to include it. Matlab or C/C++ code will most likely be used for this project.<br />
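As a rough illustration of the clustering layer described above, here is a minimal self-organizing map in Python/NumPy. This is only a toy sketch with made-up grid size, learning schedule, and data, not the project's implementation (which will most likely be in Matlab or C/C++).<br />

```python
import numpy as np

def train_som(data, grid=(4, 4), iters=500, lr0=0.5, sigma0=1.5, seed=0):
    """Minimal self-organizing map: feature vectors are clustered onto a
    small 2-D grid of prototype vectors."""
    rng = np.random.default_rng(seed)
    h, w = grid
    coords = np.array([(i, j) for i in range(h) for j in range(w)], float)
    W = rng.normal(size=(h * w, data.shape[1]))
    for t in range(iters):
        x = data[rng.integers(len(data))]
        bmu = np.argmin(((W - x) ** 2).sum(axis=1))   # best-matching unit
        frac = t / iters
        lr = lr0 * (1 - frac)                          # decaying learning rate
        sigma = sigma0 * (1 - frac) + 0.1              # shrinking neighbourhood
        g = np.exp(-((coords - coords[bmu]) ** 2).sum(axis=1) / (2 * sigma ** 2))
        W += lr * g[:, None] * (x - W)                 # pull neighbourhood toward x
    return W

# Two feature clusters should map to distinct grid regions.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0, 0.1, (50, 3)), rng.normal(2, 0.1, (50, 3))])
W = train_som(data)
bmu_a = np.argmin(((W - data[0]) ** 2).sum(axis=1))
bmu_b = np.argmin(((W - data[50]) ** 2).sum(axis=1))
```

After training, inputs from the two clusters activate different prototype units, which is the behaviour the object-recognition layer relies on.<br />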
<br />
'''Reference'''<br />
<br />
[1] A. Barto, S. Singh, and N. Chentanez, "Intrinsically motivated learning of hierarchical collections of skills," in ''Third IEEE International Conference on Development and Learning.'' San Diego, California, USA: IEEE, 2004.<br />
<br />
<br />
== Project 12: Stock price forecasting ==<br />
</noinclude><br />
=== By: Zhe (Gigi) Wang, Chi-Yin (Johnny) Chow ===<br />
<br />
'''Proposal'''<br />
<br />
Under the Efficient Market Hypothesis, stock prices are completely unpredictable: any publicly available information (the weak form of the EMH) is already reflected in the stock prices. However, an experienced trader can have a "feel" for, or prediction of, future prices, based on the history of prices or other factors.<br />
<br />
In this project, we will apply component analysis techniques to identify or recognize any significant patterns and then employ the Support Vector Machine technique as the prediction model. After applying the model to the data, we plan to evaluate the accuracy of the prediction, and compare it with other state-of-the-art techniques.<br />
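The proposed pipeline (component analysis followed by an SVM) can be sketched in Python as follows. This toy uses plain PCA and a hinge-loss linear SVM trained by sub-gradient descent on simulated data; the project itself may use kernel PCA and a full SVM solver, as in the references below.<br />

```python
import numpy as np

def pca(X, k):
    """Project X onto its top-k principal components (via SVD)."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def train_linear_svm(X, y, lam=0.01, epochs=300, lr=0.1):
    """Linear SVM fit by full-batch sub-gradient descent on the hinge loss;
    y must be in {-1, +1}."""
    w, b, n = np.zeros(X.shape[1]), 0.0, len(X)
    for _ in range(epochs):
        mask = y * (X @ w + b) < 1                 # margin violators
        w -= lr * (lam * w - (y[mask, None] * X[mask]).sum(axis=0) / n)
        b -= lr * (-y[mask].sum() / n)
    return w, b

# Simulated features: the informative direction has the largest variance,
# so it survives the PCA step.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 0] *= 3.0
y = np.where(X[:, 0] > 0, 1, -1)                   # toy "up"/"down" label
Z = pca(X, 2)
w, b = train_linear_svm(Z, y)
acc = np.mean(np.sign(Z @ w + b) == y)
```

The point of the two-stage design is that the SVM only ever sees the low-dimensional component scores, not the raw features.<br />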
<br />
'''References'''<br />
<br />
[1] Ince, H., Trafalis, T.B., "Kernel principal component analysis and support vector machines for stock price prediction", Neural Networks, 2004. Proceedings. 2004 IEEE International Joint Conference.<br />
<br />
[2] Chi-Jie Lu, Jui-Yu Wu, Cheng-Ruei Fan, Chih-Chou Chiu, "Forecasting stock price using nonlinear independent component analysis and support vector regression", Industrial Engineering and Engineering Management, 2009. IEEM 2009. IEEE International Conference.<br />
<br />
==Project 13: UFO Sightings ==<br />
</noinclude><br />
===By Vishnu Pothugunta===<br />
<br />
There have been many UFO sightings in the past decade. The goal is to use classification methods to predict where and when UFO sightings could happen. From past data on UFO sightings, we can also try to predict the shape of the UFO and the duration of the sighting.<br />
<br />
==Project 14: Identifying Accounting Fraud Using Statistical Learning ==<br />
===By Daniel Severn===<br />
<br />
'''Proposal'''<br />
<br />
By constructing a data set of key financial ratios from financial statements, I hope to create a statistical classifier that can accurately identify companies that engage in accounting fraud. The following paper has a similar goal: http://www.waset.org/journals/ijims/v3/v3-2-13.pdf. I would like to use methods from class, but perhaps also the C4.5 method, since it provided a greatly superior classifier in the linked paper.<br />
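The core of a C4.5-style tree is an information-gain split on the financial ratios; a minimal one-level sketch is below. The ratio values are made up, since no data set has been constructed yet.<br />

```python
import numpy as np

def entropy(y):
    """Shannon entropy of an integer label vector."""
    p = np.bincount(y) / len(y)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def best_split(X, y):
    """One level of a C4.5-style tree: pick the ratio (column) and threshold
    with the highest information gain."""
    base, best = entropy(y), (None, None, -1.0)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            if left.all() or not left.any():
                continue
            gain = base - (left.mean() * entropy(y[left])
                           + (~left).mean() * entropy(y[~left]))
            if gain > best[2]:
                best = (j, t, gain)
    return best

# Toy ratios: a hypothetical "accruals ratio" (column 1) separates fraud cases.
X = np.array([[0.2, 0.9], [0.3, 0.8], [0.1, 0.1], [0.4, 0.2]])
y = np.array([1, 1, 0, 0])   # 1 = fraud, 0 = clean
feat, thr, gain = best_split(X, y)
```

A full tree recurses on the two sides of the chosen split; C4.5 additionally normalizes by split information and prunes, which this sketch omits.<br />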
<br />
'''Relevance'''<br />
<br />
This is of obvious relevance to securities commissions, but it is also useful to any investor. By adding such a classifier to their usual methods, investors can identify suspect companies and profit from the high probability of a stock decline when the fraud is uncovered. This in turn would reduce the stock price of companies that engage in heavy management and manipulation of their financial statements, reducing or removing the incentive to manage or manipulate financial statements, which benefits the entire financial system and thus the economy.<br />
<br />
'''Challenges'''<br />
<br />
This analysis requires a quality data set. Finding/creating such a data set may be challenging.<br />
<br />
<br />
==Project 15 : A survey on artificial neural networks (ANN)==<br />
</noinclude><br />
===By: Hojat Abdolanezhad & Carolyn Augusta===<br />
<br />
A brief history and overview of ANNs; an explanation of common terms used in ANNs (perceptron, back propagation, semantic net, etc.)<br />
and the general philosophy of neural networks; types of ANNs researched in the past, leading into the present. We will mention<br />
new trends in this area. An application of neural networks to classification will also be discussed.<br />
<br />
'''A note'''<br />
Artificial neural networks (ANNs) have been very useful for solving real-world problems. In economics, ANNs can be applied to predict profit, market<br />
trends, and price levels based on historical market data. In industry, engineers can apply ANNs to many nonlinear engineering problems such as classification, prediction, and pattern recognition, where the tasks are very difficult to solve using conventional mathematical tools.<br />
<br />
'''Useful papers:'''<br />
<br />
H. White, “Learning in artificial neural networks: A statistical<br />
perspective,” Neural Comput., vol. 1, pp. 425–464, 1989.<br />
<br />
E. Wan, “Neural network classification: A Bayesian interpretation,”<br />
IEEE Trans. Neural Networks, vol. 1, no. 4, pp. 303–305, 1990.<br />
<br />
P.G. Zhang, "Neural Networks for Classification: A Survey," IEEE<br />
Trans. Systems, Man, and Cybernetics, vol. 30, no. 4, pp. 451-462, 2000.<br />
<br />
Christopher Bishop, Neural Networks for Pattern Recognition, Oxford University Press, London, UK, 1995.<br />
<br />
Ripley, B. D. (1994a) Neural networks and related methods for<br />
classification (with discussion). Journal of the Royal Statistical<br />
Society series B 56, 409–456.<br />
<br />
==Project 16 : Feature extraction in news article clustering==<br />
===By Gobaan Raveendran & Daniel Nicoara ===<br />
Our project will focus on crawling the internet for news articles from many different sources and then classifying them into sets of either left- or right-wing sources. The project's focus will be on feature extraction, and on determining which features are important for classification. <br />
<br />
For supervised data, we will either automatically assign classes based on domain and see if the article fits in the predicted domain, or we will use an external system such as a topic model.<br />
<br />
==Project 17: Classification of harp seals==<br />
===By Zhikang Huang, Haoyang Fu, Mengfei Yang===<br />
Seals possess varied repertoires of underwater vocalisations. Geographic variation in call types has been reported for the Weddell and bearded seal species, and the variation has been attributed to the isolation of breeding populations within these species.<br />
<br />
Our project will focus on harp seals (Phoca groenlandica), and in particular the herds from Jan Mayen Island, the Gulf of St. Lawrence, and the Front. Our goal is to classify harp seals using data obtained from underwater recordings of harp seals in these three herds. Nine hundred calls from each of the three herds will form our training set, and we will use three hundred calls as our test set to evaluate our predictive model. <br />
<br />
We will use tree models or logistic regression models as our predictive model for the classification, and we will select the model with the smallest error rate. Our model will include the following seven variables:<br />
<br />
'''ELEMDUR''' - this is the duration of a single element of a harp seal underwater vocalisation. It is measured in milliseconds.<br />
<br />
'''INTERDUR''' - this is the time between elements in multiple element calls. It is measured in milliseconds. Note that not all calls have multiple elements so this variable is absent in single element calls. Where absent, a value of NA is recorded in the data.<br />
<br />
'''NO_ELEM''' - this is the number of elements of the call. In harp seals all of the elements within a single call are similar and the spacing between them is constant.<br />
<br />
'''STARTFREQ''' - this is the pitch at the start of the call or the highest pitch if the call has an extremely short duration (call shape 0 below).<br />
<br />
'''ENDFRE''' - this is the pitch at the end of the call or the lowest pitch if the call has an extremely short duration (call shape 0).<br />
<br />
'''WAVEFORM''' - this codes a series of waveform shapes (a plot of amplitude vs time) which lie more or less along a continuum. The waveform shapes <br />
are: <br />
frequency modulated sinusoidal: 9<br />
slight frequency modulated and complex: 8<br />
sinusoidal (pure tone): 7<br />
complex (irregular waveform): 5<br />
amplitude pulses: 4<br />
burst pulses: 3<br />
knock (short burst pulse): 2<br />
click (very short duration): 1<br />
<br />
'''CALLSHAP''' - this codes a series of call shapes as they would appear in a sonogram spectral analysis (a plot of frequency vs time). <br />
<br />
'''HERD''' - this is the herd from which the recordings were obtained. The herds are coded as follows:<br />
Jan Mayen Island: 1<br />
Gulf of St. Lawrence: 2<br />
Front: 3<br />
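Before fitting the tree or logistic regression models, the NA values in INTERDUR (absent for single-element calls) need to be handled. One simple sketch is a fill-plus-indicator encoding; the two call records below are hypothetical, not from the Terhune data.<br />

```python
import numpy as np

# Hypothetical call records, columns in the order (ELEMDUR, INTERDUR, NO_ELEM,
# STARTFREQ, ENDFRE, WAVEFORM, CALLSHAP).  INTERDUR is NA (np.nan) for
# single-element calls.
def prepare(X):
    """Replace NA in INTERDUR (column 1) with 0 and append an indicator
    column, so single-element calls remain distinguishable from calls with
    a genuinely small inter-element gap."""
    X = np.asarray(X, float)
    na = np.isnan(X[:, 1])
    X[na, 1] = 0.0
    return np.column_stack([X, na.astype(float)])

calls = [[220, np.nan, 1, 400, 300, 7, 0],
         [180, 50, 3, 600, 500, 9, 2]]
Z = prepare(calls)
```

Tree models can also handle the missingness via surrogate splits, but the indicator encoding works for both model families we are considering.<br />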
<br />
'''Reference'''<br />
<br />
Terhune, J.M. (1994) Geographical variation of harp seal underwater vocalisations, Can. J. Zoology 72(5) 892-897.<br />
<br />
Statistical Society of Canada, http://www.ssc.ca/en/education/archived-case-studies/seal-vocalisations.<br />
<br />
==Project 18 : Classifying Vehicle License Plates ==<br />
===By : Jun Kai Shan, Su Rong You===<br />
Our group will focus on using statistical methods to classify Canadian vehicle license plates. We will use MATLAB to classify the letters, digits, and province of the license plates. We will take pictures of the license plates of different cars and use the pictures as our data, and we will use R for further statistical analysis.<br />
<br />
<br />
==Project 19 : Ice/No-Ice Classification==<br />
===By : Steven Leigh===<br />
Modern satellites collect massive amounts of earth imagery, far more than human interpreters can process manually. This project will attempt to tackle the problem of identifying ice and open water automatically from satellite imagery. Multimodal data will be considered, such as multipolarization SAR data, optical data, and thematic data, to name a few.<br />
<br />
==Project 20: A survey on Support Vector Machine==<br />
===By Monsef Tahir===<br />
The support vector machine (SVM) is a training algorithm for learning classification and regression rules from data; for example, the SVM can be used to learn polynomial, radial basis function (RBF), and multi-layer perceptron (MLP) classifiers. SVMs are based on the structural risk minimisation principle, which is closely related to regularisation theory. This principle incorporates capacity control to prevent over-fitting and is thus a partial solution to the bias-variance trade-off dilemma.<br />
<br />
In this project, a survey of SVMs will be conducted, and a comparison with other tools in terms of classification and prediction will be performed as well.<br />
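The kernels named above (polynomial, RBF, and the sigmoid or "MLP" kernel) can be written down directly; a small Python sketch with illustrative hyperparameters follows.<br />

```python
import numpy as np

# The three kernels mentioned above, each a function of two sample vectors.
def poly_kernel(x, z, degree=3, c=1.0):
    """Polynomial kernel (x.z + c)^degree."""
    return (x @ z + c) ** degree

def rbf_kernel(x, z, gamma=0.5):
    """Radial basis function (Gaussian) kernel."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def mlp_kernel(x, z, kappa=1.0, theta=-1.0):
    """Sigmoid ("MLP") kernel; note it is only conditionally positive
    definite for some (kappa, theta)."""
    return np.tanh(kappa * (x @ z) + theta)

x, z = np.ones(3), np.ones(3)
```

Swapping the kernel is the only change needed to move an SVM between these hypothesis classes, which is one reason the survey will organize methods around the kernel choice.<br />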
<br />
<br />
==Project 21: Good/Bad-day Classification==<br />
===By Carl J. Wensater===<br />
The proposed project is to gather and parameterize, over the course of two weeks, data about daily activities such as workouts, food intake, hours of sleep, amount of spare time, etc. This data will be used as training data for a two-class classifier that will try to distinguish good days from bad days. If the classification is successful, statistical analysis can be used to identify the crucial components of a good day.<br />
<br />
<br />
==Project 22: SVM-Based Classification of Peer-to-Peer Internet Traffic==<br />
===By: Talieh Seyed Tabatabaei===<br />
<br />
In recent years, Peer-to-Peer (P2P) file-exchange applications have overtaken Web applications as the major contributor of traffic on the Internet. Recent estimates put the volume of P2P traffic at 70% of the total broadband traffic. P2P is often used for illegally sharing copyrighted music, video, games, and software. The legal ramification of this traffic combined with its aggressive use of network resources has necessitated a strong need for identification of network traffic by application type. This task, referred to as traffic classification, is a pre-requisite to many network management and traffic engineering problems.<br />
<br />
In this project, least-squares support vector machines (LS-SVM) will be adopted to identify P2P traffic using flow-based statistical features.<br />
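A practical appeal of LS-SVM is that training reduces to solving a single linear system instead of a quadratic program. The toy sketch below uses the function-estimation form with ±1 targets and an RBF kernel; the "flow features" are made up for illustration.<br />

```python
import numpy as np

def lssvm_train(X, y, gamma=10.0, sig2=1.0):
    """LS-SVM with an RBF kernel: the dual problem is one (n+1)x(n+1)
    linear system.  y must be in {-1, +1}."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / sig2)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0                      # constraint: dual weights sum to zero
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.eye(n) / gamma   # ridge term replaces slack variables
    rhs = np.concatenate([[0.0], y.astype(float)])
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]              # bias b, dual weights alpha

def lssvm_predict(X, alpha, b, Xq, sig2=1.0):
    d2 = ((Xq[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.sign(np.exp(-d2 / sig2) @ alpha + b)

# Toy "flow features": class +1 short flows, class -1 long flows.
X = np.array([[0.0, 0.1], [0.2, 0.0], [2.0, 2.1], [2.2, 1.9]])
y = np.array([1, 1, -1, -1])
b, alpha = lssvm_train(X, y)
pred = lssvm_predict(X, alpha, b, X)
```

The trade-off is that every training point gets a nonzero dual weight, so LS-SVM gives up the sparsity of the standard SVM in exchange for the simpler solve.<br />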
<br />
==Project 23: A survey on a fast learning algorithm on deep belief nets==<br />
===By: Seyed Seifi===<br />
<br />
Learning is difficult in densely connected, directed belief nets that have many hidden layers because it is difficult to infer the conditional distribution<br />
of the hidden activities when given a data vector. Variational methods use simple approximations to the true conditional distribution, but the approximations may be poor, especially at the deepest hidden layer, where the prior assumes independence. Also, variational learning still requires all<br />
of the parameters to be learned together and this makes the learning time scale poorly as the number of parameters increases.<br />
<br />
I will survey the novel learning algorithm for deep belief nets proposed by Prof. Geoffrey E. Hinton.<br />
<br />
<br />
http://www.cs.toronto.edu/~hinton/<br />
<br />
http://www.cs.toronto.edu/~hinton/absps/ncfast.pdf<br />
<br />
==Project 24: Feature Extraction using SVM==<br />
===By: Ad Tayal===<br />
<br />
Explore the idea of using support vector machines for joint feature extraction and classification.<br />
<br />
==Project 25: Applying distribution matching to a reinforcement learning problem==<br />
===By: Beomjoon Kim===<br />
<br />
In this project, I will try to correct the sample selection bias in the well-known reinforcement learning problem known as the "Tiger Problem", using distribution matching.<br />
<br />
<br />
==Project 26: Identifying a person in pictures==<br />
</noinclude><br />
===By: Yan(Serena) Sun, Jorge Munoz, Cyrus Wu, Baozhe Chen ===<br />
<br />
<br />
For our classification project, we are trying to use classification methods to identify a person in different pictures. The information given is: his/her hair colour, shirt colour, pants colour, and whether the person is standing up or sitting down. Based on this information, we divide the original picture into small blocks (perhaps 100*100 pixels each) and divide the given picture of the person into blocks of the same size. We then try to identify the blocks with the same pattern and colours in the picture, which should correspond to the person we are looking for. We can compare what the algorithm finds with the real person to calculate the misclassification rate. <br />
The possible software and tools include R and Matlab.<br />
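The block-matching idea described above can be sketched as follows. This is a toy in Python/NumPy with 10x10 blocks and a synthetic scene; the real project would compare hair, shirt, and pants colours separately rather than a single average colour.<br />

```python
import numpy as np

def block_features(img, bs=10):
    """Mean colour of each bs x bs block of an H x W x 3 image."""
    h, w, _ = img.shape
    blocks = img[:h - h % bs, :w - w % bs].reshape(h // bs, bs, w // bs, bs, 3)
    return blocks.mean(axis=(1, 3))

def find_person(scene, template, bs=10):
    """Return the block coordinate in the scene whose mean colour best
    matches the template's average colour."""
    sf = block_features(scene, bs)
    tf = block_features(template, bs).mean(axis=(0, 1))
    d = ((sf - tf) ** 2).sum(axis=-1)
    return np.unravel_index(np.argmin(d), d.shape)

# Toy scene: grey background with a red "person" patch at blocks (2-3, 4-5).
scene = np.full((60, 80, 3), 0.5)
scene[20:40, 40:60] = [1.0, 0.0, 0.0]
template = np.full((20, 20, 3), [1.0, 0.0, 0.0])
loc = find_person(scene, template)
```

The best-matching block lands inside the red patch, i.e. at block row 2, column 4 (its top-left corner).<br />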
<br />
==Project 27: Artificial Neural Networks to predict extreme losses in Financial Time Series==<br />
===By: Justin Francis ===<br />
<br />
A common issue in financial modelling is predicting when an extreme loss is about to occur. Most time series exhibit some level of autocorrelation, meaning there should be signals in the data before an extreme loss happens. I will experiment with a few artificial neural networks, training them on historical market index returns to see how well this phenomenon can be predicted.<br />
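Before any network can be trained, the return series has to be cast as supervised (lagged window, extreme-loss-next-day) pairs; a small Python sketch with an arbitrary 5-day window and a hypothetical -3% loss threshold is below.<br />

```python
import numpy as np

def make_windows(returns, lag=5, thresh=-0.03):
    """Turn a return series into training pairs for a network:
    X[t] holds the previous `lag` returns, and y[t] = 1 if the next
    return falls below `thresh` (an extreme loss)."""
    returns = np.asarray(returns, float)
    X = np.array([returns[t - lag:t] for t in range(lag, len(returns))])
    y = (returns[lag:] < thresh).astype(int)
    return X, y

# Simulated daily returns stand in for a market index series.
rng = np.random.default_rng(0)
r = rng.normal(0, 0.01, 300)
X, y = make_windows(r)
```

Because extreme losses are rare, the resulting labels are heavily imbalanced, which is something the network training (class weights or resampling) will have to account for.<br />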
<br />
<br />
==Project 28: An application of Adapting Neural Networks for the Estimation of Treatment Effects (Shi, Blei, and Veitch 2019)==<br />
===By: Grace Tompkins and Tatiana Krikella ===<br />
<br />
We will be using the methodology presented in "Adapting Neural Networks for the Estimation of Treatment Effects" by Claudia Shi, David M. Blei, and Victor Veitch, applying it to a new dataset and to simulated data. This method estimates treatment effects from observational data via an architecture called "Dragonnet", which uses propensity scoring for estimation adjustment along with targeted regularization. The method has been shown to outperform existing methods on benchmark datasets, and we will apply it to a new dataset (TBD) and simulated data to evaluate its performance for classification and prediction. <br />
<br />
We will use R for analysis.</div>Gtompkinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=f11Stat841proposal&diff=42770f11Stat841proposal2020-10-11T18:11:48Z<p>Gtompkin: </p>
<hr />
<div></noinclude><br />
==Project 1 : Classification of Disease Status== <br />
===By: Lai,ChunWei and Greg Pitt===<br />
For our classification project, we are proposing an application in the<br />
medical diagnosis field: For each patient or lab animal, there will<br />
be results from a large number of genetic and/or chemical tests. We<br />
should be able to predict the disease state of the patient/animal,<br />
based on the presence or absence of certain biomarkers and/or chemical<br />
markers.<br />
<br />
Our project work will include the reduction of dimensionality, and the<br />
development of one or more classification functions, with the<br />
objectives of minimizing the error rate and also reducing the number<br />
of markers required in order to make good predictions. Our results<br />
could be used at the patient level, to help make accurate diagnoses,<br />
and at the population health level, to make epidemiological surveys of<br />
the prevalence of certain medical conditions. In both cases, the<br />
results should enable the healthcare system to make better decisions<br />
regarding the deployment of scarce healthcare resources.<br />
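One simple way to pursue both objectives (few markers, low error) is to screen markers by a separation score and classify using only the top-ranked ones. The Python sketch below runs on simulated data; our actual methodology is still to be chosen, so this is purely illustrative.<br />

```python
import numpy as np

def rank_markers(X, y):
    """Rank markers by a simple two-class separation score:
    |mean difference| / overall std (a t-statistic-like screen)."""
    m0, m1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    s = X.std(axis=0) + 1e-12
    return np.argsort(-np.abs(m1 - m0) / s)

def nearest_centroid(X, y, Xq, markers):
    """Classify using the selected markers only, by nearest class centroid."""
    c0 = X[y == 0][:, markers].mean(axis=0)
    c1 = X[y == 1][:, markers].mean(axis=0)
    d0 = ((Xq[:, markers] - c0) ** 2).sum(axis=1)
    d1 = ((Xq[:, markers] - c1) ** 2).sum(axis=1)
    return (d1 < d0).astype(int)

# Simulated screen: 20 markers, only marker 7 differs between disease states.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = (rng.random(100) < 0.5).astype(int)
X[y == 1, 7] += 3.0
order = rank_markers(X, y)
pred = nearest_centroid(X, y, X, order[:1])
acc = (pred == y).mean()
```

Even with a single well-chosen marker the toy classifier does well, which is the behaviour we hope to exploit to keep the required test panel small.<br />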
<br />
Our methodology will be chosen soon, after we have seen a few more<br />
examples in class. If time permits, we will also attempt a novel<br />
classification procedure of our own design.<br />
<br />
Currently we have access to a dataset from the SSC data mining<br />
section, and we hope to be able to get access to some similar, but<br />
larger, datasets before the end of the term.<br />
<br />
The software tools that we use will probably include Matlab, Python, and R.<br />
<br />
We would like to obtain publishable results if possible, but this is<br />
not a primary objective.<br />
<br />
<noinclude><br />
<br />
<br />
<br />
==Proposal 2: The Golden Retrieber==<br />
<br />
===By Cameron Davidson-Pilon and Jenn Smith===<br />
<br />
Our goal for this project is to determine statistical results about the population of Twitter users that have a specific celebrity in their display picture. Our algorithm will scan through Twitter display pictures and attempt to determine whether a display picture features Canada's most famous icon: '''Justin Bieber'''. We hope that most images contain his trademark swoosh hairstyle, as much of our classification will rely on such handsome features. <br />
<br />
After we determine, with some probability of error, that a user has a Bieber Display Picture (BDP), we can then do a statistical analysis of the sample population's tweets, hashtags, etc. <br />
<br />
Applications of this algorithm include characterizing the Twitter behaviour of Bieber fans. It could be used in an app for companies that want to target this demographic. <br />
<br />
We will be using Matlab and Python.<br />
<br />
==Project 3 : Classifying Melanoma Images ==<br />
</noinclude><br />
===By: Robert Amelard===<br />
Currently, the method of manually diagnosing melanoma is very subjective: dermatologists essentially look at a skin lesion and determine from experience whether it looks malignant or benign. Two popular methods for diagnosis are the 7-point checklist and the ABCD rubric; both are based on very subjective criteria, such as the "irregularity" of a skin lesion. My project will attempt to classify an input image containing a skin lesion as "melanoma" or "not melanoma" based on features that these rubrics regard as high risk. This will help doctors come to a more quantitative, objectively justifiable diagnosis of patients.<br />
<br />
<br />
==Project 4 : application of ANN on handwritten character recognition ==<br />
</noinclude><br />
===By: Chen Wang; YuanHong Yu; Jia Zhou===<br />
<br />
Hand-printed names are considered one of the major approaches to authenticating a person's identity. Handwriting differs from person to person, which makes handwritten characters difficult to recognize; handwritten character recognition is thus an active area of pattern recognition research. We apply an ANN to collected images of handwritten English characters to match them to the 26 letters of the alphabet. Finally, the performance is evaluated by running the network on test samples.<br />
<br />
The possible software and tools we would like to use include: R, Matlab.<br />
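The forward pass of the kind of network we have in mind (one hidden layer, 26-way softmax output) looks as follows. The image size, hidden-layer width, and random weights below are placeholders, and training (back propagation) is omitted from this sketch.<br />

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """One hidden layer with tanh units, softmax output over the 26 letters."""
    h = np.tanh(x @ W1 + b1)
    logits = h @ W2 + b2
    e = np.exp(logits - logits.max())       # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
n_pix, n_hidden = 16 * 16, 32               # e.g. 16x16 binarized character images
W1 = rng.normal(0, 0.1, (n_pix, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(0, 0.1, (n_hidden, 26))
b2 = np.zeros(26)
p = mlp_forward(rng.random(n_pix), W1, b1, W2, b2)
```

The output is a probability vector over the 26 letters; back propagation would adjust W1, b1, W2, b2 to maximize the probability of the correct letter on the training images.<br />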
<br />
==Project 5 : Distributed Classification and Data Fusion in Wireless Sensor Networks==<br />
</noinclude><br />
===By : Mahmoud Faraj===<br />
Wireless sensor networks (WSNs) are a recently emerging technology consisting of a large number of battery-powered sensor nodes interconnected wirelessly and capable of monitoring environments, tracking targets, and performing many other critical applications. The design and deployment of this type of network are challenging tasks due to the imperfect nature of the communicating nodes (i.e., sensors) in the WSN. The dramatic depletion of a sensor's energy while performing its regular tasks (e.g. sensing, processing, receiving, and transmitting information) is a major threat that shortens the lifetime of the network, because the amount of energy in each sensor is constrained by its small dimensions. Also, the death of some nodes can partition the network; as a result, some nodes become unable to communicate with others to accomplish the ultimate goal of the remotely deployed network.<br />
<br />
In our research work, we propose using one of the techniques learned in the course to perform distributed classification of a moving target (e.g. a vehicle, animal, or person). Each sensor node will be able to classify the moving target and then track it in the WSN field. In order to conserve power and extend the lifetime of the network, we also propose distributed (in-network) data fusion using a Distributed Kalman Filter, where the data are fused in the network instead of having all the data transmitted to the fusion center (sink). Each node processes the data from its own set of sensors, communicating with neighbouring nodes to improve the classification and location estimates of the moving target. Simulation results will be provided to demonstrate the significant advantages of distributed classification and data fusion, and to show the improvement to the WSN as a whole.<br />
<br />
==Project 6 : Skin Classification==<br />
</noinclude><br />
===By : Jeffrey Glaister===<br />
My goal for this project is to classify segments of an oversegmented image as skin or skin lesion (a two-class problem). The overall goal and application is to automatically segment the skin lesion, unsupervised, in pictures of patients at risk of melanoma. A standard segmentation algorithm will be applied to oversegment the image. The resulting segments will then be classified as normal skin or not, and merged. Of particular interest is texture and colour classification, since skin and lesions differ slightly in texture and colour. Other possible features include spatial location and initial segment size. Time permitting, novel texture and colour classifiers will be investigated. <br />
<br />
I have access to skin lesion images from a public database, some of which have been manually contoured to test the algorithm. <br />
<br />
I will be using Matlab.<br />
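The colour and texture features mentioned above can be computed per segment once the oversegmentation labels are available. The sketch below (in Python for illustration, though the project will use Matlab) uses mean RGB plus per-channel standard deviation as a crude texture proxy; the segment map and colours are synthetic.<br />

```python
import numpy as np

def segment_features(img, labels, seg_id):
    """Colour and texture features for one segment of an oversegmented image:
    mean RGB plus per-channel standard deviation (a crude texture proxy)."""
    pix = img[labels == seg_id]          # all pixels belonging to this segment
    return np.concatenate([pix.mean(axis=0), pix.std(axis=0)])

# Toy image: segment 0 is uniform "skin", segment 1 is a darker, noisier lesion.
rng = np.random.default_rng(0)
img = np.empty((20, 20, 3))
labels = np.zeros((20, 20), int)
labels[5:15, 5:15] = 1
img[labels == 0] = [0.8, 0.6, 0.5] + rng.normal(0, 0.01, (300, 3))
img[labels == 1] = [0.4, 0.25, 0.2] + rng.normal(0, 0.05, (100, 3))
f_skin = segment_features(img, labels, 0)
f_lesion = segment_features(img, labels, 1)
```

A two-class classifier trained on such feature vectors then decides which segments to merge into the lesion region.<br />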
<br />
==Project 7 : Sentiment Analysis Controversial Topic==<br />
</noinclude><br />
===By : Samson Hu, Blair Rose, Mikhail Targonski===<br />
Classify documents according to whether they are for or against a specific topic. <br />
<br />
The possible software and tools include, but are not limited to, Matlab and R.<br />
<br />
==Project 8 : Reproducing the Results of a 1985 Paper ==<br />
</noinclude><br />
===By : Mohamed El Massad===<br />
For the CM 763 final project, I intend to reproduce the results of Keller‘s 1985 IEEE paper in which the Fuzzy K-Nearest Neighbour algorithm was first introduced. To develop the algorithm, the authors of the paper introduced the theory of fuzzy-sets into the well-known K-nearest neighbour decision rule. Their motivation was to address a limitation with the said rule that it gives the training samples equal weight in deciding the class memberships of the patterns to be classified, regardless of the degree to which these samples are representative of their own classes. The authors proposed three methods for assigning initial fuzzy memberships to the samples in the training data set, and presented experimental results and comparisons to the crisp version of the algorithm showing how their proposed one outperforms it in terms of the error rate. They also show that their algorithm compares well against other more sophisticated pattern classification procedures, including the Bayes classifier and the perceptron. Finally, they develop a fuzzy analog to the nearest prototype algorithm.<br />
<br />
The authors of the paper used FORTRAN to implement their proposed algorithms, but I will probably use MATLAB to do that and maybe some C as well.<br />
<br />
==Project 9 : Learning from Weak Teachers ==<br />
</noinclude><br />
===By : Nika Haghtalab===<br />
<br />
This work addresses the problem of learning when labeled data varies in quality depending on the quality of the teacher. In many scenarios we have access to "weak" but readily available teachers, but restricted access "stronger" teachers.<br />
<br />
We attempt to extend the current framework of such learning scenarios, we attempt to introduce a hierarchy of teachers with different label qualities. We will focus on how an output classifier should be computed from labels of such group of teachers. We are also interested in deciding when it is economical to refer to an instance to a higher quality teachers in order to increase the performance of the classifier.<br />
<br />
==Project 10 : Classification of 3-dimensional objects ==<br />
</noinclude><br />
===By : Kenneth Webster, Soo Min Kang, and Hang Su===<br />
<br />
Solvability of 3-dimensional physical systems.<br />
Can computers determine whether a jumbled rubix cube can be solved?<br />
<br />
<br />
== Project 11: Learning in Robots ==<br />
</noinclude><br />
=== By: Guoting (Jane) Chang ===<br />
<br />
<br />
'''Background'''<br />
<br />
One of the long term goals in robotics is for robots (such as humanoid robots) to become useful in human environments. In order for robots to be able to perform different services within society, they will need the ability to carry out new tasks and to adapt to changing environments. This in turn requires robots to have a capacity for learning. However, existing implementations of learning in robots tend to focus on specific tasks and are not easily extended to different tasks and environments [1].<br />
<br />
'''Proposed Project'''<br />
<br />
The purpose of the proposed work is to continue developing an initial framework for a learning system that is not task or environment specific. Such a generalized learning strategy should be achievable through hierarchical knowledge abstraction and appropriate knowledge representation. At the lowest level of the hierarchy, vision techniques will be used to extract features (such as colors, contours and position information) from raw input video data. On the next level of the hierarchy,<br />
the extracted features will be combined using clustering techniques such as self-organizing maps to perform object recognition. Furthermore, in order to learn to recognize motions shown in the videos, techniques such as incremental decision trees should be investigated for performing guided clustering (i.e., clustering based on some metric). At the higher levels of the hierarchy, the sequence of motions and objects involved in the video should be represented using connectionist models such as directed graphs. <br />
<br />
The main focus of the proposed work for this project will be on the clustering of observed motions, as it is most closely related to the classification techniques that will be taught in class. An incremental decision tree is tentatively being considered for this, as the goal is to determine whether a newly observed motion belongs to a group of motions that has been seen before or whether it is a new motion and the knowledge representation should be updated to include it. Matlab or C/C++ code will most likely be used for this project.<br />
<br />
'''Reference'''<br />
<br />
[1] A. Barto, S. Singh, and N. Chentanez, "Intrinsically motivated learning of hierarchical collections of skills," in ''Third IEEE International Conference on Development and Learning.'' San Diego, California, USA: IEEE, 2004.<br />
<br />
<br />
== Project 12: Stock price forecasting ==<br />
</noinclude><br />
=== By: Zhe (Gigi) Wang, Chi-Yin (Johnny) Chow ===<br />
<br />
'''Proposal'''<br />
<br />
Under Efficient Market Hypothesis, stock prices are completely unpredictable and any information that are publicly available (weak-form of EMH) is already reflected in the stock prices. However, an experienced trader can have a "feel" or prediction of the prices in the future, based on the history of prices or other factors.<br />
<br />
In this project, we will apply component analysis techniques to identify or recognize any significant patterns and then employ the Support Vector Machine technique as the prediction model. After applying the model to the data, we plan to evaluate the accuracy of the prediction, and compare it with other state-of-the-art techniques.<br />
<br />
'''References'''<br />
<br />
[1] Ince, H., Trafalis, T.B., "Kernel principal component analysis and support vector machines for stock price prediction", Neural Networks, 2004. Proceedings. 2004 IEEE International Joint Conference.<br />
<br />
[2] Chi-Jie Lu, Jui-Yu Wu, Cheng-Ruei Fan ; Chih-Chou Chiu, "Forecasting stock price using Nonlinear independent component analysis and support vector regression", Industrial Engineering and Engineering Management, 2009. IEEM 2009. IEEE International Conference.<br />
<br />
==Project 13: UFO Sightings ==<br />
</noinclude><br />
===By Vishnu Pothugunta===<br />
<br />
There have been a lot of UFO sightings in the past decade. The goal is to use classification methods and predict where and when could the UFO sightings happen. From the past data about the ufo sightings, we can also try to predict the shape of the UFO and the duration of the sighting.<br />
<br />
==Project 14: Identifying Accounting Fraud Using Statistical Learning ==<br />
===By Daniel Severn===<br />
<br />
'''Proposal'''<br />
<br />
By constructing a data set of key financial ratios drawn from financial statements, I hope to build a statistical classifier that can accurately identify companies that engage in accounting fraud. The following paper has a similar goal: http://www.waset.org/journals/ijims/v3/v3-2-13.pdf. I would like to use methods from class, but perhaps also the C4.5 method, since in the linked paper it produced a greatly superior classifier.<br />
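As a toy sketch of the intended classifier (the ratios and labels below are invented, and a single decision stump stands in for full C4.5), a threshold split on one financial ratio might look like:<br />

```python
# Hypothetical (ratio, label) pairs: e.g. an accruals-to-assets ratio,
# with 1 = fraudulent filing, 0 = clean. All values are invented.
data = [(0.02, 0), (0.03, 0), (0.05, 0), (0.08, 0),
        (0.12, 1), (0.15, 1), (0.18, 1), (0.25, 1)]

def stump_error(threshold):
    """Count misclassifications when predicting fraud for ratio > threshold."""
    return sum((ratio > threshold) != bool(label) for ratio, label in data)

# Candidate thresholds are midpoints between consecutive sorted ratios;
# pick the one with the fewest training errors (a one-level split,
# in the spirit of a single C4.5 node).
ratios = sorted(r for r, _ in data)
candidates = [(a + b) / 2 for a, b in zip(ratios, ratios[1:])]
best = min(candidates, key=stump_error)
```

Full C4.5 would instead grow a tree of such splits using an information-gain criterion and then prune it.<br />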
<br />
'''Relevance'''<br />
<br />
This is of obvious relevance to securities commissions, but it is also useful to any investor. By adding such a classifier to their usual methods, investors can identify suspect companies and profit from the high probability of a stock decline when the fraud is uncovered. This in turn would depress the stock price of companies that heavily manage and manipulate their financial statements, reducing or removing the incentive to do so, which benefits the entire financial system and thus the economy.<br />
<br />
'''Challenges'''<br />
<br />
This analysis requires a quality data set. Finding/creating such a data set may be challenging.<br />
<br />
<br />
==Project 15 : A survey on artificial neural networks (ANN)==<br />
</noinclude><br />
===By: Hojat Abdolanezhad & Carolyn Augusta===<br />
<br />
A brief history and description of ANNs; an explanation of common terms used in the ANN literature (perceptron, back-propagation, semantic net, etc.)<br />
and the general philosophy of neural networks; and types of ANNs researched in the past, leading into the present. We will mention <br />
new trends in this area. An application of neural networks to classification will also be discussed.<br />
<br />
'''A note'''<br />
Artificial neural networks (ANNs) have proven very useful for solving real-world problems. In economics, ANNs can be applied to predict profits, market<br />
trends, and price levels from historical market data. In industry, engineers can apply ANNs to many nonlinear engineering problems such as classification, prediction, and pattern recognition, where the tasks are very difficult to solve using conventional mathematical tools.<br />
<br />
'''Useful papers:'''<br />
<br />
H. White, "Learning in artificial neural networks: A statistical perspective," ''Neural Computation'', vol. 1, pp. 425–464, 1989.<br />
<br />
E. Wan, "Neural network classification: A Bayesian interpretation," ''IEEE Transactions on Neural Networks'', vol. 1, no. 4, pp. 303–305, 1990.<br />
<br />
G.P. Zhang, "Neural networks for classification: A survey," ''IEEE Transactions on Systems, Man, and Cybernetics'', vol. 30, no. 4, pp. 451–462, 2000.<br />
<br />
C.M. Bishop, ''Neural Networks for Pattern Recognition''. Oxford University Press, Oxford, UK, 1995.<br />
<br />
B.D. Ripley, "Neural networks and related methods for classification (with discussion)," ''Journal of the Royal Statistical Society, Series B'', vol. 56, pp. 409–456, 1994.<br />
<br />
==Project 16 : Feature extraction in news article clustering==<br />
===By Gobaan Raveendran & Daniel Nicoara ===<br />
Our project will focus on crawling the internet for news articles from many different sources and then classifying them as either left-wing or right-wing. The project's focus will be on feature extraction and on determining which features are important for classification. <br />
<br />
For supervised data, we will either automatically assign classes based on domain and see if the article fits in the predicted domain, or we will use an external system such as a topic model.<br />
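As a minimal sketch of the feature-extraction step (the two snippets and their labels below are invented; a real pipeline would operate on crawled articles with far richer features), a bag-of-words representation can be built like this:<br />

```python
from collections import Counter

# Hypothetical article snippets, labeled by source domain as described above.
articles = [
    ("tax cuts boost markets and small business growth", "right"),
    ("public healthcare and workers rights need expansion", "left"),
]

def bag_of_words(text):
    """Lowercased term-frequency features for one article."""
    return Counter(text.lower().split())

# Each article becomes a (feature vector, label) pair for a classifier.
features = [(bag_of_words(text), label) for text, label in articles]
```

Real feature extraction would add, at minimum, stop-word removal, tf-idf weighting, and a fixed shared vocabulary across all crawled articles.<br />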
<br />
==Project 17: Classification of harp seals==<br />
===By Zhikang Huang, Haoyang Fu, Mengfei Yang===<br />
Seals possess varied repertoires of underwater vocalisations. Geographic variation in call types has been reported for the Weddell and bearded seal species, and this variation has been attributed to the isolation of breeding populations within these species.<br />
<br />
Our project will focus on harp seals (Phoca groenlandica), and in particular the herds from Jan Mayen Island, the Gulf of St. Lawrence, and the Front. Our goal is to classify harp seals by herd using data obtained from underwater recordings of harp seals in these three herds. Nine hundred calls from each of the three herds form our training set, and three hundred calls form our test set for evaluating the predictive model. <br />
<br />
We will use tree models or logistic regression models as our predictive model for the classification, selecting the model with the smallest error rate. Our model will use the following variables, with HERD as the class label:<br />
<br />
'''ELEMDUR''' - this is the duration of a single element of a harp seal underwater vocalisation. It is measured in milliseconds.<br />
<br />
'''INTERDUR''' - this is the time between elements in multiple element calls. It is measured in milliseconds. Note that not all calls have multiple elements so this variable is absent in single element calls. Where absent, a value of NA is recorded in the data.<br />
<br />
'''NO_ELEM''' - this is the number of elements of the call. In harp seals all of the elements within a single call are similar and the spacing between them is constant.<br />
<br />
'''STARTFREQ''' - this is the pitch at the start of the call or the highest pitch if the call has an extremely short duration (call shape 0 below).<br />
<br />
'''ENDFRE''' - this is the pitch at the end of the call or the lowest pitch if the call has an extremely short duration (call shape 0).<br />
<br />
'''WAVEFORM''' - this codes a series of waveform shapes (a plot of amplitude vs. time) which lie more or less along a continuum. The waveform shapes and their codes are:<br />
frequency modulated sinusoidal - 9<br />
slightly frequency modulated and complex - 8<br />
sinusoidal (pure tone) - 7<br />
complex (irregular waveform) - 5<br />
amplitude pulses - 4<br />
burst pulses - 3<br />
knock (short burst pulse) - 2<br />
click (very short duration) - 1 <br />
<br />
'''CALLSHAP''' - this codes a series of call shapes as they would appear in a sonogram spectral analysis (a plot of frequency vs time). <br />
<br />
'''HERD''' - this is the herd from which the recordings were obtained. The herd codes are:<br />
Jan Mayen Island - 1 <br />
Gulf of St. Lawrence - 2 <br />
Front - 3<br />
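As a toy sketch of the classification step (the feature values below are invented rather than taken from the recordings; the project itself will fit tree or logistic regression models), a nearest-centroid rule on two of the call variables might look like:<br />

```python
import math

# Hypothetical training calls: (ELEMDUR in ms, STARTFREQ) -> herd code.
# Herd codes follow the scheme above: 1 = Jan Mayen Island,
# 2 = Gulf of St. Lawrence, 3 = Front. All numbers are invented.
train = {
    1: [(120, 400), (130, 420), (110, 390)],
    2: [(300, 800), (320, 850), (310, 820)],
    3: [(200, 600), (210, 620), (190, 580)],
}

# Compute a centroid (mean feature vector) per herd.
centroids = {
    herd: (sum(d for d, f in calls) / len(calls),
           sum(f for d, f in calls) / len(calls))
    for herd, calls in train.items()
}

def classify(elemdur, startfreq):
    """Assign a call to the herd with the nearest centroid (Euclidean)."""
    return min(centroids,
               key=lambda h: math.dist((elemdur, startfreq), centroids[h]))

label = classify(205, 610)   # a call resembling the Front herd examples
```

The error rate on the held-out three hundred calls would then be the fraction of test calls assigned to the wrong herd.<br />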
<br />
'''Reference'''<br />
<br />
Terhune, J.M. (1994) Geographical variation of harp seal underwater vocalisations, Can. J. Zoology 72(5) 892-897.<br />
<br />
Statistical Society of Canada, http://www.ssc.ca/en/education/archived-case-studies/seal-vocalisations.<br />
<br />
==Project 18 : Classifying Vehicle License Plates ==<br />
===By : Jun Kai Shan, Su Rong You===<br />
Our group will focus on using statistical methods to classify Canadian vehicle license plates. We will use MATLAB to classify the letters, digits, and province on the plates. We will take pictures of the license plates of different cars and use these pictures as our data, and we will use R for further statistical analysis.<br />
<br />
<br />
==Project 19 : Ice/No-Ice Classification==<br />
===By : Steven Leigh===<br />
Modern satellites collect massive amounts of Earth imagery, far more than human analysts can interpret manually. This project will attempt to tackle the problem of automatically identifying ice and open water from satellite imagery. Multimodal data will be considered, such as multipolar SAR data, optical data, and thematic data, to name a few.<br />
<br />
==Project 20: A survey on Support Vector Machine==<br />
===By Monsef Tahir===<br />
The support vector machine (SVM) is a training algorithm for learning classification and regression rules from data; for example, SVMs can be used to learn polynomial, radial basis function (RBF), and multi-layer perceptron (MLP) classifiers. SVMs are based on the structural risk minimisation principle, which is closely related to regularisation theory. This principle incorporates capacity control to prevent over-fitting and is thus a partial solution to the bias-variance trade-off.<br />
<br />
In this project, a survey of SVMs will be conducted, along with a comparison against other tools in terms of classification and prediction performance.<br />
<br />
<br />
==Project 21: Good/Bad-day Classification==<br />
===By Carl J. Wensater===<br />
The proposed project is to gather and parameterize, over the course of two weeks, data about daily activities such as workouts, food intake, hours of sleep, and amount of spare time. These data will be used as training data for a two-class classifier that tries to distinguish good days from bad ones. If the classification is successful, statistical analysis can be used to identify the crucial components of a good day.<br />
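A minimal sketch of such a two-class classifier (all daily records below are invented for illustration) is a perceptron trained on the parameterized daily data:<br />

```python
# Toy daily records: (hours of sleep, workout minutes, spare-time hours),
# labeled +1 for a good day and -1 for a bad day. All values are invented.
days = [
    ((8.0, 30, 3.0), +1),
    ((7.5, 45, 2.5), +1),
    ((8.5, 20, 4.0), +1),
    ((5.0,  0, 0.5), -1),
    ((6.0,  5, 1.0), -1),
    ((4.5,  0, 0.0), -1),
]

# Train a simple perceptron: adjust weights whenever a day is misclassified.
w = [0.0, 0.0, 0.0]
b = 0.0
for _ in range(20):                      # a few passes over the data
    for x, label in days:
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
        if pred != label:                # update on mistakes only
            w = [wi + label * xi for wi, xi in zip(w, x)]
            b += label

def classify(day):
    """Predict +1 (good day) or -1 (bad day) for a feature triple."""
    return 1 if sum(wi * xi for wi, xi in zip(w, day)) + b > 0 else -1
```

With two weeks of real data, the learned weights (or a logistic regression fit) would indicate which daily components matter most.<br />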
<br />
<br />
==Project 22: SVM-Based Classification of Peer-to-Peer Internet Traffic==<br />
===By: Talieh Seyed Tabatabaei===<br />
<br />
In recent years, Peer-to-Peer (P2P) file-exchange applications have overtaken Web applications as the major contributor of traffic on the Internet. Recent estimates put the volume of P2P traffic at 70% of total broadband traffic. P2P is often used for illegally sharing copyrighted music, video, games, and software. The legal ramifications of this traffic, combined with its aggressive use of network resources, have created a strong need to identify network traffic by application type. This task, referred to as traffic classification, is a prerequisite to many network management and traffic engineering problems.<br />
<br />
In this project, least-squares support vector machines (LS-SVM) will be adopted to identify P2P traffic using flow-based statistical features.<br />
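As a small sketch of the LS-SVM machinery (a toy one-dimensional "flow feature" with a linear kernel, all values invented for illustration), LS-SVM training reduces to solving a single linear system rather than a quadratic program:<br />

```python
# Toy LS-SVM (function-estimation form) with a linear kernel on one
# invented flow feature, e.g. mean packet size. Training solves
#   [ 0   1^T           ] [b]     = [0]
#   [ 1   K + I/gamma   ] [alpha]   [y]

def solve(A, v):
    """Solve A x = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    M = [row[:] + [v[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

xs = [1.0, 2.0, 8.0, 9.0]      # one flow feature per connection (invented)
ys = [-1.0, -1.0, 1.0, 1.0]    # -1 = web traffic, +1 = P2P (invented)
gamma = 10.0                   # regularization constant

n = len(xs)
K = [[xs[i] * xs[j] for j in range(n)] for i in range(n)]  # linear kernel
A = [[0.0] + [1.0] * n]
for i in range(n):
    A.append([1.0] + [K[i][j] + (1.0 / gamma if i == j else 0.0)
                      for j in range(n)])
sol = solve(A, [0.0] + ys)
b, alpha = sol[0], sol[1:]

def predict(x):
    """Sign of the decision function f(x) = sum_i alpha_i K(x, x_i) + b."""
    return 1 if sum(a * x * xi for a, xi in zip(alpha, xs)) + b > 0 else -1
```

Unlike a standard SVM, every training point generally receives a nonzero coefficient, which is exactly why training is a single linear solve; the real project would use many flow-based features and far larger kernels.<br />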
<br />
==Project 23: A survey on a fast learning algorithm on deep belief nets==<br />
===By: Seyed Seifi===<br />
<br />
Learning is difficult in densely connected, directed belief nets that have many hidden layers because it is difficult to infer the conditional distribution<br />
of the hidden activities when given a data vector. Variational methods use simple approximations to the true conditional distribution, but the approximations may be poor, especially at the deepest hidden layer, where the prior assumes independence. Also, variational learning still requires all<br />
of the parameters to be learned together and this makes the learning time scale poorly as the number of parameters increases.<br />
<br />
I plan to survey the novel fast learning algorithm for deep belief nets proposed by Prof. Geoffrey E. Hinton.<br />
<br />
<br />
http://www.cs.toronto.edu/~hinton/<br />
<br />
http://www.cs.toronto.edu/~hinton/absps/ncfast.pdf<br />
<br />
==Project 24: Feature Extraction using SVM==<br />
===By: Ad Tayal===<br />
<br />
Explore the idea of using support vector machines for joint feature extraction and classification.<br />
<br />
==Project 25: Applying distribution matching to a reinforcement learning problem==<br />
===By: Beomjoon Kim===<br />
<br />
In this project, I will try to correct the sample selection bias in the famous reinforcement learning problem known as the "Tiger Problem", using distribution matching.<br />
<br />
<br />
==Project 26: Identifying a person in pictures==<br />
</noinclude><br />
===By: Yan(Serena) Sun, Jorge Munoz, Cyrus Wu, Baozhe Chen ===<br />
<br />
<br />
For our classification project, we will use classification methods to identify a given person in different pictures. The information given is the person's hair color, shirt color, pants color, and whether they are standing up or sitting down. Based on this information, we divide the original picture into small blocks (perhaps 100×100 pixels each) and divide the given picture of the person into blocks of the same size. We then try to identify blocks in the original picture whose pattern and colors match, which should correspond to the person we are looking for. We can compare what the algorithm finds with the true person to calculate the misclassification rate. <br />
Possible software and tools include R and MATLAB.<br />
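The block-matching idea can be sketched as follows (the tiny "images" of integer color codes are invented for illustration; real photos would use far larger blocks and richer color features):<br />

```python
# Toy image: a 6x6 grid of color codes (0 = background, 2 = hair,
# 5 = shirt, 7 = pants). The "person" occupies the right-hand columns.
image = [
    [0, 0, 0, 0, 2, 2],
    [0, 0, 0, 0, 5, 5],
    [0, 0, 0, 0, 7, 7],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

# Template: a 3x2 block describing the person we are looking for
# (hair over shirt over pants), in the same color coding.
template = [[2, 2], [5, 5], [7, 7]]

def block_diff(img, top, left, tmpl):
    """Sum of absolute color differences between one block and the template."""
    return sum(abs(img[top + i][left + j] - tmpl[i][j])
               for i in range(len(tmpl)) for j in range(len(tmpl[0])))

def best_match(img, tmpl):
    """Slide the template over the image and return the (top, left) corner
    of the block with the smallest color difference."""
    h, w = len(tmpl), len(tmpl[0])
    positions = [(i, j)
                 for i in range(len(img) - h + 1)
                 for j in range(len(img[0]) - w + 1)]
    return min(positions, key=lambda p: block_diff(img, p[0], p[1], tmpl))

location = best_match(image, template)   # where the person was found
```

The misclassification rate would then be estimated by comparing such matched locations with the person's true position across many pictures.<br />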
<br />
==Project 27: Artificial Neural Networks to predict extreme losses in Financial Time Series==<br />
===By: Justin Francis ===<br />
<br />
A common issue in financial modelling is predicting when an extreme loss is about to occur. Most financial time series exhibit some level of autocorrelation, meaning there should be signals in the data before an extreme loss happens. I will experiment with a few artificial neural networks, training them on historical market index returns to see how well these phenomena can be predicted.<br />
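The premise that return series carry exploitable autocorrelation can be checked directly; the sketch below computes the lag-1 autocorrelation of a made-up return series, the kind of signal a trained network would try to pick up:<br />

```python
# Hypothetical daily returns; a real experiment would use historical
# market index data instead.
returns = [0.01, 0.012, -0.005, -0.02, -0.018, 0.004, 0.009, -0.001,
           -0.015, -0.022, 0.006, 0.011]

mean = sum(returns) / len(returns)
dev = [r - mean for r in returns]

# Lag-1 autocorrelation: covariance of consecutive deviations
# divided by the series variance.
num = sum(dev[t] * dev[t + 1] for t in range(len(dev) - 1))
den = sum(d * d for d in dev)
acf1 = num / den
```

A network trained on windows of lagged returns would, in effect, be learning to exploit this and higher-order dependence structure.<br />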
<br />
<br />
==Project 28: TBD==<br />
===By: Grace Tompkins and Tatiana Krikella ===<br />
<br />
TBD</div>Gtompkinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=f11Stat841presentation&diff=42537f11Stat841presentation2020-10-06T16:49:36Z<p>Gtompkin: /* Schedule of Project Presentations */</p>
<hr />
<div>== Schedule of Project Presentations ==<br />
Please write your project # and the names of the group members<br />
<br />
<br />
{| class="wikitable" border="1" cellpadding="2"<br />
|-<br />
|width="100pt"|Date<br />
|width="333pt"|First Presentation (project# and group members)<br />
|width="333pt"|Second Presentation (project# and group members)<br />
|width="333pt"|Third Presentation (project# and group members)<br />
|width="333pt"|Fourth Presentation (project# and group members)<br />
|-<br />
|Nov 15 ||Project #|| Project #6 by Jeff Glaister || Project #3 Grace Tompkins, Tatianna Krikella, Swaleh Hussain || Project 24 Ad Tayal<br />
|-<br />
|Nov 17 ||Project #7 Samson Hu, Blair Rose, Mikhail Targonski || Project 1 by Lai,Chunwei & Greg Pitt || Project # by Jun Kai Shan & Su Rong You || Project #4 by Jia Zhou,Chen Wang, YuanHong Yu ||Project 20 by M,Tahir <br />
|-<br />
|Nov 22 ||Project # 8 by Mohamed ElMassad|| Project #19 by Steven Leigh|| Project # 2 by Cameron Davidson-Pilon & Jennifer Smith|| Project # 17 by Zhikang Huang & Haoyang Fu &Mengfei Yang ||Project #23 Seyed Seifi<br />
|-<br />
|Nov 24 ||Project # 10 by Kenneth Webster, Soo Min Kang, and Hang Su|| Project #22 by Talieh Seyed Tabatabaei || Project #11 by Guoting (Jane) Chang || Project # 15 Hojat Abdolanezhad & Carolyn Augusta || Project # 21 by Carl J. Wensater ||<br />
|-<br />
|Nov 29 ||Project # 25 by Beomjoon Kim|| Project # 9 by Nika Haghtalab || Project # 12 by Chi-Yin (Johnny) Chow, Zhe (Gigi) Wang|| Project # 13 by Vishnu Pothugunta||Project #26 by Yan(Serena) Sun, Jorge Munoz, Baozhe Chen, Cyrus Wu ||<br />
|-<br />
|Dec 1 ||Project # 3 by Robert Amelard || Project # 14 by Daniel Severn || Project #16 by Gobaan Raveendran & Daniel Nicoara || Project # 5 by Mahmoud Faraj || Project #27 by Justin Francis<br />
|}</div>Gtompkin