# Difference between revisions of "stat841f10"


As another example, suppose we wish to classify newly-given fruits into apples and oranges by considering three features of a fruit that comprise its color, its diameter, and its weight. After selecting a classifier and constructing a model using training data <math>\,\{(X_{color, 1}, X_{diameter, 1}, X_{weight, 1}, Y_{1}), \dots , (X_{color, n}, X_{diameter, n}, X_{weight, n}, Y_{n})\}</math>, we could then use the classifier's classification rule <math>\,h</math> to assign any newly-given fruit having known feature values <math>\,x = (\,x_{color}, x_{diameter} , x_{weight})</math> the label <math>\, \hat{Y}=h(x) \in \mathcal{Y}= \{apple,orange\}</math>.
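The classification rule <math>\,h</math> is just a function from feature values to a label. The Python sketch below illustrates this with hand-made, purely hypothetical thresholds; a real classifier would learn its rule from the training data rather than hard-code it.

```python
# A minimal sketch of a classification rule h for the fruit example.
# The thresholds below are hypothetical, purely for illustration:
# a real classifier learns h from training data.

def h(color, diameter_cm, weight_g):
    """Assign a label in {'apple', 'orange'} from three feature values."""
    # Hand-made rule: orange-coloured fruits are called oranges,
    # otherwise we fall back on size.
    if color == "orange":
        return "orange"
    return "apple" if diameter_cm < 8.5 else "orange"

print(h("red", 7.0, 150.0))     # small red fruit -> 'apple'
print(h("orange", 8.0, 200.0))  # -> 'orange'
```

Any function with this signature defines a classifier; the rest of the course is about how to choose a good <math>\,h</math> from data.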

=== Examples of Classification ===

* Email spam filtering (spam vs. not spam).
* Detecting credit card fraud (fraudulent or legitimate).
* Face detection in images (face or background).
* Web page classification (sports vs. politics vs. entertainment, etc.).
* Steering an autonomous car across the US (turn left, turn right, or go straight).
* Medical diagnosis (classifying a disease based on observed symptoms).

=== Independent and Identically Distributed (iid) Data Assumption ===

Suppose that we have training data <math>\,X</math> containing <math>\,n</math> data points. The independent and identically distributed (iid) assumption states that the data points are drawn independently from identical distributions. This assumption implies that the ordering of the data points does not matter, and it is used in many classification problems. For an example of data that is not iid, consider daily temperature: today's temperature is not independent of yesterday's temperature; rather, there is a strong correlation between the temperatures of the two days.
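The temperature example can be made concrete with a small simulation: iid draws show essentially no correlation between consecutive points, while an autoregressive series (a crude stand-in for daily temperature, with made-up parameters) shows strong lag-one correlation.

```python
# A sketch contrasting iid draws with non-iid (autocorrelated) data such as
# daily temperature. All numbers are simulated, not real temperatures.
import random

random.seed(0)

# iid data: each point drawn independently from the same distribution
iid = [random.gauss(20, 5) for _ in range(1000)]

# non-iid data: an AR(1) process, where today depends on yesterday
temps = [20.0]
for _ in range(999):
    temps.append(0.9 * temps[-1] + 0.1 * 20 + random.gauss(0, 1))

def lag1_corr(xs):
    """Sample correlation between consecutive points."""
    n = len(xs) - 1
    x, y = xs[:-1], xs[1:]
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

print(round(lag1_corr(iid), 2))    # near 0: ordering carries no information
print(round(lag1_corr(temps), 2))  # large: strong day-to-day correlation
```

For iid data, shuffling the points changes nothing; for the temperature series, shuffling destroys the structure, which is exactly why the iid assumption fails there.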

=== Error rate ===

In practice, the empirical error rate is obtained to estimate the true error rate, whose value is impossible to know because the parameter values of the underlying process cannot be known but can only be estimated using available data. The empirical error rate, in practice, estimates the true error rate quite well in that, as mentioned [http://www.liebertonline.com/doi/pdf/10.1089/106652703321825928 here], it is an unbiased estimator of the true error rate.

An error rate comparison of classification methods can be found [http://pdfserve.informaworld.com/311525_770885140_713826662.pdf here].

=== Decision Theory ===

We can identify three distinct approaches to solving decision problems, all of which have been used in practical applications. These are given, in decreasing order of complexity, by:

a. First solve the inference problem of determining the class-conditional densities <math>\ p(x|C_k)</math> for each class <math>\ C_k</math> individually. Also separately infer the prior class probabilities <math>\ p(C_k)</math>. Then use Bayes' theorem in the form

<math>\begin{align}p(C_k|x)=\frac{p(x|C_k)p(C_k)}{p(x)} \end{align}</math>

to find the posterior class probabilities <math>\ p(C_k|x)</math>. As usual, the denominator in Bayes' theorem can be found in terms of the quantities appearing in the numerator, because

<math>\begin{align}p(x)=\sum_{k} p(x|C_k)p(C_k) \end{align}</math>

Equivalently, we can model the joint distribution <math>\ p(x, C_k)</math> directly and then normalize to obtain the posterior probabilities. Having found the posterior probabilities, we use decision theory to determine class membership for each new input <math>\ x</math>. Approaches that explicitly or implicitly model the distribution of inputs as well as outputs are known as "generative models", because by sampling from them it is possible to generate synthetic data points in the input space.

b. First solve the inference problem of determining the posterior class probabilities <math>\ p(C_k|x)</math>, and then subsequently use decision theory to assign each new <math>\ x</math> to one of the classes. Approaches that model the posterior probabilities directly are called "discriminative models".

c. Find a function <math>\ f(x)</math>, called a discriminant function, which maps each input <math>\ x</math> directly onto a class label. For instance, in the case of two-class problems, <math>\ f(\cdot)</math> might be binary valued and such that <math>\ f = 0</math> represents class <math>\ C_1</math> and <math>\ f = 1</math> represents class <math>\ C_2</math>. In this case, probabilities play no role.

This topic is taken from Pattern Recognition and Machine Learning by Christopher M. Bishop (Chapter 1).
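Approach (a) can be sketched numerically. The Gaussian class-conditional densities and priors below are made-up illustrative values; the point is only the mechanics of combining them with Bayes' theorem.

```python
# A sketch of approach (a): model the class-conditional densities p(x|C_k)
# and priors p(C_k), then combine them with Bayes' theorem to get p(C_k|x).
# The Gaussian parameters below are made-up illustrative values.
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# class-conditional densities p(x|C_k) and priors p(C_k)
classes = {
    "C1": {"mu": 0.0, "sigma": 1.0, "prior": 0.7},
    "C2": {"mu": 3.0, "sigma": 1.0, "prior": 0.3},
}

def posterior(x):
    """p(C_k|x) = p(x|C_k) p(C_k) / p(x), with p(x) = sum_k p(x|C_k) p(C_k)."""
    joint = {k: gaussian_pdf(x, c["mu"], c["sigma"]) * c["prior"]
             for k, c in classes.items()}
    evidence = sum(joint.values())          # the denominator p(x)
    return {k: v / evidence for k, v in joint.items()}

post = posterior(0.5)
print(post)                      # posteriors sum to 1
print(max(post, key=post.get))   # decision: most probable class
```

Note how the denominator is computed from the numerator terms, exactly as the text states; sampling from the per-class densities would generate synthetic inputs, which is why this is a generative model.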

=== Bayes Classifier ===

More information regarding the Bayesian and the frequentist schools of thought is available [http://www.statisticalengineering.com/frequentists_and_bayesians.htm here]. Furthermore, an interesting and informative YouTube video that explains the Bayesian and frequentist views of probability is available [http://www.youtube.com/watch?v=hLKOKdAircA here].

There is useful information about machine learning, neural and statistical classification at this link: [http://www.amsta.leeds.ac.uk/~charles/statlog/ Machine Learning, Neural and Statistical Classification]. It contains a description of classification in chapter 2, classical statistical methods in chapter 3, and modern statistical techniques in chapter 4.

=== Extension: Statistical Classification Framework ===

In statistical classification, each object is represented by a measurement vector of d features, and the goal of the classifier becomes finding compact and disjoint regions for the classes in a d-dimensional feature space. Such decision regions are defined by decision rules that are known or can be trained. The simplest configuration of a classifier consists of a decision rule and multiple membership functions; each membership function represents a class. The following figures illustrate this general framework.

[[File:cs1.png]]

Simple conceptual classifier.

[[File:cs2.png]]

[http://www.orfeo-toolbox.org/SoftwareGuide/SoftwareGuidech17.html#x44-2480011 Statistical Classification Framework]

The classification process can be described as follows:

# A measurement vector is input to each membership function.
# The membership functions feed their membership scores to the decision rule.
# The decision rule compares the membership scores and returns a class label.
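These steps can be sketched as a toy classifier. The membership functions below (negative squared distance to a made-up class center) are purely illustrative; any per-class scoring function fits the framework.

```python
# A sketch of the framework above: each membership function scores a
# measurement vector, and the decision rule picks the class with the
# highest score. The class centers here are made up.

centers = {"class_A": (0.0, 0.0), "class_B": (5.0, 5.0)}

def membership(x, center):
    """Membership score: higher when x is closer to the class center."""
    return -sum((a - b) ** 2 for a, b in zip(x, center))

def decision_rule(x):
    """Compare membership scores and return a class label."""
    scores = {label: membership(x, c) for label, c in centers.items()}
    return max(scores, key=scores.get)

print(decision_rule((1.0, 0.5)))   # closer to (0, 0) -> 'class_A'
print(decision_rule((4.0, 6.0)))   # closer to (5, 5) -> 'class_B'
```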

== '''Linear and Quadratic Discriminant Analysis''' ==

===Introduction===
'''Linear discriminant analysis''' ([http://en.wikipedia.org/wiki/Linear_discriminant_analysis LDA]) and the related '''Fisher's linear discriminant''' are methods used in statistics, pattern recognition and machine learning to find a linear combination of features which characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification.

LDA is also closely related to principal component analysis ([http://en.wikipedia.org/wiki/Principal_component_analysis PCA]) and [http://en.wikipedia.org/wiki/Factor_analysis factor analysis] in that all three look for linear combinations of variables which best explain the data. LDA explicitly attempts to model the difference between the classes of data; PCA, on the other hand, does not take into account any difference in class, and factor analysis builds the feature combinations based on differences rather than similarities. Discriminant analysis also differs from factor analysis in that it is not an interdependence technique: a distinction between independent variables and dependent variables (also called criterion variables) must be made.

LDA works when the measurements made on the independent variables for each observation are continuous quantities. When dealing with categorical independent variables, the equivalent technique is '''discriminant correspondence analysis'''.
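As a concrete sketch of "finding a linear combination of features which separates two classes", the following computes Fisher's discriminant direction <math>\,w = S_w^{-1}(\mu_1 - \mu_0)</math>, where <math>\,S_w</math> is the pooled within-class scatter matrix, on made-up two-dimensional data.

```python
# A sketch of Fisher's linear discriminant for two classes: the projection
# direction is w = S_w^{-1} (mu_1 - mu_0). The toy 2-d data are made up.

def mean(rows):
    n = len(rows)
    return [sum(r[i] for r in rows) / n for i in range(len(rows[0]))]

def scatter(rows, mu):
    """Within-class scatter: sum of outer products of centered points."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - mu[0], r[1] - mu[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

class0 = [[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [2.0, 1.0]]
class1 = [[6.0, 5.0], [7.0, 8.0], [8.0, 7.0], [7.0, 6.0]]

mu0, mu1 = mean(class0), mean(class1)
s0, s1 = scatter(class0, mu0), scatter(class1, mu1)
sw = [[s0[i][j] + s1[i][j] for j in range(2)] for i in range(2)]

# invert the 2x2 matrix S_w and apply it to (mu_1 - mu_0)
det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
inv = [[sw[1][1] / det, -sw[0][1] / det], [-sw[1][0] / det, sw[0][0] / det]]
diff = [mu1[0] - mu0[0], mu1[1] - mu0[1]]
w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
     inv[1][0] * diff[0] + inv[1][1] * diff[1]]

# the 1-d projections w.x separate the two classes
p0 = [w[0] * r[0] + w[1] * r[1] for r in class0]
p1 = [w[0] * r[0] + w[1] * r[1] for r in class1]
print(max(p0) < min(p1))
```

Projecting onto <math>\,w</math> is the dimensionality-reduction use of LDA; thresholding the projection gives the linear classifier.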

=== Content ===

First, we shall limit ourselves to the case where there are two classes, i.e. <math>\, \mathcal{Y}=\{0, 1\}</math>. In the above discussion, we introduced the Bayes classifier's ''decision boundary'' <math>\,D(h^*)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>, which represents a [http://en.wikipedia.org/wiki/Hyperplane hyperplane] that determines the class of any new data input depending on which side of the hyperplane the input lies in. Now, we shall look at how to derive the Bayes classifier's decision boundary under certain assumptions of the data. [http://en.wikipedia.org/wiki/Linear_discriminant_analysis Linear discriminant analysis (LDA)] and [http://en.wikipedia.org/wiki/Quadratic_classifier#Quadratic_discriminant_analysis quadratic discriminant analysis (QDA)] are two of the most well-known ways for deriving the Bayes classifier's decision boundary, and we shall look at each of them in turn.

* LDA may over-fit the training data.

The following link provides a comparison of discriminant analysis and artificial neural networks: [http://www.jstor.org/stable/2584434?seq=4]

====Different Approaches to LDA====

Data sets can be transformed and test vectors can be classified in the transformed space by two different approaches.

*Class-dependent transformation: This type of approach involves maximizing the ratio of between-class variance to within-class variance. The main objective is to maximize this ratio so that adequate class separability is obtained. The class-specific type of approach involves using two optimizing criteria for transforming the data sets independently.

*Class-independent transformation: This approach involves maximizing the ratio of overall variance to within-class variance. It uses only one optimizing criterion to transform the data sets, and hence all data points, irrespective of their class identity, are transformed using this transform. In this type of LDA, each class is considered as a separate class against all other classes.

== Further reading ==

The following are some applications that use LDA and QDA:

1- Linear discriminant analysis for improved large vocabulary continuous speech recognition [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=225984 here]

2- 2D-LDA: A statistical linear discriminant analysis for image matrix [http://www.sciencedirect.com/science?_ob=MImg&_imagekey=B6V15-4DK6B5P-4-1&_cdi=5665&_user=1067412&_pii=S0167865504002272&_origin=search&_coverDate=04%2F01%2F2005&_sk=999739994&view=c&wchp=dGLzVlz-zSkzV&md5=60ea1cf7ff045f76421f5bde64bf855a&ie=/sdarticle.pdf here]

3- Regularization studies of linear discriminant analysis in small sample size scenarios with application to face recognition [http://www.sciencedirect.com/science?_ob=MImg&_imagekey=B6V15-4DTJVF4-2-9&_cdi=5665&_user=1067412&_pii=S0167865504002260&_origin=search&_coverDate=01%2F15%2F2005&_sk=999739997&view=c&wchp=dGLzVtb-zSkzk&md5=1bba55e357b1c79579987638dcbf6828&ie=/sdarticle.pdf here]

4- Sparse discriminant vectors are useful for supervised dimension reduction of high-dimensional data. Naive application of classical Fisher's LDA to high-dimensional, low sample size settings suffers from the data piling problem. In [http://www.iaeng.org/IJAM/issues_v39/issue_1/IJAM_39_1_06.pdf] the authors use a sparse LDA method which selects important variables for discriminant analysis and thereby yields improved classification. Introducing sparsity in the discriminant vectors is very effective in eliminating data piling and the associated overfitting problem.

== '''Linear and Quadratic Discriminant Analysis cont'd - September 23, 2010''' ==

===LDA x QDA===

Linear discriminant analysis [http://en.wikipedia.org/wiki/Linear_discriminant_analysis] is a statistical method used to find the ''linear combination'' of features which best separates two or more classes of objects or events. It is widely applied in classifying diseases, positioning, product management, and marketing research. LDA assumes that the different classes have the same covariance matrix <math>\, \Sigma</math>.

Quadratic discriminant analysis [http://en.wikipedia.org/wiki/Quadratic_classifier], on the other hand, aims to find the ''quadratic combination'' of features. It is more general than linear discriminant analysis. Unlike LDA, QDA does not make the assumption that the different classes have the same covariance matrix <math>\, \Sigma</math>. Instead, QDA assumes that each class <math>\, k</math> has its own covariance matrix <math>\, \Sigma_k</math>.

The derivation of the Bayes classifier's decision boundary <math>\,D(h^*)</math> under QDA is similar to that under LDA. Again, let us first consider the two-class case where <math>\, \mathcal{Y}=\{0, 1\}</math>. This derivation is given as follows:

:<math>\,Pr(Y=1|X=x)=Pr(Y=0|X=x)</math>

A similar transformation of all the centers can be done from <math>\,\mu_k</math> to <math>\,\mu_k^*</math>, where <math> \, \mu_k^* \leftarrow S_k^{-\frac{1}{2}}U_k^\top \mu_k </math>.

It is now possible to do classification with <math>\,x^*</math> and <math>\,\mu_k^*</math>, treating them as in Case 1 above. This strategy is correct because by transforming <math>\, x</math> to <math>\,x^*</math>, where <math> \, x^* \leftarrow S_k^{-\frac{1}{2}}U_k^\top x </math>, the new variable has variance <math>\,I</math>.
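This claim is easy to verify numerically: whitening a sample by its own estimated <math>\,S_k^{-\frac{1}{2}}U_k^\top</math> yields identity covariance. A sketch with made-up data:

```python
# A sketch of the whitening argument: if Sigma_k = U_k S_k U_k^T, then
# x* = S_k^{-1/2} U_k^T x has identity covariance. Toy zero-mean data.
import numpy as np

rng = np.random.default_rng(0)

# draw points from a correlated Gaussian (one class, mean zero)
sigma = np.array([[4.0, 1.5], [1.5, 1.0]])
x = rng.multivariate_normal([0.0, 0.0], sigma, size=5000).T  # 2 x n

# eigendecompose the sample covariance (Sigma = U S U^T)
cov = np.cov(x)
s, u = np.linalg.eigh(cov)

# transform: x* = S^{-1/2} U^T x
x_star = np.diag(s ** -0.5) @ u.T @ x

print(np.round(np.cov(x_star), 2))  # approximately the identity matrix
```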

Note that when we have multiple classes, we also need to compute <math>\,\log{|\Sigma_k|}</math> for each class. Then we compute <math> \,\delta_k </math> for QDA.

If the classes have different shapes, in other words, if they have different covariance matrices <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?

The answer is yes. Consider that you have two classes with different shapes and, given a data point, you must decide which class it belongs to. You simply perform the transformations corresponding to the two classes respectively to obtain <math>\,\delta_1, \delta_2</math>, and then determine which class the data point belongs to by comparing <math> \,\delta_1 </math> and <math> \,\delta_2 </math>.

:: Step 1: For each class <math>\,k</math>, apply singular value decomposition on <math>\,X_k</math> to obtain <math>\,S_k</math> and <math>\,U_k</math>.

:: Step 2: For each class <math>\,k</math>, transform each <math>\,x</math> belonging to that class to <math>\,x_k^* = S_k^{-\frac{1}{2}}U_k^\top x</math>, and transform its center <math>\,\mu_k</math> to <math>\,\mu_k^* = S_k^{-\frac{1}{2}}U_k^\top \mu_k</math>.

:: Step 3: For each data point <math>\,x \in X</math>, find the squared Euclidean distance between the transformed data point <math>\,x_k^*</math> and the transformed center <math>\,\mu_k^*</math> of each class <math>\,k</math>, and assign <math>\,x</math> to the class <math>\,k</math> for which this squared Euclidean distance is smallest over all possible <math>\,k</math>'s.
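The three steps can be sketched on made-up two-class data as follows. (As noted above, a full QDA rule would also include the <math>\,\log|\Sigma_k|</math> term; this sketch follows the steps exactly as stated.)

```python
# A sketch of Steps 1-3: per-class whitening followed by nearest-center
# classification in each class's own transformed space. Toy 2-class data.
import numpy as np

rng = np.random.default_rng(1)

# training data for two classes with different covariance shapes
x0 = rng.multivariate_normal([0, 0], [[1.0, 0.0], [0.0, 1.0]], 200)
x1 = rng.multivariate_normal([4, 4], [[2.0, 1.2], [1.2, 2.0]], 200)

def class_transform(xk):
    """Step 1: decompose Sigma_k = U_k S_k U_k^T; return the whitening map."""
    s, u = np.linalg.eigh(np.cov(xk.T))
    w = np.diag(s ** -0.5) @ u.T          # S_k^{-1/2} U_k^T
    return w, w @ xk.mean(axis=0)         # and the transformed center mu_k^*

transforms = [class_transform(x0), class_transform(x1)]

def classify(x):
    """Steps 2-3: transform x per class, pick the nearest transformed center."""
    dists = [np.sum((w @ x - mu_star) ** 2) for w, mu_star in transforms]
    return int(np.argmin(dists))

print(classify(np.array([0.5, -0.5])))  # near class 0's center
print(classify(np.array([4.5, 3.5])))   # near class 1's center
```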

[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]

===More information on Regularized Discriminant Analysis (RDA)===
Discriminant analysis (DA) is widely used in classification problems. Besides LDA and QDA, there is also an intermediate method between the two: a regularized version of discriminant analysis (RDA) proposed by Friedman [1989], which has been shown to be more flexible in dealing with various class distributions. RDA applies regularization techniques by using two regularization parameters, which are selected to jointly maximize the classification performance. The optimal pair of parameters is commonly estimated via cross-validation from a set of candidate pairs. More detail about this method can be found in the book by Hastie et al. [2001]. On the other hand, the computation takes a long time for high-dimensional data, especially when the candidate set is large, which limits the applications of RDA to low-dimensional data. In 2006, Jieping Ye and Tie Wang developed a novel algorithm for RDA for high-dimensional data, which can efficiently estimate the optimal regularization parameters from a large set of parameter candidates. Experiments on a variety of datasets confirm the claimed theoretical estimate of the efficiency, and also show that, for a properly chosen pair of regularization parameters, RDA performs favourably in classification in comparison with other existing classification methods. For more details, see Ye, Jieping; Wang, Tie, "Regularized discriminant analysis for high dimensional, low sample size data", Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 20-23 Aug. 2006.
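The two-parameter shrinkage at the heart of RDA can be sketched as follows: the class covariance is blended toward the pooled covariance (one parameter) and then toward a multiple of the identity (the other). Exact parameter conventions vary between references, so treat this as an illustrative form rather than Friedman's precise estimator.

```python
# An illustrative sketch of RDA-style two-parameter regularization:
# shrink Sigma_k toward the pooled covariance (lam), then toward a
# multiple of the identity (gamma). Matrices below are made up.
import numpy as np

def rda_covariance(sigma_k, sigma_pooled, lam, gamma):
    """Blend Sigma_k toward the pooled covariance, then toward the identity."""
    d = sigma_k.shape[0]
    blended = (1 - lam) * sigma_k + lam * sigma_pooled
    return (1 - gamma) * blended + gamma * (np.trace(blended) / d) * np.eye(d)

sigma_k = np.array([[3.0, 1.0], [1.0, 0.5]])
sigma_pooled = np.array([[2.0, 0.2], [0.2, 2.0]])

print(rda_covariance(sigma_k, sigma_pooled, 0.0, 0.0))  # recovers QDA's Sigma_k
print(rda_covariance(sigma_k, sigma_pooled, 1.0, 0.0))  # recovers LDA's pooled Sigma
```

In practice the pair (lam, gamma) would be chosen by cross-validation over a candidate grid, as described above.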

===Further Reading for Regularized Discriminant Analysis (RDA)===

1. Regularized Discriminant Analysis and Reduced-Rank LDA [http://www.stat.psu.edu/~jiali/course/stat597e/notes2/lda2.pdf]

2. Regularized discriminant analysis for the small sample size in face recognition [http://www.google.ca/url?sa=t&source=web&cd=2&sqi=2&ved=0CCQQFjAB&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.84.6960%26rep%3Drep1%26type%3Dpdf&rct=j&q=Regularized%20Discriminant%20Analysis&ei=IPr2TJ_2MKWV4gaP5eH-Bg&usg=AFQjCNHB3fk6eVe5fSjlQCMfK44kU1-lug&sig2=5EJv_AV3W_ngSVFIa1nfRg&cad=rja.pdf]

3. Regularized Discriminant Analysis and Its Application in Microarrays [http://www-stat.stanford.edu/~hastie/Papers/RDA-6.pdf]

== Trick: Using LDA to do QDA - September 28, 2010==

Suppose we can estimate some vector <math>\underline{w}^T</math> such that

<math>y = \underline{w}^T\underline{x}</math>

where <math>\underline{w}</math> is a d-dimensional column vector, and <math style="vertical-align:0%;">\underline{x} \in \mathbb{R}^d</math> (a vector in d dimensions).

We also have a non-linear function <math>g(x) = y = \underline{x}^Tv\underline{x} + \underline{w}^T\underline{x}</math> that we cannot estimate.

Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,\underline{x}^*</math>, such that:

<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math>

and

<math>\underline{x}^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math>

We can then estimate a new function, <math>g^*(\underline{x},\underline{x}^2) = y^* = \underline{w}^{*T}\underline{x}^*</math>.

Note that we can do this for any <math>\, x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix with the original matrix squared, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a <math>\,\sin(x)</math> dimension. Note that this is not doing QDA with LDA: if we ran QDA directly on this problem, the resulting decision boundary would be different. Here we find a nonlinear boundary in the hope of a better fit, but it differs from the general QDA method. We can call it nonlinear LDA.
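A sketch of this augmentation trick on made-up data (rows as observations here): a class inside a circle cannot be separated from a surrounding ring by any line in the original 2-d space, but after appending squared features an ordinary linear discriminant separates them.

```python
# A sketch of the trick above: append squared features x_i^2 to x and run a
# plain linear discriminant on the augmented vector x*. The linear boundary
# in x*-space is a quadratic boundary in the original space. Toy data: one
# class in an inner disk, one in a surrounding ring.
import numpy as np

rng = np.random.default_rng(2)
n = 200

theta = rng.uniform(0, 2 * np.pi, 2 * n)
r = np.concatenate([rng.uniform(0.0, 0.9, n),    # class 1: inner disk
                    rng.uniform(1.3, 2.0, n)])   # class 0: outer ring
x = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
y = np.concatenate([np.ones(n, dtype=int), np.zeros(n, dtype=int)])

# x* = [x1, x2, x1^2, x2^2]
x_star = np.hstack([x, x ** 2])

def lda_fit(x, y):
    """Two-class LDA: direction w = Sigma^{-1}(mu1 - mu0), midpoint threshold."""
    mu0, mu1 = x[y == 0].mean(axis=0), x[y == 1].mean(axis=0)
    pooled = (np.cov(x[y == 0].T) + np.cov(x[y == 1].T)) / 2
    w = np.linalg.solve(pooled, mu1 - mu0)
    return w, w @ (mu0 + mu1) / 2

w, c = lda_fit(x_star, y)
accuracy = ((x_star @ w > c).astype(int) == y).mean()
print(accuracy)  # the "linear" classifier in x*-space separates the circle
```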

=== By Example ===

:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>), we can correctly classify 376 points.

===Working Example - Diabetes Data Set===

Let's take a look at a specific data set. This is a [http://archive.ics.uci.edu/ml/datasets/Diabetes diabetes data set] from the UC Irvine Machine Learning Repository. It is a fairly small data set by today's standards. The original data had eight variable dimensions. What I did here was to obtain the two prominent principal components from these eight variables. Instead of using the original eight dimensions, we will just use these two principal components for this example.

The Diabetes data set has two types of samples in it. One sample type is healthy individuals; the other is individuals with a higher risk of diabetes. Here are the prior probabilities estimated for both of the sample types, first for the healthy individuals and second for those individuals at risk:

[[File:eq1.png]]

The first type has a prior probability estimated at 0.651. This means that among the data set (250 to 300 data points), about 65% of the points belong to class one and the other 35% belong to class two. Next, we computed the mean vector for the two classes separately: [[File:eq2.png]]

Then we computed [[File:eq3.jpg]] using the formulas discussed earlier.

Once we have done all of this, we compute the linear discriminant function and find the classification rule. Classification rule: [[File:eq4.jpg]]

In this example, if you give me an <math>\, x</math>, I then plug this value into the above linear function. If the result is greater than or equal to zero, I claim that it is in class one; otherwise, it is in class two.
Below is a scatter plot of the dominant principal components. The two classes are represented: the first, without diabetes, is shown with red stars (class 1), and the second class, with diabetes, is shown with blue circles (class 2). The solid line represents the classification boundary obtained by LDA. It appears the two classes are not that well separated. The dashed or dotted line is the boundary obtained by linear regression of an indicator matrix. In this case, the results of the two different linear boundaries are very close.

[[File:eq5.jpg]]

It is always good practice to visualize the scheme to check for any obvious mistakes.

* Within training data classification error rate: 28.26%.
* Sensitivity: 45.90%.
* Specificity: 85.60%.
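For reference, these three quantities come from the entries of a confusion matrix. The counts below are hypothetical, chosen only to illustrate the formulas, not the actual diabetes-data counts; "positive" here means the diabetes class.

```python
# A sketch of how error rate, sensitivity, and specificity are computed
# from a confusion matrix. The counts below are hypothetical.

tp, fn = 56, 66   # diseased individuals classified correctly / missed
tn, fp = 208, 35  # healthy individuals classified correctly / flagged

total = tp + fn + tn + fp
error_rate = (fn + fp) / total       # fraction misclassified
sensitivity = tp / (tp + fn)         # true positive rate
specificity = tn / (tn + fp)         # true negative rate

print(round(error_rate, 4), round(sensitivity, 4), round(specificity, 4))
```

A low sensitivity with high specificity, as reported above, means the classifier misses many at-risk individuals while rarely flagging healthy ones.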

Below is the contour plot for the density of the diabetes data (the marginal density for <math>\, x</math> is a mixture of two Gaussians, one per class). It looks like a single Gaussian distribution. The reason for this is that the two classes are so close together that they merge into a single mode.

[[File:eq6.jpg]]

=== LDA and QDA in Matlab ===

'''Recall: An analysis of the function of <code>princomp</code> in Matlab.'''
<br />In our assignment 1, we learned how to perform principal component analysis using the SVD (singular value decomposition) method. In fact, Matlab offers a built-in function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html&http://www.google.cn/search?hl=zh-CN&q=mathwork+princomp&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= <code>princomp</code>] which performs PCA. From the Matlab help file on <code>princomp</code>, you can find the details about this function. Here we will analyze Matlab's <code>princomp()</code> code; we find something different from the SVD method we used in our first assignment. The following is Matlab's code for <code>princomp</code>, followed by some explanations to emphasize some key steps.

 function [pc, score, latent, tsquare] = princomp(x);

 tsquare = sum(tmp.*tmp)';

We should compare the following aspects of the above code with the SVD method:

First, rows of <math>\,X</math> correspond to observations, and columns to variables. When using <code>princomp</code> on the 2_3 data in assignment 1, note that we take the transpose of <math>\,X</math>.

Line 583: | Line 719: | ||

The third, when <math>\,X=UdV'</math>, princomp uses <math>\,V</math> as coefficients for principal components, rather than <math>\,U</math>. | The third, when <math>\,X=UdV'</math>, princomp uses <math>\,V</math> as coefficients for principal components, rather than <math>\,U</math>. | ||

− | The following is an example to perform PCA using princomp and SVD respectively to get the same | + | The following is an example to perform PCA using princomp and SVD respectively to get the same result. |

:SVD method

 >> load 2_3

Then we can see that y=score, v=U.
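To see this agreement concretely, here is a small numpy sketch (Python rather than the course's Matlab; the data and variable names are invented for illustration): the PCA variances and scores obtained from the SVD of the centred data match those obtained from the eigendecomposition of the sample covariance, up to the sign of each component.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # 100 observations, 3 variables
Xc = X - X.mean(axis=0)                # centre each column, as princomp does

# PCA via SVD: Xc = U S Vt, where the columns of Vt.T are the principal directions
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
score_svd = Xc @ Vt.T                  # coordinates in the PC basis
latent_svd = S**2 / (len(X) - 1)       # variances of the PCs

# PCA via eigendecomposition of the sample covariance
evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(evals)[::-1]        # sort by decreasing variance
evals, evecs = evals[order], evecs[:, order]

# the two methods agree up to the sign of each component
assert np.allclose(latent_svd, evals)
assert np.allclose(np.abs(score_svd), np.abs(Xc @ evecs))
```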

'''Useful resources:'''

LDA and QDA in Matlab[http://www.mathworks.com/products/statistics/demos.html?file=/products/demos/shipping/stats/classdemo.html],[http://www.mathworks.com/matlabcentral/fileexchange/189],[http://seed.ucsd.edu/~cse190/media07/MatlabClassificationDemo.pdf]

[http://www.uni-leipzig.de/~strimmer/lab/courses/ss06/seminar/slides/daniela-2x4.pdf LDA & QDA]

Using discriminant analysis for multi-class classification: an experimental investigation [http://www.springerlink.com/content/6851416084227k8p/fulltext.pdf]

===Reference articles on solving a small sample size problem when LDA is applied===

(Based on Li-Fen Chen, Hong-Yuan Mark Liao, Ming-Tat Ko, Ja-Chen Lin, Gwo-Jong Yu, "A new LDA-based face recognition system which can solve the small sample size problem", Pattern Recognition 33 (2000) 1713-1726)

"Small sample size" indicates that the number of samples is smaller than the dimension of each sample. In this case, the within-class covariance we discussed in class can be a singular matrix, so we cannot find its inverse for further analysis. However, many researchers have tried to solve this problem with different techniques:<br />
1. Goudail et al. proposed a technique which calculated 25 local autocorrelation coefficients from each sample image to achieve dimensionality reduction. (Referenced by F. Goudail, E. Lange, T. Iwamoto, K. Kyuma, N. Otsu, Face recognition system using local autocorrelations and multiscale integration, IEEE Trans. Pattern Anal. Mach. Intell. 18 (10) (1996) 1024-1028.)<br />
2. Swets and Weng applied the PCA approach to reduce the dimensionality of the images. (Referenced by D. Swets, J. Weng, Using discriminant eigenfeatures for image retrieval, IEEE Trans. Pattern Anal. Mach. Intell. 18 (8) (1996) 831-836.)<br />
3. Fukunaga proposed a more efficient algorithm that calculates eigenvalues and eigenvectors from an m*m matrix, where m is the rank of the within-class scatter matrix Sw. (Referenced by K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, New York, 1990.)<br />
4. Tian et al. used a positive pseudoinverse matrix instead of calculating the inverse of Sw. (Referenced by Q. Tian, M. Barbero, Z.H. Gu, S.H. Lee, Image classification by the Foley-Sammon transform, Opt. Eng. 25 (7) (1986) 834-840.)<br />
5. Hong and Yang added a singular value perturbation to Sw to make it a nonsingular matrix. (Referenced by Zi-Quan Hong, Jing-Yu Yang, Optimal discriminant plane for a small number of samples and design method of classifier on the plane, Pattern Recognition 24 (4) (1991) 317-324.)<br />
6. Cheng et al. proposed another method based on the principle of rank decomposition of matrices. The preceding three methods are all based on the conventional Fisher's criterion function. (Referenced by Y.Q. Cheng, Y.M. Zhuang, J.Y. Yang, Optimal Fisher discriminant analysis using the rank decomposition, Pattern Recognition 25 (1) (1992) 101-111.)<br />
7. Liu et al. modified the conventional Fisher's criterion function and conducted a number of studies based on the new criterion function. They used the total scatter matrix, rather than merely the within-class scatter matrix, as the divisor of the original Fisher's function. (Referenced by K. Liu, Y. Cheng, J. Yang, A generalized optimal set of discriminant vectors, Pattern Recognition 25 (7) (1992) 731-739.)

==Principal Component Analysis - September 30, 2010==

===Brief introduction to dimension reduction methods===

Dimension reduction is the process of reducing the number of variables in the data. [http://en.wikipedia.org/wiki/Principal_component_analysis Principal components analysis] (PCA) and factor analysis are two primary classical dimension reduction methods. PCA creates new variables as linear combinations of the original variables in the data, and the number of new variables kept depends on what proportion of the variance they account for. Factor analysis, on the other hand, tries to express the original variables as linear combinations of new variables (factors), so before forming these expressions a certain number of factors must first be determined by analyzing the features of the original variables. In general, the idea of both PCA and factor analysis is to use as few derived variables as possible to capture as much of the information as possible.

===Rough definition===

Keeping two important aspects of data analysis in mind:

Furthermore, if one considers the lower-dimensional representation produced by PCA as a least-squares fit of our original data, then it can also be shown that this representation is the one that minimizes the reconstruction error of our data. It should be noted, however, that one usually has no control over which dimensions PCA deems the most informative for a given set of data, so one does not know in advance which directions will be selected to form the lower-dimensional representation.

PCA takes a sample of <math>\, d</math>-dimensional vectors and produces an orthogonal (zero-covariance) set of <math>\, d</math> 'Principal Components'. The first principal component is the direction of greatest variance in the sample. The second principal component is the direction of second-greatest variance (orthogonal to the first component), and so on.

Then we can preserve most of the variance in the sample in a lower dimension by choosing the first <math>\, k</math> principal components and approximating the data in <math>\, k</math>-dimensional space, which is easier to analyze and plot.

===Principal Components of handwritten digits===

Suppose that we have a set of 130 images (28 by 23 pixels) of handwritten threes.

====Lagrange Multiplier====

Before we can proceed, we must review Lagrange multipliers.

[[Image:LagrangeMultipliers2D.svg.png|right|thumb|200px|"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]]]


====Example====

Suppose we wish to maximize the function <math>\displaystyle f(x,y)=x-y</math> subject to the constraint <math>\displaystyle x^{2}+y^{2}=1</math>. We can apply the Lagrange multiplier method to this example; the Lagrangian is:

<math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math>
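Setting the partial derivatives of <math>\displaystyle L</math> to zero gives <math>\displaystyle 1 = 2\lambda x</math>, <math>\displaystyle -1 = 2\lambda y</math> and <math>\displaystyle x^{2}+y^{2}=1</math>, so the maximum is attained at <math>\displaystyle (x,y)=(1/\sqrt{2},\,-1/\sqrt{2})</math> with value <math>\displaystyle \sqrt{2}</math>. A quick numerical check (a numpy sketch, not part of the original notes):

```python
import numpy as np

# maximize f(x, y) = x - y on the unit circle x^2 + y^2 = 1,
# parametrized as (cos t, sin t) so the constraint holds exactly
t = np.linspace(0.0, 2.0 * np.pi, 1_000_000)
f = np.cos(t) - np.sin(t)

best = np.argmax(f)
x, y = np.cos(t[best]), np.sin(t[best])

# the stationarity conditions 1 = 2*lambda*x, -1 = 2*lambda*y predict
# the maximizer (1/sqrt(2), -1/sqrt(2)) with value sqrt(2)
assert np.isclose(x, 1 / np.sqrt(2), atol=1e-3)
assert np.isclose(y, -1 / np.sqrt(2), atol=1e-3)
assert np.isclose(f[best], np.sqrt(2), atol=1e-6)
```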

<math>\displaystyle S\textbf{w}^* = \lambda^*\textbf{w}^* </math>

<br><br />

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

From the eigenvalue equation <math>\, \textbf{w}^*</math> is an eigenvector of '''S''' and <math>\, \lambda^*</math> is the corresponding eigenvalue of '''S'''. If we substitute <math>\displaystyle\textbf{w}^*</math> in <math>\displaystyle \textbf{w}^T S\textbf{w}</math> we obtain, <br /><br /> | From the eigenvalue equation <math>\, \textbf{w}^*</math> is an eigenvector of '''S''' and <math>\, \lambda^*</math> is the corresponding eigenvalue of '''S'''. If we substitute <math>\displaystyle\textbf{w}^*</math> in <math>\displaystyle \textbf{w}^T S\textbf{w}</math> we obtain, <br /><br /> | ||

Line 802: | Line 945: | ||

<br><br /> | <br><br /> | ||

In order to maximize the objective function we choose the eigenvector corresponding to the largest eigenvalue. We choose the first PC, '''u<sub>1</sub>''' to have the maximum variance<br /> (i.e. capturing as much variability in <math>\displaystyle x_1, x_2,...,x_D </math> as possible.) Subsequent principal components will take up successively smaller parts of the total variability. | In order to maximize the objective function we choose the eigenvector corresponding to the largest eigenvalue. We choose the first PC, '''u<sub>1</sub>''' to have the maximum variance<br /> (i.e. capturing as much variability in <math>\displaystyle x_1, x_2,...,x_D </math> as possible.) Subsequent principal components will take up successively smaller parts of the total variability. | ||

− | |||

D dimensional data will have D eigenvectors | D dimensional data will have D eigenvectors | ||

Line 812: | Line 954: | ||

<math>Var(u_1) \geq Var(u_2) \geq ... \geq Var(u_D)</math> | <math>Var(u_1) \geq Var(u_2) \geq ... \geq Var(u_D)</math> | ||

+ | If two eigenvalues happen to be equal, then the data has the same amount of variation in each of the two directions that they correspond to with. If only one of the two equal eigenvalues are to be chosen for dimensionality reduction, then either will do. Note that if ALL of the eigenvalues are the same then this means that the data is on the surface of a d-dimensional sphere (all directions have the same amount of variation). | ||

Note that the Principal Components decompose the total variance in the data: | Note that the Principal Components decompose the total variance in the data: | ||

::#When we reconstruct the training set, we are only using the top d dimensions. This will eliminate the dimensions that have lower variance (e.g. noise). Reconstructing: <math> \hat{X}_{D\times n}\longleftarrow Y_{d \times n}</math> using mapping <math>\, U_dY_{d \times n} </math>, where <math>\,U_d</math> contains the first (leftmost) <math>\,d</math> columns of <math>\,U</math>.
::#We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.
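A minimal numpy sketch of the projection and reconstruction steps above (the data and names are invented; the course examples use Matlab):

```python
import numpy as np

rng = np.random.default_rng(1)
D, n, d = 5, 200, 2                      # original dim, samples, reduced dim

# synthetic data that mostly lives in a d-dimensional subspace, plus small noise
X = rng.normal(size=(D, d)) @ rng.normal(size=(d, n)) + 0.01 * rng.normal(size=(D, n))
Xc = X - X.mean(axis=1, keepdims=True)   # centre (columns are observations)

# eigenvectors of the sample covariance, sorted by decreasing eigenvalue
evals, U = np.linalg.eigh(np.cov(Xc))
order = np.argsort(evals)[::-1]
U = U[:, order]

Ud = U[:, :d]                            # first (leftmost) d columns of U
Y = Ud.T @ Xc                            # project: Y is d x n
Xhat = Ud @ Y                            # reconstruct: Xhat is D x n

# keeping the top-d directions discards only the low-variance (noise) part
rel_err = np.linalg.norm(Xc - Xhat) / np.linalg.norm(Xc)
assert Y.shape == (d, n) and Xhat.shape == (D, n)
assert rel_err < 0.05
```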

+ | |||

+ | |||

+ | ==== Feature Extraction Uses and Discussion ==== | ||

+ | |||

+ | PCA, as well as other feature extraction methods not within the scope of the course [http://en.wikipedia.org/wiki/Feature_extraction] are used as a first step to classification in enhancing generalization capability: one of the classification aspects that will be discussed later in the course is model complexity. As a classification model becomes more complex over its training set, classification error over test data tends to increase. By performing feature extraction prior to attempting classification, we restrict model inputs to only the most important variables, thus decreasing complexity and potentially improving test results. | ||

+ | |||

+ | Feature ''selection'' methods, that are used to select subsets of relevant features for building robust learning models, differ from extraction methods, where features are transformed. Feature selection has the added benefit of improving model interpretability. | ||

+ | |||

+ | |||

+ | |||

+ | === Independent Component Analysis === | ||

+ | As we have already seen, the Principal Component Analysis (PCA) performed by the Karhunen-Lokve transform produces features <math>\ y ( i ) ; i = 0, 1, . . . , N - 1</math>, that are mutually uncorrelated. The obtained by the KL transform solution is optimal when dimensionality reduction is the goal and one wishes to minimize the approximation mean square error. However, for certain applications, the obtained solution falls short of the expectations. In contrast, the more recently developed Independent Component Analysis (ICA) theory, tries to achieve much more than simple decorrelation of the data. The ICA task is casted as follows: Given the set of input samples <math>\ x</math>, determine an <math>\ N \times N</math> invertible matrix <math>\ W</math> such that the entries <math>\ y(i), i = 0, 1, . . . , N - 1</math>, of the transformed vector | ||

+ | |||

+ | <math>\ y = W.x</math> | ||

+ | |||

+ | are mutually independent. The goal of statistical independence is a stronger condition than the uncorrelatedness required by the PCA. The two conditions are equivalent only for Gaussian random variables. Searching for independent rather than uncorrelated features gives us the means of exploiting a lot more of information, hidden in the higher order statistics of the data. | ||
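The following numpy sketch (invented data; not from the textbook) illustrates why ICA asks for more than PCA does: a rotation of two independent, non-Gaussian sources is still uncorrelated, yet the rotated components fail an independence check on their higher-order statistics.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# two independent, non-Gaussian (uniform) unit-variance sources
s = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, n))

# a 45-degree rotation: the rotated components stay uncorrelated
# (Cov(s) is proportional to I), but they are no longer independent
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
y = R @ s

# uncorrelated: the off-diagonal covariance is (close to) zero
assert abs(np.cov(y)[0, 1]) < 0.02

# not independent: E[y1^2 y2^2] != E[y1^2] E[y2^2], a fourth-order statistic
lhs = np.mean(y[0] ** 2 * y[1] ** 2)
rhs = np.mean(y[0] ** 2) * np.mean(y[1] ** 2)
assert abs(lhs - rhs) > 0.1
```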

+ | |||

+ | This topic has brought to you from Pattern Recognition by Sergios Theodoridis and Konstantinos Koutroumbas. (Chapter 6) For further details on the ICA and its varieties, refer to this book. | ||

+ | |||

+ | === References === | ||

+ | 1. Probabilistic Principal Component Analysis | ||

+ | [http://onlinelibrary.wiley.com/doi/10.1111/1467-9868.00196/abstract] | ||

+ | |||

+ | 2. Nonlinear Component Analysis as a Kernel Eigenvalue Problem | ||

+ | [http://www.mitpressjournals.org/doi/abs/10.1162/089976698300017467] | ||

+ | |||

+ | 3. Kernel principal component analysis | ||

+ | [http://www.springerlink.com/content/w0t1756772h41872/] | ||

+ | |||

+ | 4. Principal Component Analysis | ||

+ | [http://onlinelibrary.wiley.com/doi/10.1002/0470013192.bsa501/full] and [http://support.sas.com/publishing/pubcat/chaps/55129.pdf] | ||

+ | |||

+ | === Further Readings === | ||

+ | 1. I. T. Jolliffe "Principal component analysis" Available [http://books.google.ca/books?id=_olByCrhjwIC&printsec=frontcover&dq=principal+component+analysis&hl=en&ei=TooCTaesN42YnweR843lDQ&sa=X&oi=book_result&ct=result&resnum=1&ved=0CC4Q6AEwAA#v=onepage&q&f=false here]. | ||

+ | |||

+ | 2. James V. Stone "Independent component analysis: a tutorial introduction" Available [http://books.google.ca/books?id=P0rROE-WFCwC&pg=PA129&dq=principal+component+analysis&hl=en&ei=TooCTaesN42YnweR843lDQ&sa=X&oi=book_result&ct=result&resnum=7&ved=0CEYQ6AEwBg#v=onepage&q=principal%20component%20analysis&f=false here]. | ||

+ | |||

+ | 3. Aapo Hyvärinen, Juha Karhunen, Erkki Oja "Independent component analysis" Available [http://books.google.ca/books?id=96D0ypDwAkkC&printsec=frontcover&dq=independent+component+analysis&hl=en&ei=F4wCTZqjJY2RnAew6pnlDQ&sa=X&oi=book_result&ct=result&resnum=1&ved=0CCoQ6AEwAA#v=onepage&q&f=false here]. | ||

== Fisher's (Linear) Discriminant Analysis (FDA) - Two Class Problem - October 5, 2010 ==

===Sir Ronald A. Fisher===

Fisher's Discriminant Analysis (FDA), also known as Fisher's Linear Discriminant Analysis ([http://en.wikipedia.org/wiki/Linear_discriminant_analysis LDA]) in some sources, is a classical [http://en.wikipedia.org/wiki/Feature_extraction feature extraction] technique. It was originally described in 1936 by Sir [http://en.wikipedia.org/wiki/Ronald_A._Fisher Ronald Aylmer Fisher], an English statistician and eugenicist who has been described as one of the founders of modern statistical science. His original paper describing FDA can be found [http://digital.library.adelaide.edu.au/dspace/handle/2440/15227 here]; a Wikipedia article summarizing the algorithm can be found [http://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher.27s_linear_discriminant here].

In this paper Fisher used the term DISCRIMINANT FUNCTION for the first time. The term DISCRIMINANT ANALYSIS was introduced later by Fisher himself in a subsequent paper, which can be found [http://digital.library.adelaide.edu.au/coll/special//fisher/155.pdf here].

+ | |||

+ | ===Introduction=== | ||

+ | '''Linear discriminant analysis''' ([http://en.wikipedia.org/wiki/Linear_discriminant_analysis LDA]) and the related '''Fisher's linear discriminant''' are methods used in statistics, pattern recognition and machine learning to find a linear combination of features which characterize or separate two or more classes of objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification. | ||

+ | |||

+ | LDA is also closely related to principal component analysis ([http://en.wikipedia.org/wiki/Principal_component_analysis PCA]) and [http://en.wikipedia.org/wiki/Factor_analysis factor analysis] in that both look for linear combinations of variables which best explain the data. LDA explicitly attempts to model the difference between the classes of data. PCA on the other hand does not take into account any difference in class, and factor analysis builds the feature combinations based on differences rather than similarities. Discriminant analysis is also different from factor analysis in that it is not an interdependence technique: a distinction between independent variables and dependent variables (also called criterion variables) must be made. | ||

+ | |||

+ | LDA works when the measurements made on independent variables for each observation are continuous quantities. When dealing with categorical independent variables, the equivalent technique is '''discriminant correspondence analysis'''. | ||

=== Contrasting FDA with PCA ===

Suppose we have 2-dimensional data; FDA would then attempt to project the data of each class onto a point in such a way that the resulting two points are as far apart from each other as possible. Intuitively, this basic idea behind FDA is the optimal way of separating each pair of classes along a certain direction.

Please note that dimension reduction in PCA is different from subspace clustering; see the details about subspace clustering here [http://en.wikipedia.org/wiki/Clustering_high-dimensional_data].

{{Cleanup|date=October 2010|reason= Just a thought: how relevant is "Dimensionality reduction techniques" to the concept of "subspace clustering"? As in subspace clustering, the goal is to find a set of features (relevant features, the concept is referred to as local feature relevance in the literature) in the high dimensional space, where potential subspaces accommodating different classes of data points can be defined. This means: the data points are dense when they are considered in a subset of dimensions (features).}}
{{Cleanup|date=October 2010|reason=If I'm not mistaken, classification techniques like FDA use labeled training data whereas clustering techniques use unlabeled training data instead. Any other input regarding this would be much appreciated. Thanks}}

=== Two-class problem ===

In the two-class problem, we have prior knowledge that the data points belong to two classes. Conceptually, the points of each class form a cloud around the class mean, and each class has a distinct size. To assign a point to one of the two classes, we must determine the class whose mean is closest to the point, while also accounting for the different size of each class, given by the covariance of each class.

Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{1}</math>,

{{Cleanup|date=October 2010|reason=In 2. above, I wonder if the computation would be much more complex if we instead find a weighted sum of the covariances of the two classes where the weights are the sizes of the two classes?}}
{{Cleanup|date=December 2010|reason= If using the weighted sum of two covariances, you will need to use the shared mean of the two classes, and the weighted sum will be the shared covariance. Doing this will result in collapsing the two classes into one point, which contradicts the purpose of using FDA}}

As is demonstrated below, both of these goals can be accomplished simultaneously.
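As a hedged sketch of where this leads (assuming the standard two-class Fisher solution <math>\mathbf{w} \propto \mathbf{S}_W^{-1}(\underline{\mu_1} - \underline{\mu_2})</math>; the data and variable names below are invented, and numpy is used rather than the course's Matlab), projecting onto the Fisher direction separates the two clouds:

```python
import numpy as np

rng = np.random.default_rng(2)

# two synthetic classes with different means and a shared covariance shape
M = np.array([[1.0, 0.6], [0.0, 1.0]])
X1 = rng.normal(size=(100, 2)) @ M
X2 = rng.normal(size=(120, 2)) @ M + np.array([4.0, 2.0])

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)

# within-class scatter: the sum of the per-class scatter matrices
Sw = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)

# Fisher's direction: w proportional to Sw^{-1} (mu1 - mu2)
w = np.linalg.solve(Sw, mu1 - mu2)

# projecting onto w separates the classes: almost all points of each class
# fall on their class's side of the midpoint threshold
t = w @ (mu1 + mu2) / 2
acc = (np.mean(X1 @ w > t) + np.mean(X2 @ w < t)) / 2
assert acc > 0.9
```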

8. Kodipaka, S.; Vemuri, B.C.; Rangarajan, A.; Leonard, C.M.; Schmallfuss, I.; Eisenschenk, S.; "Kernel Fisher discriminant for shape-based classification in epilepsy", Medical Image Analysis, 2007. [http://www.sciencedirect.com/science/article/B6W6Y-4MH8BS0-1/2/055fb314828d785a5c3ca3a6bf3c24e9 8]

9. Fisher LDA and Kernel Fisher LDA [http://www.ics.uci.edu/~welling/classnotes/papers_class/Fisher-LDA.pdf]

==Fisher's (Linear) Discriminant Analysis (FDA) - Multi-Class Problem - October 7, 2010==

===Obtaining Covariance Matrices===

eigenvalues with respect to

{{Cleanup|reason=What if we encounter complex eigenvalues? Then the notion of the "largest" eigenvalue does not make sense. What is the solution in that case?}}
{{Cleanup|date=December 2010|reason=Covariance matrices are positive semi-definite, and the inverse of a positive definite matrix is positive definite. The product \mathbf{S}_{W}^{-1}\mathbf{S}_{B} has the same eigenvalues as the positive semi-definite matrix \mathbf{S}_{W}^{-1/2}\mathbf{S}_{B}\mathbf{S}_{W}^{-1/2}, so the eigenvalues of \mathbf{S}_{W}^{-1}\mathbf{S}_{B} will always be real, non-negative values.}}

:<math>

{{Cleanup|date=October 2010|reason=Would you please show how we could reconstruct our original data from data whose dimensionality has been reduced by FDA.}}
{{Cleanup|date=October 2010|reason= When you reduce the dimensionality of data in its most general form, you lose some features of the data and cannot reconstruct the data from the reduced space unless the data have special properties, such as sparsity, that help in reconstruction. In FDA it seems that we cannot, in general, reconstruct the data from its reduced version }}

+ | |||

+ | ====Advantages of FDA compared with PCA==== | ||

+ | |||

+ | -PCA find components which are useful for representing data. | ||

+ | |||

+ | -While there is no reason to assume that components are useful to discriminate data between classes. | ||

+ | |||

+ | -In FDA , we try to use labels to find the components which are useful for discriminating data. | ||

===Generalization of Fisher's Linear Discriminant Analysis ===

(MDA) is also termed Discriminant Factor Analysis and Canonical Discriminant Analysis. It adopts a similar perspective to PCA: the rows of the data matrix to be examined constitute points in a multidimensional space, as do the group mean vectors. Discriminating axes are determined in this space in such a way that optimal separation of the predefined groups is attained. As with PCA, the problem becomes mathematically the eigenreduction of a real, symmetric matrix. The eigenvalues represent the discriminating power of the associated eigenvectors. The <math>n_Y</math> groups lie in a space of dimension at most <math>n_{Y}-1</math>. This will be the number of discriminant axes or factors obtainable in the most common practical case, when <math>n > m > n_Y</math> (where n is the number of rows, and m the number of columns, of the input data matrix).

+ | |||

+ | ===Matlab Example: Multiple Discriminant Analysis for Face Recognition=== | ||

+ | |||

+ | % The following MATLAB code is an example of using MDA in face recognition. The used dataset can be % found be found [http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html here]. IT contains % a set of face images taken between April 1992 and April 1994 at the lab. The database was used in the % context of a face recognition project carried out in collaboration with the Speech, Vision and % Robotics Group of the Cambridge University Engineering Department. | ||

+ | |||

+ | load orl_faces_112x92.mat | ||

+ | u=(mean(faces'))'; | ||

+ | stfaces=faces-u*ones(1,400); | ||

+ | S=stfaces'*stfaces; | ||

+ | [V,E] = eig(S); | ||

+ | U=zeros(length(stfaces),150);%%%%%% | ||

+ | for i=400:-1:251 | ||

+ | U(:,401-i)=stfaces*V(:,i)/sqrt(E(i,i)); | ||

+ | end | ||

+ | |||

+ | defaces=U'*stfaces; | ||

+ | for i=1:40 | ||

+ | for j=1:5 | ||

+ | lsamfaces(:,j+5*i-5)=defaces(:,j+10*i-10); | ||

+ | ltesfaces(:,j+5*i-5)=defaces(:,j+10*i-5); | ||

+ | end | ||

+ | end | ||

+ | stlsamfaces=lsamfaces-lsamfaces*wdiag(ones(5,5),40)/5; | ||

+ | Sw=stlsamfaces*stlsamfaces'; | ||

+ | zstlsamfaces=lsamfaces-(mean(lsamfaces'))'*ones(1,200); | ||

+ | St=zstlsamfaces*zstlsamfaces'; | ||

+ | Sb=St-Sw; | ||

+ | [V D]=eig(Sw\Sb); | ||

+ | U=V(:,1:39); | ||

+ | desamfaces=U'*lsamfaces; | ||

+ | detesfaces=U'*ltesfaces; | ||

+ | rightnum=0; | ||

+ | for i=1:200 | ||

+ | mindis=10^10;minplace=1; | ||

+ | for j=1:200 | ||

+ | distan=norm(desamfaces(:,i)-detesfaces(:,j)); | ||

+ | if mindis>distan | ||

+ | mindis=distan; | ||

+ | minplace=j; | ||

+ | end | ||

+ | end | ||

+ | if floor(minplace/5-0.2)==floor(i/5-0.2) | ||

+ | rightnum=rightnum+1; | ||

+ | end | ||

+ | end | ||

+ | rightrate=rightnum/200 | ||

===K-NNs Discriminant Analysis===

:5. MDA is most appropriately used for feature selection. As in the case of PCA, we may want to focus on the variables used in order to investigate the differences between groups; to create synthetic variables which improve the grouping ability of the data; to arrive at a similar objective by discarding irrelevant variables; or to determine the most parsimonious variables for graphical representational purposes.

− | == | + | === Fisher Score === |

+ | Fisher Discriminant Analysis should be distinguished from the Fisher Score. The Fisher Score is a means by which we can evaluate the importance of each feature in a binary classification task. The Fisher Score of feature <math>\ i</math>, abbreviated <math>\ FS_i</math>, is defined as follows. | ||

− | + | <math>FS_i=\frac{(\mu_i^1-\mu_i)^2+(\mu_i^2-\mu_i)^2}{var_i^1+var_i^2}</math> | |

− | |||

− | In linear regression, the goal is use a set of training data <math>\{y_i,\, x_{i1}, \ldots, x_{id}\}, i=1, \ldots, n</math> to find a linear combination <math>\,\beta^T = \begin{pmatrix}\beta_1 & \cdots & \beta_d & \beta_0\end{pmatrix}</math> that best explains the variation in <math>\, y</math>. In <math>\,\beta</math>, <math>\,\beta_0</math> is the intercept of the fitted line that approximates the assumed linear relationship between <math>\, y</math> and <math>\,X</math>. <math>\,\beta_0</math> enables this fitted line to be situated away from the origin. In classification, the goal is to classify data into groups so that group members are more similar within groups than between groups. | + | Where <math>\ \mu_i^1</math> and <math>\ \mu_i^2</math> are the means of feature <math>\ i</math> for classes 1 and 2 respectively, <math>\ \mu_i</math> is the mean of feature <math>\ i</math> over both classes, and <math>\ var_i^1</math> and <math>\ var_i^2</math> are the variances of feature <math>\ i</math> in classes 1 and 2 respectively. |

+ | |||

+ | We can compute the FS for all of the features and then select those features with the highest FS. We want features that discriminate between the two classes as much as possible while describing each class as compactly as possible; this is exactly the criterion built into the definition of the Fisher Score. | ||
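As an illustrative sketch (not part of the original notes; the function name <code>fisher_scores</code> and the NumPy usage are our choices), the Fisher Score of every feature can be computed directly from the definition above:

```python
import numpy as np

def fisher_scores(X, y):
    """Fisher Score FS_i of each feature for a binary task with labels 0/1."""
    X1, X2 = X[y == 0], X[y == 1]
    mu = X.mean(axis=0)                          # mean of each feature over both classes
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)  # per-class feature means
    v1, v2 = X1.var(axis=0), X2.var(axis=0)      # per-class feature variances
    return ((mu1 - mu) ** 2 + (mu2 - mu) ** 2) / (v1 + v2)
```

Features can then be ranked by score, e.g. <code>np.argsort(fisher_scores(X, y))[::-1]</code>, and the top-scoring ones selected.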

+ | |||

+ | |||

+ | ===References=== | ||

+ | |||

+ | 1. Optimal Fisher discriminant analysis using the rank decomposition | ||

+ | [http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6V14-48MPMK5-14R&_user=10&_coverDate=01%2F31%2F1992&_rdoc=1&_fmt=high&_orig=search&_origin=search&_sort=d&_docanchor=&view=c&_searchStrId=1550315473&_rerunOrigin=scholar.google&_acct=C000050221&_version=1&_urlVersion=0&_userid=10&md5=b8b00da9ab59b76a40eca456f5aa99b6&searchtype=a] | ||

+ | |||

+ | 2. Face recognition using Kernel-based Fisher Discriminant Analysis | ||

+ | [http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1004157] | ||

+ | |||

+ | 3. Fisher discriminant analysis with kernels | ||

+ | [http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=788121] | ||

+ | |||

+ | 4. Fisher LDA and Kernel Fisher LDA [http://www.ics.uci.edu/~welling/classnotes/papers_class/Fisher-LDA.pdf] | ||

+ | |||

+ | 5. Previous STAT 841 notes. [http://www.math.uwaterloo.ca/~aghodsib/courses/f07stat841/notes/lecture7.pdf] | ||

+ | |||

+ | 6. Another useful pdf introducing FDA [http://www.cedar.buffalo.edu/~srihari/CSE555/Chap3.Part6.pdf] | ||

+ | |||

+ | ==Random Projection== | ||

+ | Random Projection (RP) is an approach for projecting a point from a high-dimensional space to a lower-dimensional space. In general, a target subspace, represented by a uniform random orthogonal matrix, must first be determined; the projected vector can then be written as <math>\,v=cPu</math>, where <math>\,u</math> is a d-dimensional vector, <math>\,P</math> is the uniform random orthogonal matrix with d' rows and d columns, <math>\,v</math> is the projected d'-dimensional vector, and <math>\,c</math> is a scaling factor chosen so that the expected squared length of <math>\,v</math> is equal to the squared length of <math>\,u</math>. The vectors projected by RP have two main properties: | ||

+ | 1. The distance between any two of the original vectors is approximately equal to the distance between their corresponding projected vectors. | ||

+ | 2. If each entry of the uniform random orthogonal matrix is drawn independently from the distribution N(0,1), then the expected squared length of <math>\,v</math> is equal to the squared length of <math>\,u</math>. | ||

+ | For more details of RP, please see The Random Projection Method by Santosh S. Vempala. | ||
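A minimal sketch of these properties (ours, not from the notes): it uses a matrix with i.i.d. N(0,1) entries rather than an exactly orthogonal one, in which case the appropriate scaling factor is <math>\,c = 1/\sqrt{d'}</math>. The distance between two projected points comes out close to the original distance:

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_low = 1000, 100                     # original and target dimensions

P = rng.standard_normal((d_low, d))      # entries drawn i.i.d. from N(0,1)
c = 1.0 / np.sqrt(d_low)                 # makes E[|v|^2] equal to |u|^2

u1, u2 = rng.standard_normal(d), rng.standard_normal(d)
v1, v2 = c * (P @ u1), c * (P @ u2)

# with high probability this ratio is close to 1 (distance preservation)
ratio = np.linalg.norm(v1 - v2) / np.linalg.norm(u1 - u2)
```

The concentration of this ratio around 1 improves as d' grows, which is the content of the Johnson-Lindenstrauss-style guarantees discussed in Vempala's book.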

+ | |||

+ | |||

+ | ==Linear and Logistic Regression - October 12, 2010== | ||

+ | |||

+ | ===Linear Regression=== | ||

+ | Linear regression is an approach for modeling the response variable <math>\, y</math> under the assumption that <math>\, y</math> is a [http://en.wikipedia.org/wiki/Linear_function linear function] of a set of [http://en.wikipedia.org/wiki/Regressor explanatory variables] <math>\,X</math>. Any observed deviation from this assumed linear relationship between <math>\, y</math> and <math>\,X</math> is attributed to an unobserved [http://en.wikipedia.org/wiki/Random_variable random variable] <math>\, \epsilon</math> that adds random noise. | ||

+ | |||

+ | In linear regression, the goal is to use a set of training data <math>\{y_i,\, x_{i1}, \ldots, x_{id}\}, i=1, \ldots, n</math> to find a linear combination <math>\,\beta^T = \begin{pmatrix}\beta_1 & \cdots & \beta_d & \beta_0\end{pmatrix}</math> that best explains the variation in <math>\, y</math>. In <math>\,\beta</math>, <math>\,\beta_0</math> is the intercept of the fitted line that approximates the assumed linear relationship between <math>\, y</math> and <math>\,X</math>. <math>\,\beta_0</math> enables this fitted line to be situated away from the origin. In classification, the goal is to classify data into groups so that group members are more similar within groups than between groups. | ||

If the data is 2-dimensional, a model of <math>\, y</math> as a function of <math>\,X</math> constructed using training data under the assumption of linear regression typically looks like the one in the following figure: | If the data is 2-dimensional, a model of <math>\, y</math> as a function of <math>\,X</math> constructed using training data under the assumption of linear regression typically looks like the one in the following figure: | ||

Line 1,547: | Line 1,828: | ||

\begin{align} | \begin{align} | ||

\mathbf{\hat y} = \mathbf{X}^{T}\hat{\beta} = | \mathbf{\hat y} = \mathbf{X}^{T}\hat{\beta} = | ||

− | \mathbf{X}^{T}(\mathbf{X}\mathbf{X}^{T})^{-1}\mathbf{X}\mathbf{y} | + | \mathbf{X}^{T}(\mathbf{X}\mathbf{X}^{T})^{-1}\mathbf{X}\mathbf{y} = |

+ | \mathbf{H}\mathbf{y} | ||

\end{align} | \end{align} | ||

</math> | </math> | ||
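As a quick numerical sanity check (our own sketch, following the notes' convention that the data matrix <math>\,X</math> is d-by-n with data points as columns, and assuming any dummy row for the intercept is already included), the hat matrix <math>\,H = X^T(XX^T)^{-1}X</math> that maps <math>\,y</math> to the fitted values is a symmetric, idempotent projection:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 3, 10
X = rng.standard_normal((d, n))          # data points as columns (d x n)
y = rng.standard_normal(n)

H = X.T @ np.linalg.inv(X @ X.T) @ X     # hat matrix
y_hat = H @ y                            # fitted values y-hat = H y
```

Idempotence (H @ H equals H) is what makes H a projection onto the space spanned by the regressors.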

Line 1,578: | Line 1,860: | ||

This model does not constrain Y to lie between 0 and 1, so it is not ideal, but at times it can still lead to a decent classifier, for example with the class labels encoded as <math>\ y_i=\frac{1}{n_1} </math> for the first class and <math>\ y_i=\frac{-1}{n_2} </math> for the second. | This model does not constrain Y to lie between 0 and 1, so it is not ideal, but at times it can still lead to a decent classifier, for example with the class labels encoded as <math>\ y_i=\frac{1}{n_1} </math> for the first class and <math>\ y_i=\frac{-1}{n_2} </math> for the second. | ||

[[File:Example.jpg]] | [[File:Example.jpg]] | ||

+ | |||

+ | ==== Recursive Linear Regression ==== | ||

+ | In some applications, we need to estimate the weights of the linear regression in an online scheme, where computational efficiency is very important. In such cases we have a batch of data while new samples are still being observed, and based on all of the observed data points we need, for example, to predict the class label of upcoming samples. To do this in real time, we should take advantage of the computations already done up to any given sample point and estimate the new weights (having seen the new sample point) from the previous weights (before observing the new sample point). So we want to update the weights as follows: | ||

+ | |||

+ | <math>\ W_{new}=h(W_{old},x_{new},y_{new})</math> | ||

+ | |||

+ | In which <math>\ W_{new}</math> and <math>\ W_{old}</math> are the linear regression weights after and before observation of the new sample pair, <math>\ (x_{new},y_{new})</math>. The function <math>\ h</math> could be obtained using the following procedure. | ||

+ | |||

+ | <math>\begin{align} | ||

+ | W_{old}&=(XX^T-x_{new}x_{new}^T)^{-1}(XY-x_{new}y_{new}) \\ | ||

+ | \rightarrow (XX^T-x_{new}x_{new}^T)W_{old}&=XY-x_{new}y_{new} \\ | ||

+ | \rightarrow XX^TW_{old}&=XY-x_{new}y_{new}+x_{new}x_{new}^TW_{old} \\ | ||

+ | \rightarrow W_{old}&=(XX^T)^{-1}(XY-x_{new}y_{new}+x_{new}x_{new}^TW_{old}) \\ | ||

+ | \rightarrow W_{old}&=W_{new}+(XX^T)^{-1}(-x_{new}y_{new}+x_{new}x_{new}^TW_{old}) \\ | ||

+ | \rightarrow W_{new}&=W_{old}-(XX^T)^{-1}(-x_{new}y_{new}+x_{new}x_{new}^TW_{old}) | ||

+ | \end{align}</math> | ||

+ | |||

+ | Where <math>\ X</math> and <math>\ Y</math> represent the whole set of sample pairs, including the most recently seen pair <math>\ (x_{new},y_{new})</math>. | ||
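This update rule can be verified numerically. The sketch below (ours; it stores data points as columns to match the <math>\ XX^T</math> notation above) checks that the recursively updated weights coincide with the batch least-squares solution:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 20
X = rng.standard_normal((d, n))      # all n sample points, as columns
Y = rng.standard_normal(n)

# weights fitted to the first n-1 samples only
X_old, Y_old = X[:, :-1], Y[:-1]
W_old = np.linalg.solve(X_old @ X_old.T, X_old @ Y_old)

# recursive update on observing the new pair (x_new, y_new);
# note that X X^T below already includes the new sample
x_new, y_new = X[:, -1], Y[-1]
A_inv = np.linalg.inv(X @ X.T)
W_new = W_old - A_inv @ (-x_new * y_new + np.outer(x_new, x_new) @ W_old)

# this matches the batch least-squares solution over all n samples
W_batch = np.linalg.solve(X @ X.T, X @ Y)
```

In practice one would avoid recomputing <math>\ (XX^T)^{-1}</math> from scratch at every step, e.g. by maintaining it with the Sherman-Morrison rank-one update; the sketch above only illustrates the algebra of the derivation.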

+ | |||

+ | ====Comments about Linear regression model==== | ||

+ | |||

+ | The linear regression model is one of the simplest and most popular ways to analyze the relationship between variables in a data set. However, it has disadvantages as well as advantages, and we should be clear about both before applying the model. | ||

+ | |||

+ | ''Advantages'': Linear least squares regression has earned its place as the primary tool for process modeling because of its effectiveness and completeness. Though there are types of data that are better described by functions that are nonlinear in the parameters, many processes in science and engineering are well-described by linear models. This is because either the processes are inherently linear or because, over short ranges, any process can be well-approximated by a linear model. The estimates of the unknown parameters obtained from linear least squares regression are the optimal estimates from a broad class of possible parameter estimates under the usual assumptions used for process modeling. Practically speaking, linear least squares regression makes very efficient use of the data. Good results can be obtained with relatively small data sets. Finally, the theory associated with linear regression is well-understood and allows for construction of different types of easily-interpretable statistical intervals for predictions, calibrations, and optimizations. These statistical intervals can then be used to give clear answers to scientific and engineering questions. | ||

+ | |||

+ | ''Disadvantages'': The main disadvantages of linear least squares are limitations in the shapes that linear models can assume over long ranges, possibly poor extrapolation properties, and sensitivity to outliers. Linear models with nonlinear terms in the predictor variables curve relatively slowly, so for inherently nonlinear processes it becomes increasingly difficult to find a linear model that fits the data well as the range of the data increases. As the explanatory variables become extreme, the output of the linear model will also always be more extreme. This means that linear models may not be effective for extrapolating the results of a process for which data cannot be collected in the region of interest. Of course extrapolation is potentially dangerous regardless of the model type. Finally, while the method of least squares often gives optimal estimates of the unknown parameters, it is very sensitive to the presence of unusual data points in the data used to fit a model. One or two outliers can sometimes seriously skew the results of a least squares analysis. This makes model validation, especially with respect to outliers, critical to obtaining sound answers to the questions motivating the construction of the model. | ||

+ | |||

+ | =====Inverse-Computation Trick for Matrices that are Nearly-Singular===== | ||

+ | |||

+ | The calculation of <math>\, \underline{\beta}</math> in linear regression and in logistic regression (described in the next lecture) requires the calculation of a matrix inverse. For linear regression, <math>\, (\mathbf{X}\mathbf{X}^T)^{-1} </math> must be calculated. Likewise, <math>\, (XWX^T)^{-1}</math> must be produced during the iterative method used for logistic regression. When the matrix <math>\, \mathbf{X}\mathbf{X}^T </math> or <math>\, XWX^T</math> is nearly singular, error resulting from numerical roundoff can be very large. In the case of logistic regression, it may not be possible to determine a solution because the iterative method relies on convergence; with such large error in calculation of the inverse, the solution for entries of <math>\, \underline{\beta}</math> may grow without bound. To improve the condition of the nearly-singular matrix prior to calculating its inverse, one trick is to add to it a very small identity matrix like <math>\, (10^{-10})I</math>. This modification has very little effect on the exact result for the inverse matrix, but it improves the numerical calculation considerably. Now, the inverses to be calculated are <math>\, (\mathbf{X}\mathbf{X}^T + (10^{-10})I)^{-1} </math> and <math>\, (XWX^T + (10^{-10})I)^{-1}</math> | ||
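A tiny numerical sketch of this trick (ours, not from the notes): adding a very small multiple of the identity leaves the matrix essentially unchanged but dramatically improves its condition number, and hence the accuracy of the computed inverse:

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 1.0 + 1e-13]])   # nearly singular matrix (e.g. an X X^T)
A_reg = A + 1e-10 * np.eye(2)        # add a very small identity matrix

cond_before = np.linalg.cond(A)      # enormous: A is almost rank 1
cond_after = np.linalg.cond(A_reg)   # far better conditioned
```

This is the same idea as ridge regularization with a negligibly small ridge parameter.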

+ | |||

+ | ====Multiple Linear Regression Analysis==== | ||

+ | Multiple linear regression is a statistical analysis similar to linear regression, with the exception that there can be more than one predictor variable. The assumptions regarding outliers, linearity, and constant variance need to be met. One additional assumption that needs to be examined is multicollinearity, the extent to which the predictor variables are related to each other. Multicollinearity can be assessed by asking SPSS for the Variance Inflation Factor (VIF). While different researchers have different criteria for what constitutes too high a VIF, a VIF of 10 or greater is certainly reason for pause; in that case, consider collapsing the variables. | ||
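The VIF can also be computed directly from its definition, <math>\ VIF_j = 1/(1-R_j^2)</math>, where <math>\ R_j^2</math> comes from regressing predictor <math>\ j</math> on all the other predictors. Here is a sketch (ours, not from the notes; the function name and use of NumPy least squares are our choices):

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor of each column of the design matrix X."""
    n, p = X.shape
    scores = []
    for j in range(p):
        # regress column j on all other columns plus an intercept
        others = np.column_stack([X[:, [k for k in range(p) if k != j]],
                                  np.ones(n)])
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ beta
        r2 = 1.0 - resid.var() / X[:, j].var()   # R^2 of that regression
        scores.append(1.0 / (1.0 - r2))
    return np.array(scores)
```

Two nearly collinear predictors will both receive very large VIFs, while an independent predictor's VIF stays near 1.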

===Logistic Regression=== | ===Logistic Regression=== | ||

Line 1,591: | Line 1,906: | ||

1. <math>y = \frac{1}{1+e^{-x}}</math> | 1. <math>y = \frac{1}{1+e^{-x}}</math> | ||

− | 2. <math>\frac{dy}{dx} = y(1-y)=\frac{e^{x}}{(1+e^{x})^{2}}</math> | + | 2. <math>\frac{dy}{dx} = y(1-y)=\frac{e^{-x}}{(1+e^{-x})^{2}}</math> |

3. <math>y(0) = \frac{1}{2}</math> | 3. <math>y(0) = \frac{1}{2}</math> | ||
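These properties are easy to verify numerically; a small check of our own, comparing the analytic derivative <math>\,y(1-y)</math> against a central finite difference:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x = 0.7
y = sigmoid(x)
analytic = y * (1.0 - y)                                   # property 2: y(1-y)
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2.0 * h)    # central difference
```

Note that the derivative is positive everywhere, as it must be for the strictly increasing sigmoid.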

Line 1,635: | Line 1,950: | ||

====Fitting a Logistic Regression==== | ====Fitting a Logistic Regression==== | ||

− | Logistic regression tries to fit a distribution. The common practice in statistics is to fit density function | + | Logistic regression tries to fit a distribution to the data. The common practice in statistics is to fit a density function to data using [http://en.wikipedia.org/wiki/Maximum_likelihood maximum likelihood]. The maximum likelihood estimate of <math>\underline\beta</math>, denoted <math>\hat \beta_{ML}</math>, maximizes the probability of observing the training data <math>\{y_i,\, x_{i1}, \ldots, x_{id}\}, i=1, \ldots, n</math> from the maximum likelihood distribution. Combining <math>\displaystyle P(Y=1 | X=x)</math> and <math>\displaystyle P(Y=0 | X=x)</math> as follows, we can consider the two classes at the same time (this is a useful trick, since <math> y_i \in \{0, 1\}</math>): |

:<math>p(\underline{x_{i}};\underline{\beta}) = \left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{y_i} \left(\frac{1}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{1-y_i}</math> | :<math>p(\underline{x_{i}};\underline{\beta}) = \left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{y_i} \left(\frac{1}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{1-y_i}</math> | ||
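Since <math>y_i \in \{0, 1\}</math>, exactly one of the two factors differs from 1, so this single expression reproduces both <math>\displaystyle P(Y=1 | X=x)</math> and <math>\displaystyle P(Y=0 | X=x)</math>. A small numerical sketch (ours; the variable and function names are our choices):

```python
import numpy as np

def p1(x, beta):
    """P(Y=1 | X=x) under the logistic model."""
    e = np.exp(beta @ x)
    return e / (1.0 + e)

def bernoulli(x, y, beta):
    """Combined form p(x; beta): p^y * (1-p)^(1-y) for y in {0, 1}."""
    p = p1(x, beta)
    return p ** y * (1.0 - p) ** (1 - y)

beta = np.array([0.5, -1.0])
x = np.array([1.0, 2.0])
```

With y = 1 the second factor is 1 and the expression reduces to P(Y=1|X=x); with y = 0 the first factor is 1 and it reduces to P(Y=0|X=x).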

Line 1,644: | Line 1,959: | ||

\begin{align} | \begin{align} | ||

\mathcal{L}(\theta)&=p({x_{1},...,x_{n}};\theta)\\ | \mathcal{L}(\theta)&=p({x_{1},...,x_{n}};\theta)\\ | ||

− | &=\displaystyle p(x_{1};\theta) p(x_{2};\theta)... p(x_{n};\theta) \quad | + | &=\displaystyle p(x_{1};\theta) p(x_{2};\theta)... p(x_{n};\theta) \quad |

− | + | \end{align} | |

+ | </math> (by independence and identical distribution) | ||

+ | :::<math> | ||

+ | \begin{align} | ||

+ | = \prod_{i=1}^n p(x_{i};\theta) | ||

\end{align} | \end{align} | ||

</math> | </math> | ||

Line 1,676: | Line 1,995: | ||

To solve this equation, the [http://numericalmethods.eng.usf.edu/topics/newton_raphson.html Newton-Raphson algorithm] is used which requires the second derivative of the log-likelihood <math>\,l(\beta)</math> with respect to <math>\,\beta</math> in addition to the first derivative of <math>\,l(\beta)</math> with respect to <math>\,\beta</math>. This is demonstrated in the next section. | To solve this equation, the [http://numericalmethods.eng.usf.edu/topics/newton_raphson.html Newton-Raphson algorithm] is used which requires the second derivative of the log-likelihood <math>\,l(\beta)</math> with respect to <math>\,\beta</math> in addition to the first derivative of <math>\,l(\beta)</math> with respect to <math>\,\beta</math>. This is demonstrated in the next section. | ||

− | ==== | + | === Example: Logistic Regression in MATLAB === |

+ | |||

+ | % function x = logistic(a, y, w, ridge, param) | ||

+ | % | ||

+ | % Logistic regression. Design matrix A, targets Y, optional instance | ||

+ | % weights W, optional ridge term RIDGE, optional parameters object PARAM. | ||

+ | % | ||

+ | % W is a vector with length equal to the number of training examples; RIDGE | ||

+ | % can be either a vector with length equal to the number of regressors, or | ||

+ | % a scalar (the latter being synonymous to a vector with all entries the | ||

+ | % same). | ||

+ | % | ||

+ | % PARAM has fields PARAM.MAXITER (an iteration limit), PARAM.VERBOSE | ||

+ | % (whether to print diagnostic information), PARAM.EPSILON (used to test | ||

+ | % convergence), and PARAM.MAXPRINT (how many regression coefficients to | ||

+ | % print if VERBOSE==1). | ||

+ | % | ||

+ | % Model is | ||

+ | % | ||

+ | % E(Y) = 1 ./ (1+exp(-A*X)) | ||

+ | % | ||

+ | % Outputs are regression coefficients X. | ||

+ | |||

+ | function x = logistic(a, y, w, ridge, param) | ||

− | + | % process parameters | |

− | * Limitations of Logistic Regression: | + | [n, m] = size(a); |

+ | if ((nargin < 3) || (isempty(w))) | ||

+ | w = ones(n, 1); | ||

+ | end | ||

+ | if ((nargin < 4) || (isempty(ridge))) | ||

+ | ridge = 1e-5; | ||

+ | end | ||

+ | if (nargin < 5) | ||

+ | param = []; | ||

+ | end | ||

+ | if (length(ridge) == 1) | ||

+ | ridgemat = speye(m) * ridge; | ||

+ | elseif (length(ridge(:)) == m) | ||

+ | ridgemat = spdiags(ridge(:), 0, m, m); | ||

+ | else | ||

+ | error('ridge weight vector should be length 1 or %d', m); | ||

+ | end | ||

+ | if (~isfield(param, 'maxiter')) | ||

+ | param.maxiter = 200; | ||

+ | end | ||

+ | if (~isfield(param, 'verbose')) | ||

+ | param.verbose = 0; | ||

+ | end | ||

+ | if (~isfield(param, 'epsilon')) | ||

+ | param.epsilon = 1e-10; | ||

+ | end | ||

+ | if (~isfield(param, 'maxprint')) | ||

+ | param.maxprint = 5; | ||

+ | end | ||

+ | |||

+ | % do the regression | ||

+ | x = zeros(m,1); | ||

+ | oldexpy = -ones(size(y)); | ||

+ | for iter = 1:param.maxiter | ||

+ | adjy = a * x; | ||

+ | expy = 1 ./ (1 + exp(-adjy)); | ||

+ | deriv = expy .* (1-expy); | ||

+ | wadjy = w .* (deriv .* adjy + (y-expy)); | ||

+ | weights = spdiags(deriv .* w, 0, n, n); | ||

+ | x = inv(a' * weights * a + ridgemat) * a' * wadjy; | ||

+ | if (param.verbose) | ||

+ | len = min(param.maxprint, length(x)); | ||

+ | fprintf('%3d: [',iter); | ||

+ | fprintf(' %g', x(1:len)); | ||

+ | if (len < length(x)) | ||

+ | fprintf(' ... '); | ||

+ | end | ||

+ | fprintf(' ]\n'); | ||

+ | end | ||

+ | if (sum(abs(expy-oldexpy)) < n*param.epsilon) | ||

+ | if (param.verbose) | ||

+ | fprintf('Converged.\n'); | ||

+ | end | ||

+ | return; | ||

+ | end | ||

+ | oldexpy = expy; | ||

+ | end | ||

+ | warning('logistic:notconverged', 'Failed to converge'); | ||

+ | |||

+ | |||

+ | ====Extension==== | ||

+ | |||

+ | * When we are dealing with a problem with more than two classes, we need to generalize our logistic regression to a [http://en.wikipedia.org/wiki/Multinomial_logit Multinomial Logit model]. | ||

+ | *An extension of the logistic model to sets of interdependent variables is the [http://en.wikipedia.org/wiki/Conditional_random_field Conditional random field]. | ||

+ | |||

+ | * Advantages and Limitations of Linear Regression Model: | ||

+ | :1. Linear regression implements a statistical model that, when relationships between the independent variables and the dependent variable are almost linear, shows optimal results. | ||

+ | :2. Linear regression is often inappropriately used to model non-linear relationships. | ||

+ | :3. Linear regression is limited to predicting numeric output. | ||

+ | :4. A lack of explanation about what has been learned can be a problem. | ||

+ | |||

+ | * Limitations of Logistic Regression: | ||

:1. We know that there are no assumptions made about the distributions of the features of the data (i.e. the explanatory variables). However, the features should not be highly correlated with one another because this could cause problems with estimation. | :1. We know that there are no assumptions made about the distributions of the features of the data (i.e. the explanatory variables). However, the features should not be highly correlated with one another because this could cause problems with estimation. | ||

:2. A large number of data points (i.e. a large sample size) is required for logistic regression to provide sufficient estimates of the parameters in both classes. The greater the number of features/dimensions of the data, the larger the required sample size. | :2. A large number of data points (i.e. a large sample size) is required for logistic regression to provide sufficient estimates of the parameters in both classes. The greater the number of features/dimensions of the data, the larger the required sample size. | ||

:3. According to [http://www.google.ca/url?sa=t&source=web&cd=3&ved=0CC0QFjAC&url=http%3A%2F%2Fwww.csun.edu%2F~ata20315%2Fpsy524%2Fdocs%2FPsy524%2520lecture%252018%2520logistic.ppt&rct=j&q=logistic%20regression%20limitations&ei=mN7RTOC5HcWOnwfP0eho&usg=AFQjCNFBQ8BNxnc7xVArBgVgVWJOnDLMlw&sig2=_6j0mR3r92_xVGtzEJl7oA&cad=rja this source] however, the only real limitation of logistic regression as compared to other types of regression such as linear regression is that the response variable <math>\,y</math> can only take discrete values. | :3. According to [http://www.google.ca/url?sa=t&source=web&cd=3&ved=0CC0QFjAC&url=http%3A%2F%2Fwww.csun.edu%2F~ata20315%2Fpsy524%2Fdocs%2FPsy524%2520lecture%252018%2520logistic.ppt&rct=j&q=logistic%20regression%20limitations&ei=mN7RTOC5HcWOnwfP0eho&usg=AFQjCNFBQ8BNxnc7xVArBgVgVWJOnDLMlw&sig2=_6j0mR3r92_xVGtzEJl7oA&cad=rja this source] however, the only real limitation of logistic regression as compared to other types of regression such as linear regression is that the response variable <math>\,y</math> can only take discrete values. | ||

+ | |||

+ | ====Further reading ==== | ||

+ | Some supplemental readings on linear and logistic regression: | ||

+ | |||

+ | 1- A simple method of sample size calculation for linear and logistic regression [http://onlinelibrary.wiley.com/doi/10.1002/%28SICI%291097-0258%2819980730%2917:14%3C1623::AID-SIM871%3E3.0.CO;2-S/pdf here] | ||

+ | |||

+ | 2- Choosing Between Logistic Regression and Discriminant Analysis [http://www.jstor.org/stable/pdfplus/2286261.pdf?acceptTC=true here] | ||

+ | |||

+ | 3- On the existence of maximum likelihood estimates in logistic regression models [http://biomet.oxfordjournals.org/content/71/1/1.full.pdf+html here] | ||

==Lecture summary== | ==Lecture summary== | ||

Line 1,693: | Line 2,115: | ||

===Logistic Regression Model=== | ===Logistic Regression Model=== | ||

− | Recall that in the last lecture, we learned the logistic regression model | + | In statistics, '''logistic regression''' (sometimes called the '''logistic model''' or '''logit model''') is used for prediction of the probability of occurrence of an event by fitting data to a logistic curve. It is a generalized linear model used for binomial regression. Like many forms of regression analysis, it makes use of several predictor variables that may be either numerical or categorical. For example, the probability that a person has a heart attack within a specified time period might be predicted from knowledge of the person's age, sex and body mass index. Logistic regression is used extensively in the medical and social sciences fields, as well as in marketing applications such as prediction of a customer's propensity to purchase a product or cease a subscription. |

+ | |||

+ | |||

+ | Recall that in the last lecture, we learned the logistic regression model: | ||

* <math>P(Y=1 | X=x)=P(\underline{x};\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}</math> | * <math>P(Y=1 | X=x)=P(\underline{x};\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}</math> | ||

Line 1,731: | Line 2,156: | ||

<math> | <math> | ||

− | X^{new} \leftarrow X^{old} - H^{-1}\nabla | + | X^{new} \leftarrow X^{old} - H^{-1}(X^{old})\,\nabla f(X^{old})

</math> | </math> | ||

<br /> | <br /> | ||

− | H is the [http://en.wikipedia.org/wiki/Hessian_matrix Hessian matrix] or second derivative matrix and <math>\,\nabla</math> is the [http://en.wikipedia.org/wiki/Gradient gradient] or first derivative vector | + | where <math>\,H</math> is the [http://en.wikipedia.org/wiki/Hessian_matrix Hessian matrix] or second derivative matrix and <math>\,\nabla</math> is the [http://en.wikipedia.org/wiki/Gradient gradient] or first derivative vector. |

<br /> | <br /> | ||
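In one dimension the Hessian is just the second derivative, and the update reduces to <math>x^{new} = x^{old} - f'(x^{old})/f''(x^{old})</math>. A tiny sketch (our own example, minimizing <math>f(x)=e^x-2x</math>, whose minimum is at <math>x=\ln 2</math>):

```python
import math

# minimize f(x) = exp(x) - 2x:  f'(x) = exp(x) - 2,  f''(x) = exp(x)
x = 0.0
for _ in range(20):
    x = x - (math.exp(x) - 2.0) / math.exp(x)   # Newton-Raphson step
```

Because the update uses curvature information, convergence near the optimum is quadratic: a handful of iterations already reaches machine precision.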

Line 1,761: | Line 2,186: | ||

− | In each of the iterative steps, starting with the existing <math>\,\underline{\beta}^{old}</math> which is initialized with an arbitrarily chosen value, the Newton-Raphson updating rule for obtaining <math>\,\underline{\beta}^{new}</math> is | + | In each of the iterative steps, starting with the existing <math>\,\underline{\beta}^{old}</math> which is initialized with an arbitrarily chosen value, the [http://en.wikipedia.org/wiki/Newton-Raphson Newton-Raphson] updating rule for obtaining <math>\,\underline{\beta}^{new}</math> is |

<math>\,\underline{\beta}^{new}\leftarrow \,\underline{\beta}^{old}- (\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T})^{-1}(\frac{\partial l}{\partial \underline{\beta}})</math> where the derivatives are evaluated at <math>\,\underline{\beta}^{old}</math> | <math>\,\underline{\beta}^{new}\leftarrow \,\underline{\beta}^{old}- (\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T})^{-1}(\frac{\partial l}{\partial \underline{\beta}})</math> where the derivatives are evaluated at <math>\,\underline{\beta}^{old}</math> | ||

Line 1,780: | Line 2,205: | ||

<math>\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T} = -XWX^T</math> | <math>\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T} = -XWX^T</math> | ||

− | The Newton-Raphson step is | + | The [http://en.wikipedia.org/wiki/Newton-Raphson Newton-Raphson] step is |

<math>\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math> | <math>\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math> | ||

Line 1,816: | Line 2,241: | ||

#<math>\underline{\beta}^{new} \leftarrow (XWX^T)^{-1}XWZ</math>. | #<math>\underline{\beta}^{new} \leftarrow (XWX^T)^{-1}XWZ</math>. | ||

#If <math>\underline{\beta}^{new}</math> is sufficiently close to <math>\underline{\beta}^{old}</math> according to an arbitrarily defined criterion, then stop; otherwise, set <math>\,\underline{\beta}^{old} \leftarrow \underline{\beta}^{new}</math> and another iterative step is made towards convergence between <math>\underline{\beta}^{new}</math> and <math>\underline{\beta}^{old}</math>. | #If <math>\underline{\beta}^{new}</math> is sufficiently close to <math>\underline{\beta}^{old}</math> according to an arbitrarily defined criterion, then stop; otherwise, set <math>\,\underline{\beta}^{old} \leftarrow \underline{\beta}^{new}</math> and another iterative step is made towards convergence between <math>\underline{\beta}^{new}</math> and <math>\underline{\beta}^{old}</math>. | ||

+ | |||

+ | The following Matlab code implements the method above: | ||

+ | |||

+ | Error = 0.01; | ||

+ | |||

+ | %Initialize logistic variables | ||

+ | B_old=0.1*ones(m,1); %beta | ||

+ | W=zeros(n,n); %diagonal weight matrix; diagonal entries are set inside the loop | ||

+ | P=zeros(n,1); | ||

+ | Norm=1; | ||

+ | |||

+ | while Norm>Error %while the change in Beta (represented by the norm between B_new and B_old) is higher than the threshold, iterate | ||

+ | for i=1:n | ||

+ | P(i,1)=exp(B_old'*Xnew(:,i))/(1+exp(B_old'*Xnew(:,i))); | ||

+ | W(i,i)=P(i,1)*(1-P(i,1)); | ||

+ | end | ||

+ | z = Xnew'*B_old + pinv(W)*(ytrain-P); | ||

+ | B_new = pinv(Xnew*W*Xnew')*Xnew*W*z; | ||

+ | Norm=sqrt((B_new-B_old)'*(B_new-B_old)); | ||

+ | B_old = B_new; | ||

+ | end | ||

====Classification==== | ====Classification==== | ||

Line 1,899: | Line 2,345: | ||

[[File:Boundary-lda.png|frame|center| From this figure, we can see that the results of Logistic Regression and LDA are very similar.]] | [[File:Boundary-lda.png|frame|center| From this figure, we can see that the results of Logistic Regression and LDA are very similar.]] | ||

− | === | + | === Extra Matlab Examples === |

− | + | ==== Example 1 ==== | |

− | |||

− | |||

− | |||

− | |||

− | === | + | % This Matlab code provides a function that uses the Newton-Raphson algorithm |

+ | % to calculate ML estimates of a simple logistic regression. Most of the | ||

+ | % code comes from Anders Swensen, "Non-linear regression." There are two | ||

+ | % elements in the beta vector, which we wish to estimate. | ||

+ | |||

+ | function [beta,J_bar] = NR_logistic(data,beta_start) | ||

+ | x=data(:,1); % x is first column of data | ||

+ | y=data(:,2); % y is second column of data | ||

+ | n=length(x) | ||

+ | diff = 1; beta = beta_start; % initial values | ||

+ | while diff>0.0001 % convergence criterion | ||

+ | beta_old = beta; | ||

+ | p = exp(beta(1)+beta(2)*x)./(1+exp(beta(1)+beta(2)*x)); | ||

+ | l = sum(y.*log(p)+(1-y).*log(1-p)) | ||

+ | s = [sum(y-p); % scoring function | ||

+ | sum((y-p).*x)]; | ||

+ | J_bar = [sum(p.*(1-p)) sum(p.*(1-p).*x); % information matrix | ||

+ | sum(p.*(1-p).*x) sum(p.*(1-p).*x.*x)] | ||

+ | beta = beta_old + J_bar\s % new value of beta | ||

+ | diff = sum(abs(beta-beta_old)); % sum of absolute differences | ||

+ | end | ||

− | A detailed proof that logistic regression is convex is available [http://people.csail.mit.edu/jrennie/writing/convexLR.pdf here]. See '1 Binary LR' for the case we discussed in lecture. | + | ==== Example 2 ==== |

+ | |||

+ | % This Matlab program illustrates the use of the Newton-Raphson algorithm | ||

+ | % to obtain maximum likelihood estimates of a logistic regression. The data | ||

+ | % and much of the code are taken from Anders Swensen, "Non-linear regression," | ||

+ | % www.math.uio.no/avdc/kurs/ST110/materiale/opti_30.ps. | ||

+ | % First, load and transform data: | ||

+ | load 'beetle.dat'; % load data | ||

+ | m=length(beetle(:,1)) % count the rows in the data matrix | ||

+ | x=[]; % create empty vectors | ||

+ | y=[]; | ||

+ | for j=1:m % expand group data into individual data | ||

+ | x=[x,beetle(j,1)*ones(1,beetle(j,2))]; | ||

+ | y=[y,ones(1,beetle(j,3)),zeros(1,beetle(j,2)-beetle(j,3))]; | ||

+ | end | ||

+ | beetle2=[x;y]'; | ||

+ | |||

+ | % Next, specify starting points for iteration on parameter values: | ||

+ | beta0 = [0; 0] | ||

+ | |||

+ | % Finally, call the function NR_logistic and use its output | ||

+ | [betaml,Jbar] = NR_logistic(beetle2,beta0) | ||

+ | covmat = inv(Jbar) | ||

+ | stderr = sqrt(diag(covmat)) | ||

+ | |||

+ | ==== Example 3 ==== | ||

+ | |||

+ | % function x = logistic(a, y, w) | ||

+ | % Logistic regression. Design matrix A, targets Y, optional | ||

+ | % instance weights W. Model is E(Y) = 1 ./ (1+exp(-A*X)). | ||

+ | % Outputs are regression coefficients X. | ||

+ | function x = logistic(a, y, w) | ||

+ | epsilon = 1e-10; | ||

+ | ridge = 1e-5; | ||

+ | maxiter = 200; | ||

+ | [n, m] = size(a); | ||

+ | if nargin < 3 | ||

+ | w = ones(n, 1); | ||

+ | end | ||

+ | x = zeros(m,1); | ||

+ | oldexpy = -ones(size(y)); | ||

+ | for iter = 1:maxiter | ||

+ | adjy = a * x; | ||

+ | expy = 1 ./ (1 + exp(-adjy)); | ||

+ | deriv = max(epsilon*0.001, expy .* (1-expy)); | ||

+ | adjy = adjy + (y-expy) ./ deriv; | ||

+ | weights = spdiags(deriv .* w, 0, n, n); | ||

+ | x = inv(a' * weights * a + ridge*speye(m)) * a' * weights * adjy; | ||

+ | fprintf('%3d: [',iter); | ||

+ | fprintf(' %g', x); | ||

+ | fprintf(' ]\n'); | ||

+ | if (sum(abs(expy-oldexpy)) < n*epsilon) | ||

+ | fprintf('Converged.\n'); | ||

+ | break; | ||

+ | end | ||

+ | oldexpy = expy; | ||

+ | end | ||

+ | |||

+ | ===Lecture Summary=== | ||

+ | |||

+ | Traditionally, regression parameters are estimated using maximum likelihood. However, other optimization techniques may be used as well. | ||

+ | <br /> | ||

+ | In the case of logistic regression, there is no closed-form solution for the zero of the first derivative of the log-likelihood function, so the Newton-Raphson algorithm is typically used to estimate the parameters. The problem is convex, so any optimum the algorithm converges to is the global optimum. | ||

+ | <br /> | ||

+ | Logistic regression requires fewer parameters than LDA or QDA, which makes it favorable for high-dimensional data. | ||
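For readers who prefer Python, the same Newton-Raphson iteration can be sketched with NumPy; the function name and the toy data below are ours for illustration, not from the course:

```python
import numpy as np

def nr_logistic(x, y, beta_start, tol=1e-4, max_iter=100):
    """Newton-Raphson MLE for a simple logistic regression y ~ beta0 + beta1*x."""
    beta = np.asarray(beta_start, dtype=float)
    J = None
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(beta[0] + beta[1] * x)))   # fitted probabilities
        s = np.array([np.sum(y - p),                         # score vector
                      np.sum((y - p) * x)])
        w = p * (1.0 - p)
        J = np.array([[np.sum(w),     np.sum(w * x)],        # information matrix
                      [np.sum(w * x), np.sum(w * x * x)]])
        step = np.linalg.solve(J, s)                         # Newton step
        beta = beta + step
        if np.sum(np.abs(step)) < tol:                       # convergence criterion
            break
    return beta, J

# toy data (not linearly separable, so the MLE exists)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
beta_hat, J_bar = nr_logistic(x, y, [0.0, 0.0])
```

The inverse of the returned information matrix estimates the covariance of the coefficients, just as in the Matlab example.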

+ | |||

+ | ===Supplements=== | ||

+ | |||

+ | A detailed proof that logistic regression is convex is available [http://people.csail.mit.edu/jrennie/writing/convexLR.pdf here]. See '1 Binary LR' for the case we discussed in lecture. | ||
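In brief, the convexity argument in that proof runs as follows (a sketch in the notation of these notes):

```latex
\[
-\ell(\beta) \;=\; \sum_{i=1}^{n} \log\!\bigl(1 + e^{\beta^{T} x_i}\bigr) \;-\; \sum_{i=1}^{n} y_i\,\beta^{T} x_i
\]
\[
\nabla^{2}\bigl(-\ell(\beta)\bigr) \;=\; \sum_{i=1}^{n} p_i (1-p_i)\, x_i x_i^{T} \;=\; X W X^{T},
\qquad W = \mathrm{diag}\bigl(p_i(1-p_i)\bigr)
\]
\[
v^{T} X W X^{T} v \;=\; \sum_{i=1}^{n} p_i(1-p_i)\,(x_i^{T} v)^{2} \;\ge\; 0 \quad \text{for all } v
\]
```

Since the Hessian of the negative log-likelihood is positive semi-definite everywhere, the negative log-likelihood is convex.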

+ | |||

+ | |||

+ | ===[http://komarix.org/ac/lr Applications]=== | ||

+ | |||

+ | 1. Collaborative filtering. | ||

+ | |||

+ | 2. Link Analysis. | ||

+ | |||

+ | 3. Time Series with Logistic Regression. | ||

+ | |||

+ | 4. Alias Detection. | ||

+ | |||

+ | ===References=== | ||

+ | |||

+ | 1. Applied logistic regression | ||

+ | [http://books.google.ca/books?hl=en&lr=&id=Po0RLQ7USIMC&oi=fnd&pg=PA1&dq=Logistic+Regression&ots=DmdTni_oGX&sig=PDYTPVdy3T115RtFbBN3_SzX5Vc#v=onepage&q&f=false] | ||

+ | |||

+ | 2. External validity of predictive models: a comparison of logistic regression, classification trees, and neural networks | ||

+ | [http://www.jclinepi.com/article/S0895-4356%2803%2900120-3/abstract] | ||

+ | |||

+ | 3. Logistic Regression: A Self-Learning Text by David G. Kleinbaum, Mitchel Klein [http://books.google.ca/books?id=J7E0JQweHkoC&printsec=frontcover&dq=logistic+regression&hl=en&ei=7WECTcvqMp-KnAeaq6HlDQ&sa=X&oi=book_result&ct=result&resnum=3&ved=0CD8Q6AEwAg#v=onepage&q&f=false] | ||

+ | |||

+ | 4. Two useful ppt files introducing concepts of logistic regression | ||

+ | [http://www.csun.edu/~ata20315/psy524/docs/Psy524%20lecture%2018%20logistic.pdf] [http://www.daniel-wiechmann.eu/downloads/logreg1.pdf] | ||

== '''Multi-Class Logistic Regression & Perceptron - October 19, 2010''' == | == '''Multi-Class Logistic Regression & Perceptron - October 19, 2010''' == | ||

Line 1,953: | Line 2,507: | ||

These class-conditional probabilities clearly sum to one. <br /><br /> | These class-conditional probabilities clearly sum to one. <br /><br /> | ||

− | In the case of the two-classes problem, it is pretty simple to find the <math>\,\underline{\beta}</math> parameter (the <math>\,\underline{\beta}</math> in two-class logistic regression problems has dimension <math>\,(d+1)\times1</math>), as mentioned in previous lectures. In the multi-class case the iterative Newton method can be used, but here <math>\,\underline{\beta}</math> is of dimension <math>(d+1)\times(k-1)</math> and the weight matrix <math>W</math> is a dense and non-diagonal matrix. This results in a computationally inefficient yet feasible-to-be-solved algorithm. A trick would be to re-parametrize the logistic regression problem. This is done by suitably expanding the following: the input vector <math>\,x</math>, the vector of parameters <math>\,\beta</math>, the vector of responses <math>\,y</math>, as well as the <math>\,\underline{P}</math> vector and the <math>\,W</math> matrix used in the Newton-Raphson updating rule. For interested readers, details regarding this re-parametrization can be found in [http://www.stat.psu.edu/~jiali/course/stat597e/notes2/logit.pdf Jia Li's "Logistic Regression" slides]. | + | In the two-class case it is fairly simple to find the <math>\,\underline{\beta}</math> parameter (there <math>\,\underline{\beta}</math> has dimension <math>\,(d+1)\times1</math>), as mentioned in previous lectures. In the multi-class case the iterative Newton method can still be used, but here <math>\,\underline{\beta}</math> has dimension <math>\ (d+1)\times(k-1)</math> and the weight matrix <math>\ W</math> is dense and non-diagonal. The resulting algorithm is computationally expensive, though still feasible. One trick is to re-parametrize the logistic regression problem by suitably expanding the input vector <math>\,x</math>, the vector of parameters <math>\,\beta</math>, the vector of responses <math>\,y</math>, and the <math>\,\underline{P}</math> vector and <math>\,W</math> matrix used in the Newton-Raphson updating rule. For interested readers, details of this re-parametrization can be found in [http://www.stat.psu.edu/~jiali/course/stat597e/notes2/logit.pdf Jia Li's "Logistic Regression" slides]. Another major difference is that, whereas two-class logistic regression uses the logistic sigmoid function, multi-class logistic regression uses the softmax function. Details regarding the softmax function can be found in [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap4/Chap4-Part3.pdf Sargur N. Srihari's "Logistic Regression" slides]. |

The Newton-Raphson updating rule however, remains the same as it is in the two-classes case, i.e. it is still <math>\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math>. This key point is also addressed in [http://www.stat.psu.edu/~jiali/course/stat597e/notes2/logit.pdf Jia Li's slides] given above. | The Newton-Raphson updating rule however, remains the same as it is in the two-classes case, i.e. it is still <math>\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math>. This key point is also addressed in [http://www.stat.psu.edu/~jiali/course/stat597e/notes2/logit.pdf Jia Li's slides] given above. | ||
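As a concrete illustration of the softmax function mentioned above, here is a minimal NumPy sketch; the max-shift is a standard numerical-stability trick, not part of the course notes:

```python
import numpy as np

def softmax(scores):
    """Map a vector of class scores to class probabilities that sum to one."""
    shifted = scores - np.max(scores)   # shift by the max for numerical stability
    e = np.exp(shifted)
    return e / e.sum()

# three made-up class scores -> a probability vector over the three classes
probs = softmax(np.array([2.0, 1.0, 0.1]))
```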

<br /><br /> | <br /><br /> | ||

− | Note that logistic regression does not assume a distribution for the prior | + | Note that logistic regression models only the conditional distribution <math>\,P(Y|X)</math> and makes no distributional assumption about the inputs, whereas LDA models the joint distribution, assuming Gaussian class-conditional densities and a Bernoulli prior on the class label. <br /><br /> |

− | + | [http://en.wikipedia.org/wiki/Random_multinomial_logit Random multinomial logit] models combine a random ensemble of multinomial logit models for use as a classifier. | |

− | |||

− | === | + | === Multiple Linear Regression in Matlab === |

− | |||

− | + | % Examples: Multiple linear regression in Matlab | |

− | + | % Load data on cars; identify weight and horsepower as predictors and mileage as the response: | |

+ | load carsmall | ||

+ | x1 = Weight; | ||

+ | x2 = Horsepower; % Contains NaN data | ||

+ | y = MPG; | ||

− | + | % Compute regression coefficients for a linear model with an interaction term: | |

− | + | X = [ones(size(x1)) x1 x2 x1.*x2]; | |

+ | b = regress(y,X); % Removes NaN data | ||

− | + | [[File:mra1.jpg]] | |

− | By minimizing D, we minimize the sum of the distances between the misclassified points and the decision boundary.<br /><br /> | + | % Plot the data and the model: |

+ | |||

+ | scatter3(x1,x2,y,'filled','r') | ||

+ | hold on | ||

+ | x1fit = min(x1):100:max(x1); | ||

+ | x2fit = min(x2):10:max(x2); | ||

+ | [X1FIT,X2FIT] = meshgrid(x1fit,x2fit); | ||

+ | YFIT = b(1) + b(2)*X1FIT + b(3)*X2FIT + b(4)*X1FIT.*X2FIT; | ||

+ | mesh(X1FIT,X2FIT,YFIT); | ||

+ | xlabel('Weight'); | ||

+ | ylabel('Horsepower'); | ||

+ | zlabel('MPG'); | ||

+ | view(50,10); | ||
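For comparison, here is the same kind of interaction-term fit in Python/NumPy; since the carsmall dataset is Matlab-specific, the data below are synthetic stand-ins for weight, horsepower, and MPG:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.uniform(1500.0, 4500.0, n)                   # stand-in for "Weight"
x2 = rng.uniform(50.0, 200.0, n)                      # stand-in for "Horsepower"
# made-up response with an interaction effect plus noise
y = 60.0 - 0.01 * x1 - 0.1 * x2 + 2e-5 * x1 * x2 + rng.normal(0.0, 1.0, n)

# design matrix with intercept, main effects, and the interaction term
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
b, _, _, _ = np.linalg.lstsq(X, y, rcond=None)        # plays the role of regress()
```

As with `regress`, `b` holds the intercept, the two main-effect coefficients, and the interaction coefficient.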

+ | |||

+ | === Matlab Code for Multiple Logistic Regression === | ||

+ | % Calculation of gradient and objective for Logistic | ||

+ | % Multi-Class Classification. | ||

+ | % | ||

+ | % function [obj,grad] = mcclogistic(v,Y,V,lambda,l,varargin) | ||

+ | % v - vector of parameters [n*p*l,1] | ||

+ | % Y - rating matrix (labels) [n,m] | ||

+ | % V - the feature matrix [m,p] | ||

+ | % lambda - regularization parameter [scalar] | ||

+ | % l - # of labels (1..l) | ||

+ | % obj - value of objective at v [scalar] | ||

+ | % grad - gradient at v [n*p*l,1] | ||

+ | % | ||

+ | % Written by Jason Rennie, April 2005 | ||

+ | % Last modified: Tue Jul 25 15:08:38 2006 | ||

+ | function [obj,grad] = mcclogistic(v,Y,V,lambda,l,varargin) | ||

+ | fn = mfilename; | ||

+ | if nargin < 5 | ||

+ | error('insufficient parameters') | ||

+ | end | ||

+ | % Parameters that can be set via varargin | ||

+ | verbose = 1; | ||

+ | % Process varargin | ||

+ | paramgt; | ||

+ | |||

+ | t0 = clock; | ||

+ | [n,m] = size(Y); | ||

+ | p = length(v)./n./l; | ||

+ | if p ~= floor(p) | p < 1 | ||

+ | error('dimensions of v and Y don''t match l'); | ||

+ | end | ||

+ | U = reshape(v,n,p,l); | ||

+ | Z = zeros(n,m,l); | ||

+ | for i=1:l | ||

+ | Z(:,:,i) = U(:,:,i)*V'; | ||

+ | end | ||

+ | obj = lambda.*sum(sum(sum(U.^2)))./2; | ||

+ | dU = zeros(n,p,l); | ||

+ | YY = full(Y==0) + Y; | ||

+ | YI = sub2ind(size(Z),(1:n)'*ones(1,m),ones(n,1)*(1:m),YY); | ||

+ | ZY = Z(YI); | ||

+ | for i=1:l | ||

+ | obj = obj + sum(sum(h(ZY-Z(:,:,i)).*(Y~=i).*(Y>0))); | ||

+ | end | ||

+ | ZHP = zeros(n,m); | ||

+ | for i=1:l | ||

+ | ZHP = ZHP + hprime(ZY-Z(:,:,i)).*(Y~=i).*(Y>0); | ||

+ | end | ||

+ | for i=1:l | ||

+ | dU(:,:,i) = ((Y==i).*ZHP - (Y~=i).*(Y>0).*hprime(ZY-Z(:,:,i)))*V + lambda.*U(:,:,i); | ||

+ | end | ||

+ | grad = dU(:); | ||

+ | if verbose | ||

+ | fprintf(1,'lambda=%.2e obj=%.4e grad''*grad=%.4e time=%.1f\n',lambda,obj,grad'*grad,etime(clock,t0)); | ||

+ | end | ||

+ | |||

+ | function [ret] = h(z) | ||

+ | ret = log(1+exp(-z)); | ||

+ | |||

+ | function [ret] = hprime(z) | ||

+ | ret = -(exp(-z)./(1+exp(-z))); | ||

+ | |||

+ | % ChangeLog | ||

+ | % 7/25/06 - Added varargin, verbose | ||

+ | % 3/23/05 - made calculations take better advantage of sparseness | ||

+ | % 3/18/05 - fixed bug in objective (wasn't squaring fro norms) | ||

+ | % 3/1/05 - added objective calculation | ||

+ | % 2/23/05 - fixed bug in hprime() | ||

+ | |||

+ | The code is from [http://people.csail.mit.edu/jrennie/matlab/mcclogistic.m here]. | ||

+ | Click [http://people.csail.mit.edu/jrennie/matlab/ here] for more information. | ||

+ | |||

+ | ===Neural Network Concept[http://en.wikipedia.org/wiki/Neural_network]=== | ||

+ | The concept of constructing an artificial neural network came from scientists who were interested in simulating the human neural network on their computers. They were trying to create computer programs that could learn like people. A neural network is a method in artificial intelligence, and it was originally thought to be a simplified model of neural processing in the brain. Later studies showed that the human neural network is much more complicated, and the structure described here is not a good model for the biological architecture of the brain. Although the neural network was developed in an attempt to synthesize the human brain, in actuality it bears little resemblance to the human neural system. | ||

+ | |||

+ | === Perceptron === | ||

+ | |||

+ | ==== Content ==== | ||

+ | |||

+ | The [http://en.wikipedia.org/wiki/Perceptron perceptron] was invented in 1957 by [http://en.wikipedia.org/wiki/Frank_Rosenblatt Frank Rosenblatt]. It is the basic building block of feed-forward neural networks. The perceptron quickly became very popular after it was introduced, because it was shown to be able to solve many classes of useful problems. However, in 1969, [http://en.wikipedia.org/wiki/Marvin_Minsky Marvin Minsky] and [http://en.wikipedia.org/wiki/Seymour_Papert Seymour Papert] published their book [http://en.wikipedia.org/wiki/Perceptrons_%28book%29 ''Perceptrons''], in which they strongly criticized the perceptron for its inability to solve simple [http://en.wikipedia.org/wiki/XOR exclusive-or (XOR)] problems, which are not linearly separable. Indeed, the simple perceptron and the single hidden-layer perceptron neural network [http://homepages.gold.ac.uk/nikolaev/311perc.htm] cannot solve any problem that is not linearly separable. The authors knew, however, that the multi-layer perceptron neural network can in fact solve problems that are not linearly separable, such as exclusive-or problems, although no efficient learning algorithm was available at that time for this type of network. Because of ''Perceptrons'', interest in perceptrons and neural networks in general declined greatly compared to before the book was published, and it stayed that way until 1986, when the [http://en.wikipedia.org/wiki/Back-propagation back-propagation] learning algorithm (discussed in detail below) for neural networks was popularized. <br /><br /> | ||

+ | |||

+ | We know that the least-squares solution obtained by regressing a -1/+1 response variable <math>\displaystyle Y</math> on observations <math>\displaystyle x</math> leads to the same coefficients as LDA (recall that LDA minimizes the distance between the discriminant function (decision boundary) and the data points). Least squares returns the sign of the linear combination of features as the class label (Figure 2). This concept was called the Perceptron in the engineering literature of the 1950s. <br /><br /> | ||

+ | |||

+ | [[File:Perceptron.jpg|371px|thumb|right| Fig.2 Diagram of a linear perceptron ]] | ||

+ | |||

+ | There is a cost function <math>\,\displaystyle D</math> that the Perceptron tries to minimize:<br /> | ||

+ | |||

+ | <math>D(\underline{\beta},\beta_0)=-\sum_{i \in M}y_{i}(\underline{\beta}^T \underline{x_{i}}+\beta_0)</math><br /> | ||

+ | |||

+ | where <math>\,\displaystyle M</math> is the set of misclassified points. <br><br /> | ||

+ | |||

+ | By minimizing D, we minimize the sum of the distances between the misclassified points and the decision boundary.<br /><br /> | ||

'''Derivation''':'' The distances between the misclassified points and the decision boundary''.<br /><br /> | '''Derivation''':'' The distances between the misclassified points and the decision boundary''.<br /><br /> | ||

Line 2,005: | Line 2,667: | ||

However, this quantity is not always positive. Consider <math>\,y_{i}(\underline{\beta}^T \underline{x_{i}}+\beta_0)</math>. If <math>\underline{x_{i}}</math> is classified ''correctly'' then this product is positive, since both (<math>\underline{\beta}^T\underline{x_{i}}+\beta_0)</math> and <math>\displaystyle y_{i}</math> are positive or both are negative. However, if <math>\underline{x_{i}}</math> is classified ''incorrectly'', then one of <math>(\underline{\beta}^T\underline{x_{i}}+\beta_0)</math> and <math>\displaystyle y_{i}</math> is positive and the other one is negative; hence, the product <math>y_{i}(\underline{\beta}^T \underline{x_{i}}+\beta_0)</math> will be negative for a misclassified point. The "-" sign in <math>D(\underline{\beta},\beta_0)</math> makes this cost function always positive (since only misclassified points are passed to D). <br /><br /> | However, this quantity is not always positive. Consider <math>\,y_{i}(\underline{\beta}^T \underline{x_{i}}+\beta_0)</math>. If <math>\underline{x_{i}}</math> is classified ''correctly'' then this product is positive, since both (<math>\underline{\beta}^T\underline{x_{i}}+\beta_0)</math> and <math>\displaystyle y_{i}</math> are positive or both are negative. However, if <math>\underline{x_{i}}</math> is classified ''incorrectly'', then one of <math>(\underline{\beta}^T\underline{x_{i}}+\beta_0)</math> and <math>\displaystyle y_{i}</math> is positive and the other one is negative; hence, the product <math>y_{i}(\underline{\beta}^T \underline{x_{i}}+\beta_0)</math> will be negative for a misclassified point. The "-" sign in <math>D(\underline{\beta},\beta_0)</math> makes this cost function always positive (since only misclassified points are passed to D). <br /><br /> | ||

+ | |||

+ | === Perceptron in Action === | ||

+ | Here is a Java applet [http://lcn.epfl.ch/tutorial/english/perceptron/html/index.html] which may help with the procedure of Perceptron perception. This applet has been developed in the Laboratory of Computational Neuroscience, University of EPFL, Lausanne, Switzerland. | ||

+ | |||

+ | This second applet [http://www.eee.metu.edu.tr/~alatan/Courses/Demo/AppletPerceptron.html] is developed in the Department of Electrical and Electronics Engineering, Middle East Technical University, Ankara, Turkey. | ||

+ | |||

+ | This third Java applet [http://neuron.eng.wayne.edu/java/Perceptron/New38.html] has been provided by the Computation and Neural Networks Laboratory, College of Engineering, Wayne State University, Detroit, Michigan. | ||

+ | |||

+ | This fourth applet [http://husky.if.uidaho.edu/nn/jdemos/05/Fred%20Corbett/www.etimage.com/java/appletNN/NeuronTyper/MultiLayerPerceptron/perceptron.html] is provided on the official website of the University of Idaho at Idaho Falls. | ||

+ | |||

+ | === Further Reading for Perceptron === | ||

+ | |||

+ | 1. Neural Network Classifiers Estimate Bayesian a posteriori Probabilities | ||

+ | [http://www.mitpressjournals.org/doi/abs/10.1162/neco.1991.3.4.461] | ||

+ | |||

+ | 2. A perceptron network for functional identification and control of nonlinear systems | ||

+ | [http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=286893] | ||

+ | |||

+ | 3. Neural network classifiers estimate Bayesian a posteriori probabilities | ||

+ | [http://www.mitpressjournals.org/doi/abs/10.1162/neco.1991.3.4.461] | ||

==Perceptron Learning Algorithm and Feed Forward Neural Networks - October 21, 2010 == | ==Perceptron Learning Algorithm and Feed Forward Neural Networks - October 21, 2010 == | ||

Line 2,011: | Line 2,693: | ||

To open the Neural Network discussion, we present a formulation of the [http://en.wikipedia.org/wiki/Universal_approximation_theorem universal function approximator]. The mathematical model for Neural Networks is then built upon this formulation. We also discuss the trade-off between training error and testing error -- known as the generalization problem -- under the universal function approximator section. | To open the Neural Network discussion, we present a formulation of the [http://en.wikipedia.org/wiki/Universal_approximation_theorem universal function approximator]. The mathematical model for Neural Networks is then built upon this formulation. We also discuss the trade-off between training error and testing error -- known as the generalization problem -- under the universal function approximator section. | ||

+ | |||

+ | There is useful information in [http://page.mi.fu-berlin.de/rojas/neural/chapter/K4.pdf] by R. Rojas about Perceptron learning. | ||

===Perceptron=== | ===Perceptron=== | ||

Line 2,092: | Line 2,776: | ||

====Some notes on the Perceptron Learning Algorithm==== | ====Some notes on the Perceptron Learning Algorithm==== | ||

− | * If there is access to the training data points in a batch form, | + | * If there is access to the training data points in batch form, it is better to take advantage of a closed-form optimization technique such as least-squares or maximum-likelihood estimation for linear classifiers. (These closed-form solutions existed many years before the invention of the Perceptron.) |

− | * Just like | + | * Just like a linear classifier, a Perceptron can discriminate between only two classes at a time, and one can generalize its performance for multi-class problems by using one of the <math>k-1</math>, <math>k</math>, or <math>k(k-1)/2</math>-hyperplane methods. |

* If the two classes are linearly separable, the algorithm will converge in a finite number of iterations to a hyperplane, which makes the error of training data zero. The convergence is guaranteed if the learning rate is set adequately. | * If the two classes are linearly separable, the algorithm will converge in a finite number of iterations to a hyperplane, which makes the error of training data zero. The convergence is guaranteed if the learning rate is set adequately. | ||

− | * If the two classes are not linearly separable, the algorithm will never converge. So, one may think of a termination criterion in these cases | + | * If the two classes are not linearly separable, the algorithm will never converge. So, one may think of a termination criterion in these cases (e.g. a maximum number of iterations in which convergence is expected, or the rate of changes in both a cost function and its derivative). |

+ | |||

+ | * In the case of linearly separable classes, the final solution and the number of iterations will be dependent on the initial values (which are arbitrarily chosen), the learning rate (for example, fixed or adaptive), and the gap between the two classes. In general, a smaller gap between classes requires a greater number of iterations for the algorithm to converge. | ||

− | * | + | The learning rate (the size of the updating step) has a direct impact on both the number of iterations and the accuracy of the solution of the optimization problem. A smaller learning rate makes convergence slower but yields a more accurate solution; a larger learning rate makes the process faster at the cost of some precision. One therefore balances this trade-off to reach an accurate enough solution quickly enough (exploration vs. exploitation). In addition, an adaptive learning rate that starts with a large value and gradually decreases toward convergence can be used in place of a fixed learning rate. |
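The updates described in these notes can be sketched as a simple Python implementation; the 2-D toy data are made up, and the in-class pseudocode remains the authoritative version:

```python
import numpy as np

def perceptron_train(X, y, rho=0.1, max_epochs=100):
    """Perceptron learning: X is n x d, labels y in {-1, +1}. Returns (beta, beta0)."""
    n, d = X.shape
    beta, beta0 = np.zeros(d), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            if y[i] * (X[i] @ beta + beta0) <= 0:   # misclassified point
                beta += rho * y[i] * X[i]           # gradient step on the cost D
                beta0 += rho * y[i]
                mistakes += 1
        if mistakes == 0:                           # separable data: converged
            break
    return beta, beta0

# linearly separable toy data
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1, 1, -1, -1])
beta, beta0 = perceptron_train(X, y)
```

Because this toy data set is linearly separable, the loop terminates with zero training error, as the convergence note above states.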

− | + | In the upcoming lectures, we introduce the Support Vector Machine (SVM), which uses an iterative optimization scheme similar to the one the Perceptron suggests, but with a different definition of the cost function. | |

− | In the | + | ===An example of determining the learning rate=== |

+ | ( Based on J. Amini Optimum Learning Rate in Back-Propagation Neural Network for Classification | ||

+ | of Satellite Images (IRS-1D) Scientia Iranica, Vol. 15, No. 6, pp. 558-567 ) | ||

+ | |||

+ | The learning rate plays an important role in the application of neural networks (NN). Choosing an optimum learning rate helps us obtain the best model with the fastest possible speed, and in practice the optimum tends to depend on the algorithm used. In the paper above, the author applied networks with one and with two hidden layers to satellite images using Variable Learning Rate (VLR) algorithms and compared the optimum learning rates across the various networks. In practice, the number of neurons should be neither very small nor very large: a network with too few neurons does not have enough degrees of freedom to fit the data, while a network with too many neurons is more likely to overfit, so the number of neurons in the experiment ranges from 3 to 40. In all cases, the optimum learning rate stays within 0.001-0.006. In practice, we could use a similar procedure to estimate the optimum learning rate and improve our models. For more details, please see the article mentioned above. | ||

===Universal Function Approximator=== | ===Universal Function Approximator=== | ||

Line 2,130: | Line 2,820: | ||

===Feed-Forward Neural Network=== | ===Feed-Forward Neural Network=== | ||

− | Neural Network (NN) is one instance | + | Neural Network (NN) is one instance of the universal function approximator. It can be thought of as a system of Perceptrons linked together as units of a network. One particular NN useful for classification is the Feed-Forward Neural Network ([http://www.learnartificialneuralnetworks.com/robotcontrol.html#aproach1 FFNN]), which consists of multiple "hidden layers" of Perceptron units (also known as neurons). Our discussion here is based around the FFNN, which has the topology shown in Figure 1. The neurons in the input layer take the original features (the <math>\,x_i</math>'s) and pass them unchanged as their outputs to the first hidden layer. From the input layer to the last hidden layer, connections from each neuron are always directed to the neurons in the next adjacent layer. In the output layer, which receives input only from the last hidden layer, each neuron produces a target measurement for a distinct class. <math>\,K</math> classes typically require <math>\,K</math> output neurons in the output layer. In the case where the target variable has two values, it suffices to have one output node in the output layer, although it is generally necessary for the single output node to have a sigmoid activation function so as to restrict the output of the neural network to a value between 0 and 1. As shown in Figure 1, the neurons in a single layer are typically drawn vertically, with the inputs and outputs of the network on the far left and far right, respectively. Furthermore, it is often useful to add an extra node to each hidden layer that represents the bias term (or intercept term) of that hidden layer's hyperplane. Each bias node usually outputs a constant value of -1. The purpose of adding a bias node to each hidden layer is to ensure that the hyperplane of that hidden layer does not necessarily have to pass through the origin. In Figure 1, the bias node in the single hidden layer is the topmost hidden node in that layer. |

[[File:FFNN.png|300px|thumb|right|Fig.1 A common architecture for the FFNN]] | [[File:FFNN.png|300px|thumb|right|Fig.1 A common architecture for the FFNN]] | ||

Line 2,155: | Line 2,845: | ||

<math>\hat{y}_k = \sum_{j=1}^{p}\underline{w}_{kj}^T\underline{z}_j, k={1,...,K}</math>. | <math>\hat{y}_k = \sum_{j=1}^{p}\underline{w}_{kj}^T\underline{z}_j, k={1,...,K}</math>. | ||

− | <math>\,\hat y_k</math> is thus the target measurement for the <math>\,k</math>th class. It is not necessary to use an activation function <math>\,\sigma</math> for each of the hidden and output neurons in the case of regression since the outputs are continuous, though it is necessary to use an activation function <math>\,\sigma</math> for each of the hidden and output neurons in the case of classification so as to ensure that the outputs are discrete. | + | <math>\,\hat y_k</math> is thus the target measurement for the <math>\,k</math>th class. It is not necessary to use an activation function <math>\,\sigma</math> for each of the hidden and output neurons in the case of regression since the outputs are continuous, though it is necessary to use an activation function <math>\,\sigma</math> for each of the hidden and output neurons in the case of classification so as to ensure that the outputs are in the <math> [0, 1]</math> interval. |


Notice that in each neuron, two operations take place one after the other: | Notice that in each neuron, two operations take place one after the other: | ||

Line 2,170: | Line 2,861: | ||

[[File:actfcn.png|300px|thumb|right|Fig.3 <math>tanh</math> as activation function]] | [[File:actfcn.png|300px|thumb|right|Fig.3 <math>tanh</math> as activation function]] | ||

− | The NN can be applied as a regression method or as a classifier, and the output layer differs depending on the application. The major difference between regression and classification is in the output space of the model, which is continuous in the case of regression and discrete in the case of classification. For a regression task, no consideration is needed beyond what has already been mentioned earlier, since the outputs of the network would already be continuous. However, to use the neural network as a classifier, as mentioned above, it is necessary to have a threshold stage for each of the hidden and output neurons using an activation function. | + | The NN can be applied as a regression method or as a classifier, and the output layer differs depending on the application. The major difference between regression and classification is in the output space of the model, which is continuous in the case of regression and discrete in the case of classification. For a regression task, no consideration is needed beyond what has already been mentioned earlier, since the outputs of the network would already be continuous. However, to use the neural network as a classifier, as mentioned above, it is necessary to have a threshold stage for each of the hidden and output neurons using an activation function. |
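The forward pass defined by the formulas above can be sketched in Python/NumPy; the layer sizes, the randomly drawn weights, and the input vector are made-up illustrative values:

```python
import numpy as np

rng = np.random.default_rng(1)
d, p, K = 3, 5, 2                          # inputs, hidden units, output classes
U = rng.uniform(-0.5, 0.5, (p, d + 1))     # hidden-layer weights (+1 for bias input)
W = rng.uniform(-0.5, 0.5, (K, p + 1))     # output-layer weights (+1 for bias unit)

def forward(x):
    x = np.append(x, -1.0)                 # bias input with constant value -1
    z = np.tanh(U @ x)                     # hidden activations z_j = tanh(u_j^T x)
    z = np.append(z, -1.0)                 # hidden-layer bias node
    return 1.0 / (1.0 + np.exp(-(W @ z)))  # sigmoid outputs restricted to [0, 1]

y_hat = forward(np.array([0.2, -0.4, 1.0]))
```

Each entry of `y_hat` is the target measurement for one class; for classification one would pick the class with the largest output.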

====Mathematical Model of the FFNN with Multiple Hidden Layers==== | ====Mathematical Model of the FFNN with Multiple Hidden Layers==== | ||

Line 2,235: | Line 2,926: | ||

− | * Propagate | + | * Propagate the outputs of each hidden layer forward, one hidden layer at a time, and calculate the outputs of all hidden neurons. |

− | * Once <math>\underline{x}</math> reaches the output layer, calculate the output(s) of all output neuron(s). | + | * Once <math>\underline{x}</math> reaches the output layer, calculate the output(s) of all output neuron(s) given the outputs of the previous hidden layer. |

Line 2,248: | Line 2,939: | ||

Usually, a fairly large number of epochs is necessary for training the FFNN so that the network weights would be close to being their optimal values. The learning rate <math> \,\rho </math> should be chosen carefully. Usually, <math> \,\rho </math> should satisfy <math> \,\rho \rightarrow 0 </math> as the iteration times <math> i \rightarrow \infty </math>. [http://www.youtube.com/watch?v=fJ7eH0Y7xEM This] is an interesting video depicting the training procedure of the weights of an FFNN using the back-propagation algorithm. | Usually, a fairly large number of epochs is necessary for training the FFNN so that the network weights would be close to being their optimal values. The learning rate <math> \,\rho </math> should be chosen carefully. Usually, <math> \,\rho </math> should satisfy <math> \,\rho \rightarrow 0 </math> as the iteration times <math> i \rightarrow \infty </math>. [http://www.youtube.com/watch?v=fJ7eH0Y7xEM This] is an interesting video depicting the training procedure of the weights of an FFNN using the back-propagation algorithm. | ||

+ | |||

+ | A Matlab implementation of the pseudocode above is given as an example in the Weight Decay subsection under the [[Regularization for Neural Network - November 4, 2010|Regularization]] title. | ||

====Alternative Description of the Back-Propagation Algorithm==== | ====Alternative Description of the Back-Propagation Algorithm==== | ||

Line 2,313: | Line 3,006: | ||

end | end | ||

− | === | + | === The Neural Network Toolbox in Matlab === |

− | * The activation functions are usually linear around the origin. If this is the case, choosing random weights between the <math>\,-0.5</math> and <math>\,0.5</math>, and normalizing the data may boost up the algorithm in the very first steps of the procedure, as the linear combination of the inputs and weights falls within the linear area of the activation function. | + | % Here is a problem consisting of inputs P and targets T that we would like to solve with a network. |

+ | P = [0 1 2 3 4 5 6 7 8 9 10]; | ||

+ | T = [0 1 2 3 4 3 2 1 2 3 4]; | ||

+ | |||

+ | % Here a network is created with one hidden layer of 5 neurons. | ||

+ | net = newff(P,T,5); | ||

+ | |||

+ | % Here the network is simulated and its output plotted against the targets. | ||

+ | Y = sim(net,P); | ||

+ | plot(P,T,P,Y,'o') | ||

+ | |||

+ | [[File:nn1.jpg]] | ||

+ | |||

+ | % Here the network is trained for 50 epochs. Again the network’s output is plotted. | ||

+ | net.trainParam.epochs = 50; | ||

+ | net = train(net,P,T); | ||

+ | Y = sim(net,P); | ||

+ | plot(P,T,P,Y,'o') | ||

+ | |||

+ | [[File:nn2.jpg]] | ||

+ | |||

+ | ====Some notes on the neural network and its learning algorithm==== | ||

+ | |||

+ | * The activation functions are usually linear around the origin. If this is the case, choosing random weights between <math>\,-0.5</math> and <math>\,0.5</math> and normalizing the data may speed up the algorithm in the very first steps of the procedure, as the linear combination of the inputs and weights falls within the linear region of the activation function. | ||

* Learning of the neural network using the backpropagation algorithm takes place in epochs. An epoch is a single pass through the entire training set. | * Learning of the neural network using the backpropagation algorithm takes place in epochs. An epoch is a single pass through the entire training set. | ||
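The initialization and normalization notes above can be sketched as follows; the uniform range, the bias column, and the standardization routine are assumptions for illustration, written in Python rather than the lecture's Matlab.

```python
import random

def init_weights(n_in, n_out, low=-0.5, high=0.5, seed=0):
    # Small random weights keep the initial weighted sums in the
    # near-linear region of a sigmoid-like activation function.
    rng = random.Random(seed)
    return [[rng.uniform(low, high) for _ in range(n_in + 1)]  # +1 for bias
            for _ in range(n_out)]

def normalize(data):
    # Standardize each feature to zero mean and unit variance.
    n = len(data)
    means = [sum(col) / n for col in zip(*data)]
    stds = [max((sum((v - m) ** 2 for v in col) / n) ** 0.5, 1e-12)
            for col, m in zip(zip(*data), means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in data]

W = init_weights(3, 2)
X = normalize([[1.0, 10.0, 100.0], [2.0, 20.0, 200.0], [3.0, 30.0, 300.0]])
```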

Line 2,331: | Line 3,047: | ||

Neural Network with Back-propagation faces some subtleties. | Neural Network with Back-propagation faces some subtleties. | ||

− | Deep Neural Networks became popular two or three years ago, when introduced by Dr. Geoffrey E. Hinton. Deep Neural Network training algorithm deals with the training of a Neural Network with a large number of layers. | + | Deep Neural Networks became popular two or three years ago, when introduced by Dr. Geoffrey E. Hinton, a professor of computer science at the University of Toronto. The Deep Neural Network training algorithm [http://www.cs.toronto.edu/~hinton/absps/ncfast.pdf] deals with the training of a Neural Network with a large number of layers. |

The approach to training the deep network is to first assume the network has only two layers and train these two layers; after that, we train the next two layers, and so on. | The approach to training the deep network is to first assume the network has only two layers and train these two layers; after that, we train the next two layers, and so on. | ||

Line 2,406: | Line 3,122: | ||

* Money Laundering Detection with a Neural-Network | * Money Laundering Detection with a Neural-Network | ||

* Utilizing Fuzzy Logic and Neurofuzzy for Business Advantage | * Utilizing Fuzzy Logic and Neurofuzzy for Business Advantage | ||

+ | |||

+ | === Further readings === | ||

+ | Bishop,C. "Neural Networks for Pattern Recognition" | ||

+ | |||

+ | Haykin, Simon. "Neural Networks. A Comprehensive Foundation" Available [http://www.esnips.com/doc/83becbe7-0fa6-4f90-a7c4-34697b63a8cb/Neural-Networks---A-Comprehensive-Foundation---Simon-Haykin here] | ||

+ | |||

+ | Nilsson,N. "Introduction to Machine Learning", Chapter 4: Neural Networks. Available [http://robotics.stanford.edu/people/nilsson/mlbook.html here] | ||

+ | |||

+ | Brian D. Ripley "Pattern Recognition and Neural Networks" Available [http://books.google.com/books?id=m12UR8QmLqoC&printsec=frontcover&dq=Neural+Networks+for+Pattern+Recognition&hl=en&ei=r3YCTbOlDMiYnAfh_JXmDQ&sa=X&oi=book_result&ct=result&resnum=3&ved=0CDYQ6AEwAg#v=onepage&q&f=false here] | ||

+ | |||

+ | G. Dreyfus "Neural networks: methodology and applications" Available [http://books.google.com/books?id=g2J4J2bLgRQC&printsec=frontcover&dq=Neural+Networks&hl=en&ei=WncCTaimM86lngeg-OzlDQ&sa=X&oi=book_result&ct=result&resnum=3&ved=0CD4Q6AEwAg#v=onepage&q&f=false here] | ||

===References=== | ===References=== | ||

− | + | ||

+ | 1. On fuzzy modeling using fuzzy neural networks with the back-propagation algorithm | ||

+ | [http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=159069] | ||

+ | |||

+ | 2. Thirty years of adaptive neural networks: perceptron, madaline and backpropagation | ||

+ | [http://onlinelibrary.wiley.com/doi/10.1002/9780470231616.app7/pdf] | ||

==Complexity Control - October 26, 2010== | ==Complexity Control - October 26, 2010== | ||

=== Lecture Summary === | === Lecture Summary === | ||

− | Selecting the model structure with an appropriate complexity is a standard problem in pattern recognition and machine learning. Systems with the optimal complexity have a good [http://www.csc.kth.se/~orre/snns-manual/UserManual/node16.html generalization] to | + | Selecting the model structure with an appropriate complexity is a standard problem in pattern recognition and machine learning. Systems with the optimal complexity have a good [http://www.csc.kth.se/~orre/snns-manual/UserManual/node16.html generalization] to yet unobserved data. |

− | A wide range of techniques may be used which alter the system complexity. In this lecture, we present the concepts of over-fitting & under-fitting | + | A wide range of techniques may be used which alter the system complexity. In this lecture, we present the concepts of over-fitting & under-fitting, and an example to illustrate how we choose a good classifier and how to avoid over-fitting. |

− | Moreover, [http://en.wikipedia.org/wiki/Cross-validation_%28statistics%29 cross-validation] has been introduced during the lecture which is a method for estimating generalization error based on " | + | Moreover, [http://en.wikipedia.org/wiki/Cross-validation_%28statistics%29 cross-validation], a method for estimating generalization error based on "re-sampling" (Weiss and Kulikowski 1991; Plutowski, Sakata, and White 1994; Shao and Tu 1995)[1],[2],[3], has been introduced during the lecture. The resulting estimates of generalization error are often used for choosing among various models. The model with the smallest estimated generalization error is selected. Finally, the common types of cross-validation have been addressed. |

− | Before starting | + | Before starting the next section, a short description of model complexity is necessary. As the name suggests, model complexity somehow describes how complicated our model is. Suppose we have a feed forward neural network -- if we increase the number of hidden layers or the number of nodes in a specific layer, it makes sense that our model is becoming more complex. Or, suppose we want to fit a polynomial function on some data points -- if we add to the degree of this polynomial it seems that we are choosing a more complex model. Intuitively, it seems that fitting a more complex model would be better, since we have more degrees of freedom and can get a more exact answer. The next section will explain why this is not the case, and why there is a trade-off between model complexity and optimal result. This makes it necessary to find methods for controlling complexity in model selection. We will see this procedure in an example. |
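The polynomial intuition above can be sketched numerically; the data, the noise values, and the two model families below (a least-squares line versus a full interpolating polynomial) are made-up illustrations, not the lecture's example.

```python
def lagrange_fit(xs, ys):
    # Degree n-1 interpolating polynomial: zero error on the n training
    # points, i.e. a maximally complex model for this data.
    def p(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return p

def line_fit(xs, ys):
    # Degree-1 least-squares fit: a simple model.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return lambda x: a + b * x

# The underlying function is the line y = x; the training targets
# carry fixed, arbitrary "noise" values.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
noise = [0.2, -0.3, 0.25, -0.2, 0.3]
ys = [x + e for x, e in zip(xs, noise)]

complex_model = lagrange_fit(xs, ys)
simple_model = line_fit(xs, ys)

# At a point between training points, the interpolant has chased the
# noise while the line stays near the true function.
x_test = 3.5
err_complex = abs(complex_model(x_test) - x_test)
err_simple = abs(simple_model(x_test) - x_test)
```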

=== Over-fitting and Under-fitting === | === Over-fitting and Under-fitting === | ||

Line 2,429: | Line 3,161: | ||

#Underfitting | #Underfitting | ||

− | Suppose there is no noise in the training data, then we would face no problem with over-fitting, because in this case every training data point lies on the underlying function and the only goal | + | If there were no noise in the training data, then we would face no problem with over-fitting, because in this case every training data point lies on the underlying function, and the only goal would be to build a model just complex enough to pass through every training data point. |

− | However, in the real-world, the training data are [http://en.wikipedia.org/wiki/Statistical_noise noisy], i.e. they tend to not lie on the underlying function | + | However, in the real-world, the training data are [http://en.wikipedia.org/wiki/Statistical_noise noisy], i.e. they tend to not lie exactly on the underlying function, instead they may be shifted to unpredictable locations by random noise. If the model is more complex than what it needs to be in order to accurately fit the underlying function, then it would end up fitting most or all of the training data. Consequently, it would be a poor approximation of the underlying function and have poor prediction ability on new, unseen data. |

− | The | + | The danger of overfitting is that the model becomes susceptible to predicting values outside of the range of training data. It can cause wild predictions in multilayer perceptrons, even with noise-free data. The best way to avoid overfitting is to use lots of training data. Unfortunately, that is not always feasible, and increasing the training data alone does not guarantee that over-fitting will be avoided. The best strategy is to use a large-enough training set and control the complexity of the model. The training set should have a sufficient number of data points which are sampled appropriately, so that it is representative of the whole data space. |

− | In a Neural Network if the | + | In a Neural Network, if the number of hidden layers or nodes is too high, the network will have many degrees of freedom and will learn every characteristic of the training data set. That means it will fit the training set very precisely, but will not be able to generalize the commonality of the training set to predict the outcome of new cases. |

− | Underfitting occurs when the model we picked to describe the data is not complex enough, and has high error rate on the training set. | + | Underfitting occurs when the model we picked to describe the data is not complex enough, and has a high error rate on the training set. |

There is always a trade-off. If our model is too simple, underfitting could occur and if it is too complex, overfitting can occur. | There is always a trade-off. If our model is too simple, underfitting could occur and if it is too complex, overfitting can occur. | ||

'''Example''' | '''Example''' | ||

− | #Consider the example | + | #Consider the example shown in the figure. We have a training set and want to find a model which fits it best. We can find a polynomial of high degree which passes through almost all points in the training set. But in reality, the training set comes from a linear model. Although the complex model has little error on the training set, it diverges from the line in other ranges in which we have no training points. As a result, the high degree polynomial has very poor prediction power on the test cases. This is an example of an overfitted model. |

#Now consider a training set which comes from a polynomial of degree two model. If we model this training set with a polynomial of degree one, our model will have high error rate on the training set, and is not complex enough to describe the problem. | #Now consider a training set which comes from a polynomial of degree two model. If we model this training set with a polynomial of degree one, our model will have high error rate on the training set, and is not complex enough to describe the problem. | ||

− | #Consider a simple classification example | + | #Consider a simple classification example: if our classification rule takes as input only the colour of a fruit and concludes that it is a banana, then it is not a good classifier. The reason is that just because a fruit is yellow does not mean that it is a banana. We can add complexity to our model to make it a better classifier by considering more features, such as size and shape. If we continue to make our model more and more complex in order to improve our classifier, we will eventually reach a point where the quality of our classifier no longer improves, i.e. we have overfit the data. This occurs when we have considered so many features that we have perfectly described the bananas we trained on, but a new banana of a slightly different shape, for example, may not be detected. This is the tradeoff: what is the right level of complexity? |

− | Overfitting occurs when the model | + | Overfitting occurs when the model is too complex and underfitting occurs when it is not complex enough; neither is desirable. To control complexity, it is necessary to make assumptions about the model before fitting the data. For example, we can assume the model belongs to a particular family, such as polynomials or neural networks; there are other choices as well. |

[[File:Family_of_polynomials.jpg|200px|thumb|right|Figure 2: An example of a model with a family of polynomials]] | [[File:Family_of_polynomials.jpg|200px|thumb|right|Figure 2: An example of a model with a family of polynomials]] | ||

Line 2,453: | Line 3,185: | ||

[[File:extrem_model.jpg|400px|thumb|right|Figure 3]] | [[File:extrem_model.jpg|400px|thumb|right|Figure 3]] | ||

− | After the | + | After the structure of the model is determined, the next step is to do model selection. The problem encountered is how to estimate the parameters effectively, especially when we use iterative methods for the estimation. In iterative methods, the key point is to determine the best time to stop updating the parameters. |

− | Let us see a very simple example; | + | Let us see a very simple example; assume the dotted line on the graph can be expressed as a function <math>\,h(x)</math>, and the data points (the circles) are generated by the function with added noise. |

'''Model 1'''(as shown on the left of Figure 3) | '''Model 1'''(as shown on the left of Figure 3) | ||

− | A line can be used to describe the data points, where two | + | A line <math>\,g(x)</math> can be used to describe the data points, where two parameters are needed to construct the estimated function. However, it is clear that it performs badly. This model is a typical example of an underfitted model. In this case, the model's predictions are stable (low variance), but a large bias is generated. |

'''Model 2''' (as shown on the right of Figure 3) | '''Model 2''' (as shown on the right of Figure 3) | ||

− | + | In this model, lots of parameters are used to fit the data. Although it looks like a fairly good fit, the prediction performance could be very bad. This means that this model will generate a large variance when we use it on points not part of the training data. | |

− | The models above are the extreme | + | The models above are the extreme cases in the model selection, we do not want to choose any of them in our classification task. The key is to stop our training process at the optimal time, such that a balance of bias and variance is obtained, that is, the time t in the following graph. |

[[File:optimal_time.jpg|300px|thumb|right|Figure 4]] | [[File:optimal_time.jpg|300px|thumb|right|Figure 4]] | ||

− | To achieve | + | To achieve this goal, one approach we can use is to divide our data points into two groups: one (the training set) is used in the training process to obtain parameters, and the other (the validation set) is used to determine the optimal model. After every update of the parameters, the model is evaluated on the validation set and the error curve is plotted to find the optimal point <math>\,t</math>. Here, the validation error is a good measure of generalization. Remember not to update the parameters during validation. If another, independent test is needed after validation, three independent groups should be determined at the beginning. In addition, this approach works best when many data points are available, since the effect of noise is then minimized. |

− | So far, we have | + | So far, we have learned two of the most popular ways to estimate the expected level of fit of a model to a test data set that is independent of the data used to train the model: |

:1. Cross validation | :1. Cross validation | ||

− | :2. Regularization: refers to a series of techniques we can use to suppress overfitting,that is, making our function not so curved | + | :2. Regularization: refers to a series of techniques we can use to suppress overfitting, that is, making our function not so curved that it performs badly in prediction. The specific way is to add a penalty term to the error function, which prevents the weights from growing too large when they are updated at each iteration. |
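The penalty-term idea can be sketched in a single update step; the quadratic (weight decay) penalty and the constants below are illustrative assumptions.

```python
def weight_decay_step(w, grad, rho=0.1, lam=0.01):
    # One gradient step on the penalized error E(w) + (lam / 2) * sum(wi^2).
    # The penalty's gradient, lam * wi, shrinks each weight toward zero.
    return [wi - rho * (gi + lam * wi) for wi, gi in zip(w, grad)]

w = [2.0, -3.0]
# With a zero error gradient, only the penalty acts: the weights shrink.
w_new = weight_decay_step(w, [0.0, 0.0])
```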

Indeed, there are many techniques that could be used, such as: | Indeed, there are many techniques that could be used, such as: | ||

Line 2,478: | Line 3,210: | ||

===='''Note'''==== | ===='''Note'''==== | ||

− | When the model is linear, the true error form AIC approach is identical to that from Cp approach; | + | When the model is linear, the true error from the AIC approach is identical to that from the Cp approach; when the model is nonlinear, they are different. |

=== '''How do we choose a good classifier?''' === | === '''How do we choose a good classifier?''' === | ||

Line 2,491: | Line 3,223: | ||

<span id="prediction-error">[[File:Prediction_Error.jpg|200px|thumb|right|Figure 3]]</span> | <span id="prediction-error">[[File:Prediction_Error.jpg|200px|thumb|right|Figure 3]]</span> | ||

− | There is a downward bias to training error estimate | + | There is a downward bias to the training error estimate; it is always less than the true error rate. |

− | If there is a change in our complexity from low to high, our training (empirical) error rate is always | + | If there is a change in our complexity from low to high, our training (empirical) error rate always decreases. When we apply our model to the test data, our error rate will decrease to a point, but then it will increase because the model has not seen the test data points before. This results in a convex test error curve as a function of learning model complexity. The training error will decrease when we keep fitting increasingly complex models, but as we have seen, a model too complex will not generalize well, resulting in a large test error. |

We use our test data (from the test sample line shown on Figure 2) to get our true error rate. | We use our test data (from the test sample line shown on Figure 2) to get our true error rate. | ||

− | Right complexity is defined as where true error rate ( the error rate associated with the test data) is minimum; | + | The right complexity is defined as the point where the true error rate (the error rate associated with the test data) is at its minimum; this is one idea behind complexity control. |

[[File:Bias.jpg|200px|thumb|left|Figure 4]] | [[File:Bias.jpg|200px|thumb|left|Figure 4]] | ||

Line 2,512: | Line 3,244: | ||

One desired property of the estimator is that it is correct on average, that is, it is unbiased. <math>Bias (\hat{f}) = E(\hat{f}) - f=0</math>. | One desired property of the estimator is that it is correct on average, that is, it is unbiased. <math>Bias (\hat{f}) = E(\hat{f}) - f=0</math>. | ||

− | However, there is a more important property for an estimator than just being unbiased: | + | However, there is a more important property for an estimator than just being unbiased: low mean squared error. In statistics, there are problems for which it may be good to use an estimator with a small bias. In some cases, an estimator with a small bias may have a smaller mean squared error or be median-unbiased (rather than mean-unbiased, the standard unbiasedness property). The property of median-unbiasedness is invariant under transformations while the property of mean-unbiasedness may be lost under nonlinear transformations. For example, if we use an unbiased estimator with a large mean squared error to estimate the parameter, we risk a large error. In contrast, a biased estimator with a small mean squared error will improve the precision of our predictions. |

Hence, our goal is to minimize <math>MSE (\hat{f})</math>. | Hence, our goal is to minimize <math>MSE (\hat{f})</math>. | ||

Line 2,519: | Line 3,251: | ||

<math>MSE (\hat{f})=Variance (\hat{f})+Bias ^2(\hat{f}) </math>. Thus given the Mean Squared Error (MSE), if we have a low bias, then we will have a high variance and vice versa. | <math>MSE (\hat{f})=Variance (\hat{f})+Bias ^2(\hat{f}) </math>. Thus given the Mean Squared Error (MSE), if we have a low bias, then we will have a high variance and vice versa. | ||

− | A test error is a good estimation | + | '''Algebraic Proof''': |

+ | |||

+ | <math>MSE (\hat{f}) = E[(\hat{f} - f)^2] = E[(\hat{f} - E(\hat{f}) + E(\hat{f}) - f)^2]</math> | ||

+ | |||

+ | <math>= E[(\hat{f} - E(\hat{f}))^2 + (E(\hat{f}) - f)^2 + 2(\hat{f} - E(\hat{f}))(E(\hat{f}) - f)]</math> | ||

+ | |||

+ | <math>= E(\hat{f} - E(\hat{f}))^2 + E(E(\hat{f}) - f)^2 + E(2(\hat{f} - E(\hat{f}))(E(\hat{f}) - f))</math> | ||

+ | |||

+ | By definition, | ||

+ | |||

+ | <math>E(\hat{f} - E(\hat{f}))^2 = Var(\hat{f})</math> | ||

+ | |||

+ | <math>(E(\hat{f}) - f)^2 = Bias^2(\hat{f})</math> | ||

+ | |||

+ | So we must show that: | ||

+ | |||

+ | <math>E(2(\hat{f} - E(\hat{f}))(E(\hat{f}) - f)) = 0</math> | ||

+ | |||

+ | <math>E(2(\hat{f} - E(\hat{f}))(E(\hat{f}) - f)) = 2E(\hat{f}E(\hat{f}) - \hat{f}f - E(\hat{f})E(\hat{f}) + E(\hat{f})f)</math> | ||

+ | |||

+ | <math>2(E(\hat{f})E(\hat{f}) - E(\hat{f})f - E(\hat{f})E(\hat{f}) + E(\hat{f})f) = 0</math> | ||

+ | |||

+ | |||

+ | A test error is a good estimate of the MSE. We want a somewhat balanced bias and variance (neither one too high), although the estimate will still have some bias. | ||
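The decomposition above can be checked numerically; the estimator below (the true value plus a fixed 0.5 bias plus Gaussian noise) is a made-up illustration.

```python
import random

def mse_decomposition(estimates, f_true):
    # Sample versions of MSE, variance, and squared bias; the identity
    # MSE = Var + Bias^2 holds exactly for these sample quantities.
    n = len(estimates)
    mean_est = sum(estimates) / n
    mse = sum((e - f_true) ** 2 for e in estimates) / n
    var = sum((e - mean_est) ** 2 for e in estimates) / n
    bias_sq = (mean_est - f_true) ** 2
    return mse, var, bias_sq

rng = random.Random(1)
f_true = 5.0
# A deliberately biased, noisy estimator of f_true.
estimates = [f_true + 0.5 + rng.gauss(0.0, 1.0) for _ in range(10000)]
mse, var, bias_sq = mse_decomposition(estimates, f_true)
```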

+ | |||

+ | === References === | ||

+ | |||

+ | 1. A Comparison of Prediction Accuracy, Complexity, and Training Time of Thirty-Three Old and New Classification Algorithms | ||

+ | [http://www.springerlink.com/content/u751321011502645.pdf] | ||

− | + | 2. Model complexity control and statistical learning theory | |

+ | [http://www.springerlink.com/content/wh40jlnrbr6cnh9x/] | ||

+ | |||

+ | 3. On Dimensionality, Sample Size, Classification Error, and Complexity of Classification Algorithm in Pattern Recognition | ||

+ | [http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4767011] | ||

+ | |||

+ | 4. Overfitting, Underfitting and Model Complexity | ||

+ | [http://www.chemometrie.com/phd/2_8_1.html] | ||

=== Avoid Overfitting === | === Avoid Overfitting === | ||

Line 2,548: | Line 3,315: | ||

=== Cross-Validation === | === Cross-Validation === | ||

+ | |||

+ | '''Cross-validation''', sometimes called '''rotation estimation''', is a technique for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice. One round of cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (called the ''training set''), and validating the analysis on the other subset (called the ''validation set'' or ''testing set''). To reduce variability, multiple rounds of cross-validation are performed using different partitions, and the validation results are averaged over the rounds. | ||

[[File:Cv.jpg|200px|thumb|right|Figure 1: Illustration of Cross-Validation]] | [[File:Cv.jpg|200px|thumb|right|Figure 1: Illustration of Cross-Validation]] | ||

Line 2,574: | Line 3,343: | ||

[[File:k-fold.png|350px|thumb|right|Figure 2: K-fold cross-validation]] | [[File:k-fold.png|350px|thumb|right|Figure 2: K-fold cross-validation]] | ||

The results from the method above may differ significantly based on the initial choice of T and V. Therefore, we improve simple cross-validation by introducing K-fold cross-validation. | The results from the method above may differ significantly based on the initial choice of T and V. Therefore, we improve simple cross-validation by introducing K-fold cross-validation. | ||

− | The advantage of K-fold cross validation is that all the values in the dataset are eventually used for both training and testing. | + | The advantage of K-fold cross validation is that all the values in the dataset are eventually used for both training and testing. When using K-fold cross validation, the number of folds must be considered. If the user has a large data set, then more folds can be used, because a smaller portion of the total data is needed to train the classifier; this leaves more test data and therefore gives a better estimate of the test error. Unfortunately, the more folds one uses, the longer the cross-validation will run. If the user has a small data set, then fewer, larger folds must be taken to properly train the classifier. |

In this case, the algorithm is: | In this case, the algorithm is: | ||

Line 2,581: | Line 3,350: | ||

: 1) Randomly divide the data into K parts with approximately equal size | : 1) Randomly divide the data into K parts with approximately equal size | ||

− | |||

− | |||

− | |||

− | |||

− | |||

− | |||

: 2) For k = 1,...,K | : 2) For k = 1,...,K | ||

Line 2,623: | Line 3,386: | ||

− | Leave-one-out cross-validation is similar to k-fold validation by selecting sets of equal size for error estimation. Leave-one-out cross-validation instead removes a single data point, with n-partitions. Each partition is used systematically for testing exactly once whereas the remaining partitions are used for training. For example, we estimate the <math>\,n-1</math> data points with <math>\,m</math> linear models over the <math>\,n</math> sets, and compare the average error rates.The leave-one-out error is the average error over all partitions.<br /> | + | Leave-one-out cross-validation is similar to k-fold validation by selecting sets of equal size for error estimation. Leave-one-out cross-validation instead removes a single data point, with <math>\,n</math> partitions. Each partition is used systematically for testing exactly once whereas the remaining partitions are used for training. For example, we estimate the <math>\,n-1</math> data points with <math>\,m</math> linear models over the <math>\,n</math> sets, and compare the average error rates of the <math>\,m</math> linear models. The leave-one-out error is the average error over all partitions.<br /> |

− | |||

− | |||

Line 2,645: | Line 3,406: | ||

Leave-one-out cross-validation can perform poorly in comparison to k-fold validation. A paper by Breiman compares k-fold (leave-many-out) cross-validation to leave-one-out cross-validation, noting that average prediction loss and downward bias increase from k-fold to leave-one-out <ref>Breiman, L. ''Heuristics of instability and stabilization in model selection,'' Annals of Statistics, 24, 2350-2383 (1996).</ref>. This can be explained by the lower bias of leave-one-out validation, causing an increase in variance. The bias is relative to the size of the sample set compared to the training set [http://en.wikipedia.org/wiki/Cross-validation_%28statistics%29#Leave-one-out_cross-validation]. As such, as k becomes larger, it becomes more biased and has less variance. Similarly, larger data sets will direct the bias toward zero.<br /><br /> | Leave-one-out cross-validation can perform poorly in comparison to k-fold validation. A paper by Breiman compares k-fold (leave-many-out) cross-validation to leave-one-out cross-validation, noting that average prediction loss and downward bias increase from k-fold to leave-one-out <ref>Breiman, L. ''Heuristics of instability and stabilization in model selection,'' Annals of Statistics, 24, 2350-2383 (1996).</ref>. This can be explained by the lower bias of leave-one-out validation, causing an increase in variance. The bias is relative to the size of the sample set compared to the training set [http://en.wikipedia.org/wiki/Cross-validation_%28statistics%29#Leave-one-out_cross-validation]. As such, as k becomes larger, it becomes more biased and has less variance. Similarly, larger data sets will direct the bias toward zero.<br /><br /> | ||

− | ==== | + | ====k × 2 cross-validation==== |

+ | This is a variation on k-fold cross-validation. For each fold, we randomly assign data points to two sets d0 and d1, so that both sets are of equal size (this is usually implemented by shuffling the data array and then splitting it in two). We then train on d0 and test on d1, followed by training on d1 and testing on d0. | ||

+ | This has the advantage that our training and test sets are both large, and each data point is used for both training and validation on each fold. In general, k = 5 (resulting in 10 training/validation operations) has been shown to be the optimal value of k for this type of cross-validation. | ||
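A minimal sketch of the k × 2 procedure above, in Python; the mean-predicting model and the toy data are hypothetical stand-ins for a real classifier and error measure.

```python
import random

def k_times_2_cv(data, train_and_error, k=5, seed=0):
    """k x 2 cross-validation: each round shuffles the data, splits it in
    half, and uses each half once for training and once for testing."""
    rng = random.Random(seed)
    errors = []
    for _ in range(k):
        shuffled = data[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        d0, d1 = shuffled[:half], shuffled[half:]
        errors.append(train_and_error(d0, d1))
        errors.append(train_and_error(d1, d0))
    return sum(errors) / len(errors)

# Hypothetical model: predict the mean of the training targets;
# the error is the mean squared error on the test half.
def mean_model_error(train, test):
    mean = sum(y for _, y in train) / len(train)
    return sum((y - mean) ** 2 for _, y in test) / len(test)

data = [(x, 2.0 * x) for x in range(10)]
avg_error = k_times_2_cv(data, mean_model_error, k=5)
```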

* One-item-out: [http://biomet.oxfordjournals.org/content/64/1/29.abstract Asymptotics for and against cross-validation] | * One-item-out: [http://biomet.oxfordjournals.org/content/64/1/29.abstract Asymptotics for and against cross-validation] | ||

* [http://www.springerlink.com/content/tfvyva1cqvtqacvy/fulltext.pdf Leave-one-out style crossvalidation bound for Kernel methods applied to some classification and regression problems] | * [http://www.springerlink.com/content/tfvyva1cqvtqacvy/fulltext.pdf Leave-one-out style crossvalidation bound for Kernel methods applied to some classification and regression problems] | ||

+ | |||

+ | === Matlab Code for Cross Validation === | ||

+ | 1. Generate cross validation index using matlab build-in function 'crossvalind.m'. Click [http://www.mathworks.com/help/toolbox/bioinfo/ref/crossvalind.html here] for details. | ||

+ | |||

+ | 2. Use 'cvpartition.m' to partition data. Click [http://www.mathworks.com/help/toolbox/stats/cvpartition.html here]. | ||

+ | |||

+ | === Further Reading === | ||

+ | 1. A useful PDF introducing the concepts of cross-validation. [http://www.autonlab.org/tutorials/overfit10.pdf] | ||

=== References === | === References === | ||

Line 2,657: | Line 3,428: | ||

3. Shao, J. and Tu D. (1995). The Jackknife and Bootstrap. Springer, New York. | 3. Shao, J. and Tu D. (1995). The Jackknife and Bootstrap. Springer, New York. | ||

+ | |||

+ | 4. http://en.wikipedia.org/wiki/Cross-validation_(statistics) | ||

== Radial Basis Function (RBF) Network - October 28, 2010== | == Radial Basis Function (RBF) Network - October 28, 2010== | ||

− | + | ||

− | |||

− | |||

− | |||

[[File:Rbf_net.png|350px|thumb|right|Figure 1: Radial Basis Function Network]] | [[File:Rbf_net.png|350px|thumb|right|Figure 1: Radial Basis Function Network]] | ||

Line 2,674: | Line 3,444: | ||

* and no weights from the first layer to the hidden layer. | * and no weights from the first layer to the hidden layer. | ||

− | An RBF network can be trained without back propagation since it has a closed-form solution. The neurons in the hidden layer contain basis functions. A common basis function | + | An RBF network can be trained without back propagation since it has a closed-form solution. The neurons in the hidden layer contain basis functions. A common basis function for RBF networks is a Gaussian function without the scaling factor. |

* Note: [http://ibiblio.org/e-notes/Splines/Intro.htm Spline], RBF, [http://www.aaai.org/Papers/Workshops/1999/WS-99-04/WS99-04-008.pdf Fourier], and similar methods differ only in the basis function.<br />

RBF networks were first used to solve multivariate interpolation problems in numerical analysis. They are used similarly in neural network applications, where the training and query targets are continuous. RBF networks are artificial neural networks that can be applied to regression, classification, and time series prediction.

For example, consider <math>\,n</math> data points along a one-dimensional line and <math>\,m</math> clusters. An RBF network with radial basis (Gaussian) functions will cluster points around the <math>\,m</math> means, <math>\displaystyle\mu_{j}</math> for <math>j= 1, ..., m</math>. The other data points will be distributed normally around these centers.

* Note: The hidden layer can have a variable number of basis functions (the optimal number of basis functions can be determined using the complexity control techniques discussed in the previous section). As usual, the more basis functions in the hidden layer, the higher the model complexity will be.<br />

+ | |||

RBF networks, K-means clustering, Probabilistic Neural Networks (PNN) and General Regression Neural Networks (GRNN) are closely related. The main difference is that PNN/GRNN networks have one neuron for each point in the training file, whereas the number of RBF network neurons (basis functions) is not fixed and is usually much smaller than the number of training points. When the training set is not very large, PNN and GRNN perform well, but for large data sets RBF networks are more useful, since PNN/GRNN become impractical.

+ | |||

====A brief introduction to the K-means algorithm====

K-means is a commonly applied clustering technique, which aims to divide <math>\,n</math> observations into <math>\,k</math> groups by computing the distance from each individual observation to the <math>\,k</math> cluster centers. A typical K-means algorithm can be described as follows:

Step 1: Select <math>\,k</math> as the number of clusters.

Step 2: Randomly select <math>\,k</math> of the <math>\,n</math> observations to be used as the <math>\,k</math> initial centers.

Step 3: For each of the remaining observations, compute the distance to each of the <math>\,k</math> centers and assign it to the cluster with the minimum distance.

Step 4: Obtain updated cluster centers by computing the mean of all the observations in the corresponding clusters.

Step 5: Repeat Step 3 and Step 4 until all of the differences between the old and new cluster centers are acceptable.
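The steps above can be sketched in plain Python; this is a minimal 1-D implementation, and the sample points are invented for illustration:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means on 1-D points, following the five steps above."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)                 # Step 2: random initial centers
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in points:                            # Step 3: assign to nearest center
            nearest = min(range(k), key=lambda j: abs(x - centers[j]))
            clusters[nearest].append(x)
        new_centers = [sum(c) / len(c) if c else centers[j]
                       for j, c in enumerate(clusters)]  # Step 4: recompute means
        if all(abs(a - b) < 1e-9 for a, b in zip(new_centers, centers)):
            break                                   # Step 5: centers have settled
        centers = new_centers
    return sorted(centers)

pts = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
centers = kmeans(pts, 2)
print(centers)   # two centers, one near 0.1 and one near 5.1
```

In an RBF network the resulting centers would serve as the means <math>\mu_j</math> of the Gaussian basis functions.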

====Typical Radial Functions====

Gaussian :

<math>\ \phi(r) = e^{- \frac{r^{2}}{2 \sigma^2}} </math>

Hardy Multi-quadratic :

<math>\ \phi(r) = \frac{\sqrt{r^2+c^2}}{c} , c>0 </math>

Inverse Multi-quadratic :

<math>\ \phi(r) = \frac{c}{\sqrt{r^2+c^2}} , c>0 </math>
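These three radial functions can be evaluated directly; a minimal sketch, where the defaults σ = 1 and c = 1 are arbitrary choices for illustration:

```python
import math

def gaussian(r, sigma=1.0):
    # phi(r) = exp(-r^2 / (2 sigma^2))
    return math.exp(-r**2 / (2 * sigma**2))

def hardy_multiquadric(r, c=1.0):
    # phi(r) = sqrt(r^2 + c^2) / c, increasing in r
    return math.sqrt(r**2 + c**2) / c

def inverse_multiquadric(r, c=1.0):
    # phi(r) = c / sqrt(r^2 + c^2), decreasing in r
    return c / math.sqrt(r**2 + c**2)

# All three depend only on the distance r, and all equal 1 at r = 0.
for phi in (gaussian, hardy_multiquadric, inverse_multiquadric):
    print(phi(0.0), phi(1.0))
```

Note that the Gaussian and inverse multi-quadratic decay with distance (localized bases), while the Hardy multi-quadratic grows with distance.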

==== Reference for the above paragraph ====


[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.97.548&rep=rep1&type=pdf]

2. GA-RBF: A self-optimising RBF network

[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.60.7406&rep=rep1&type=pdf]


==== RBF Network's Hidden Layer ====

The hidden layer has <math>\, m</math> neurons, where the optimal value of <math>\, m</math> can be determined using the cross-validation techniques discussed in the previous section.

For example, if the data is generated from a mixture of Gaussian distributions, you can cluster the data and estimate each Gaussian distribution's mean and variance by the [http://en.wikipedia.org/wiki/Expectation-maximization_algorithm EM algorithm]. These means and variances can then be used to construct the basis functions. Each neuron consists of a basis function of an input layer point <math>\underline x_{i}</math> referred to as <math>\,\Phi_{j}(\underline x_{i}) </math> where <math>\, j \in \{1 ... m\}</math> and <math>\, i \in \{1 ... n\}</math>. <br>

* Note: In the following section, <math>k</math> is the number of outputs, <math>n</math> is the number of data points, and <math>m</math> is the number of hidden units. If <math>\,k = 1</math>, <math>\,\hat Y</math> and <math>\,W</math> are column vectors. <br>


* Note: An RBF function <math>\Phi</math> is a real-valued function whose value depends only on the distance from a centre <math>\underline c</math>, such that <math>\Phi(\underline x,\underline c) = \Phi(\|\underline x - \underline c \|)</math>. Other commonly used radial basis functions are Multiquadric, Polyharmonic spline, and Thin plate spline.

− | |||

:<math>\Phi_{n,m} = \left[ \begin{matrix}


<math> \,= E[(f(x) + \epsilon - \hat f(x))^2]</math><br>

<math> \,= E[(f(x) - \hat f(x))^2 + \epsilon^2 + 2\epsilon(f(x) - \hat f(x))]</math><br>

The part of the error term we want to approximate is <math>\, E[(f(x) - \hat f(x))^2] </math>. We will try to estimate this by finding the other terms of the above expression. See the lecture titled "Model Selection for RBF Network", November 2, 2010, below.

==== Conceptualizing RBF Networks ====


An Example of RBF Networks [http://reference.wolfram.com/applications/neuralnetworks/ApplicationExamples/12.1.2.html]

+ | |||

This paper suggests an objective approach to determining proper samples for finding accurate RBF networks [http://www.wseas.us/e-library/conferences/2009/hangzhou/MUSP/MUSP41.pdf].

+ | |||

=====Improvement for RBF Neural Networks Based on Cloud Theory=====

Based on cloud theory, an improved algorithm for RBF neural networks transforms the problem of determining the center and corresponding bandwidth of each RBF cluster into determining the parameters of a normal cloud model. This gives the output of each hidden layer vague and random properties, so the randomness of the data is preserved and passed to the output layer. The authors conclude that the improved algorithm is superior to the classical RBF network in prediction and performs well in practice; the improvement can also be transplanted to other RBF neural network algorithms. For more information, see Lingfang Sun, Shouguo Wang, Ce Xu, Dong Ren, Jian Zhang, "Research on the improvement for RBF neural networks based on cloud theory," Proceedings of the World Congress on Intelligent Control and Automation (WCICA), pp. 3110-3113, 2008.

+ | |||

=== Radial Basis Approximation Implementation ===

+ | |||

% This code uses the built-in NEWRB MATLAB function to create a radial basis network that
% approximates a function defined by a set of data points.

% Define 21 inputs P and associated targets T.

P = -1:.1:1;
T = [-.9602 -.5770 -.0729 .3771 .6405 .6600 .4609 ...
     .1336 -.2013 -.4344 -.5000 -.3930 -.1647 .0988 ...
     .3072 .3960 .3449 .1816 -.0312 -.2189 -.3201];
plot(P,T,'+');
title('Training Vectors');
xlabel('Input Vector P');
ylabel('Target Vector T');

% We would like to find a function which fits the 21 data points. One way to do
% this is with a radial basis network. A radial basis network is a network with
% two layers: a hidden layer of radial basis neurons and an output layer of
% linear neurons. Here is the radial basis transfer function used by the hidden
% layer.

p = -3:.1:3;
a = radbas(p);
plot(p,a)
title('Radial Basis Transfer Function');
xlabel('Input p');
ylabel('Output a');

% The weights and biases of each neuron in the hidden layer define the position
% and width of a radial basis function. Each linear output neuron forms a
% weighted sum of these radial basis functions. With the correct weight and
% bias values for each layer, and enough hidden neurons, a radial basis network
% can fit any function with any desired accuracy. In this example, three
% radial basis functions are scaled and summed to produce a function.

a2 = radbas(p-1.5);
a3 = radbas(p+2);
a4 = a + a2*1 + a3*0.5;
plot(p,a,'b-',p,a2,'b--',p,a3,'b--',p,a4,'m-')
title('Weighted Sum of Radial Basis Transfer Functions');
xlabel('Input p');
ylabel('Output a');

[[File:rbf1.jpg]]

% The function NEWRB quickly creates a radial basis network which approximates
% the function defined by P and T. In addition to the training set and targets,
% NEWRB takes two arguments, the sum-squared error goal and the spread constant.

eg = 0.02; % sum-squared error goal
sc = 1;    % spread constant
net = newrb(P,T,eg,sc);

% To see how the network performs, replot the training set. Then simulate the
% network response for inputs over the same range. Finally, plot the results on
% the same graph.

plot(P,T,'+');
xlabel('Input');
X = -1:.01:1;
Y = sim(net,X);
hold on;
plot(X,Y);
hold off;
legend({'Target','Output'})

+ | |||

=== Linear Basis Network ===

A piece-wise linear trajectory may be modeled more easily and meaningfully using linear bases rather than Gaussians. To replace Gaussians with linear bases, one may need to replace the clustering algorithm and modify the basis function. The C-varieties algorithm developed by H. H. Bock and J. C. Bezdek is a good way of finding linear clusters (for details on this algorithm you may want to refer to Fuzzy Cluster Analysis, by Frank Hoeppner et al.). The resulting linear functions should then be accompanied by Gaussian functions, or any other localizing functions, to localize the working area of each of the linear clusters. The combination is a general function approximator, just like the RBF network, NN, etc.

+ | |||

<math>\begin{align}\hat{y}=\sum_{i=1}^{P}f_i(x)\Phi_i(x) \end{align}</math>

+ | |||

where <math>\ f_i(x)</math> is the linear function coming from the clustering stage and corresponding to the <math>i</math>th cluster, and <math>\ \Phi_i</math> is the validity or localizing function corresponding to the <math>i</math>th cluster.
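As a toy illustration of this combination — the two linear pieces and the Gaussian validity functions below are invented for the example, not produced by the C-varieties algorithm:

```python
import math

# Two hypothetical local linear models f_i, one per cluster.
def f1(x): return 2.0 * x           # fitted line for the cluster near x = 0
def f2(x): return 4.0 - 2.0 * x     # fitted line for the cluster near x = 2

def phi(x, center, width=0.5):
    """Gaussian validity function localizing each linear piece."""
    return math.exp(-(x - center) ** 2 / (2 * width ** 2))

def y_hat(x):
    # The combination from the formula above: sum_i f_i(x) * Phi_i(x)
    return f1(x) * phi(x, 0.0) + f2(x) * phi(x, 2.0)

# Near each cluster center the corresponding linear piece dominates,
# because the other validity function is essentially zero there.
print(round(y_hat(0.0), 3), round(y_hat(2.0), 3))
```

Away from both centers, both validity functions vanish; in practice the validity functions are often normalized so that they sum to one at every <math>x</math>.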

== '''Model Selection for RBF Network (Stein's Unbiased Risk Estimator) - November 2nd, 2010''' ==


However, training error and testing error do not demonstrate a linear relationship. In particular, a smaller training error does not necessarily result in a smaller testing error. In practice, one often observes that up to a certain point the model error on testing data tends to decrease as the training error decreases. However, if one attempts to decrease the training error too much by increasing the model complexity, the testing error often can take a dramatic turn and begin to increase. This behavior was explained and related figures illustrating this concept were provided in the lecture on complexity control on October 26th.

[[File:data_noise.jpg|500px|thumb|right|Figure 1. Data sampled from a smooth function (in black) cannot be over-fit. Data sampled from a smooth function with noise (in red) can be over-fit when the noise is modeled along with the smooth function.]]

The basic reason behind this phenomenon of the training and testing errors is that in the process of minimizing training error, after a certain point, the model begins to over-fit the training set. Over-fitting in this context means fitting the model to the training data at the expense of losing generality. As seen in Figure 1, the red data points have been over-fit as the general form of the underlying smooth function has been lost in the red-curve model. In the extreme case, a set of <math>\displaystyle N</math> training data points can be modeled exactly with <math>\displaystyle N</math> radial basis functions. Such a model will fit the training data set perfectly. However, the perfectly-fit model fails to be as accurate or perform as well on the testing data set because it has modeled not only the true function <math>\displaystyle f(X)</math> but the random noise as well, and thus has over-fit the data (as the red curve in Figure 1 has done). It is interesting to note that in the case of no noise, over-fitting will not occur and hence the complexity of the model can be increased without bound. However, this is not realistic in practice as random noise is almost always present in the data.

All in all, we can expect that the training error will be an overly optimistic estimate of the testing error. An obvious way to estimate the testing error is to add a penalty term to the training error to compensate for the difference. SURE, a technique developed by Charles Stein, a professor of statistics at Stanford University, is based on this idea.

===Stein's unbiased risk estimate (SURE)===

Stein's unbiased risk estimate (SURE) is an unbiased estimator of the mean-squared error of a given estimator in a deterministic estimation scenario. In other words, it provides an indication of the accuracy of a given estimator. This is important since, in deterministic estimation, the true mean-squared error of an estimator generally depends on the value of the unknown parameter, and thus cannot be determined completely. A standard application of SURE is to choose a parametric form for an estimator, and then optimize the values of the parameters to minimize the risk estimate.

Stein's unbiased risk estimation (SURE) theory gives a rigorous definition of the degrees of freedom for any fitting procedure [http://www.ams.org/mathscinet-getitem?mr=0630098]. For more information about the relation between Stein's unbiased risk estimator and Stein's lemma, refer to [http://www.cc.gatech.edu/~lebanon/notes/sure.pdf]. The following is a description of Stein's lemma and its use in deriving Stein's unbiased risk estimator (SURE).

Note that the material presented here is applicable to model selection in general, and is not specific to RBF networks.

+ | |||

===Applications of Stein's unbiased risk estimate===

A standard application of SURE is to choose a parametric form for an estimator, and then optimize the values of the parameters to minimize the risk estimate. This technique has been applied in several settings. For example, a variant of the James–Stein estimator [http://en.wikipedia.org/wiki/James%E2%80%93Stein_estimator] can be derived by finding the optimal shrinkage estimator. The technique has also been used by Donoho and Johnstone to determine the optimal shrinkage factor in a wavelet denoising setting [http://www.jstor.org/sici?sici=0162-1459%28199512%2990%3A432%3C1200%3AATUSVW%3E2.0.CO%3B2-K].

SURE has also been used for optical flow estimation by Mingren Shi [http://www.sci.usq.edu.au/research/seminars/files//seminar1/OpSureTalk.pdf].

====Important Notation [http://en.wikipedia.org/wiki/Stein's_unbiased_risk_estimate]====


This means that equation <math>\displaystyle (1)</math> now becomes, for one data point:

<math>\displaystyle E[(\hat y-y)^2 ]=E[(\hat f-f)^2]+\sigma^2-2\sigma^2E\left[\frac {\partial \hat f}{\partial y}\right]</math>.


<math>\,Trace(H)= Trace(\Phi(\Phi^{T}\Phi)^{-1}\Phi^{T})= Trace(\Phi^{T}\Phi(\Phi^{T}\Phi)^{-1})=m</math>, by the trace cyclical permutation property, where <math>\displaystyle m</math> is the number of basis functions in the RBF network (and hence <math>\displaystyle \Phi</math> has dimension <math>\displaystyle n \times m</math>).<br>
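This trace identity is easy to verify numerically; in the sketch below, Φ is just a random stand-in design matrix, not a fitted RBF design matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 7
Phi = rng.normal(size=(n, m))                 # stand-in design matrix (n x m)
H = Phi @ np.linalg.inv(Phi.T @ Phi) @ Phi.T  # hat matrix

# H projects onto the m-dimensional column space of Phi, so Trace(H) = m.
print(round(np.trace(H), 6))
```

Since H is the orthogonal projection onto the column space of Φ, its trace equals the rank of Φ, which is <math>m</math> for a full-rank design matrix.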

====Sketch of Trace Cyclical Property Proof====

For <math>\, A_{mn}, B_{nm}, Tr(AB) = \sum_{i=1}^{m}\sum_{j=1}^{n}A_{ij}B_{ji} = \sum_{j=1}^{n}\sum_{i=1}^{m}B_{ji}A_{ij} = Tr(BA)</math>.<br>

With that in mind, for <math>\, A_{nn}, B_{nn} = CD, Tr(AB) = Tr(ACD) = Tr(BA)</math> (from above) <math>\, = Tr(CDA)</math>.<br><br>


Substituting <math>\sum_{i=1}^n \,H_{ii} = m+1</math> into equation <math>\displaystyle (3)</math> gives the following:

<math>\displaystyle Err=err-n\sigma^2+2\sigma^2(m+1)</math>.
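This criterion can be used to select the number of basis functions. Below is a minimal sketch, assuming the noise variance σ² is known, with an intercept plus m Gaussian bases (so the trace of the hat matrix is m + 1, as above); the true function, basis centers, and widths are invented for the demo:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 60
x = np.linspace(0, 1, n)
f = np.sin(2 * np.pi * x)                 # true function (an assumption for the demo)
sigma = 0.2
y = f + rng.normal(0, sigma, n)           # noisy observations

def sure_score(m):
    """Fit m Gaussian bases plus an intercept; return Err = err - n*sigma^2 + 2*sigma^2*(m+1)."""
    mu = np.linspace(0, 1, m)
    Phi = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * 0.1 ** 2))
    Phi = np.column_stack([np.ones(n), Phi])      # intercept column -> Trace(H) = m + 1
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # least-squares weights
    err = np.sum((Phi @ w - y) ** 2)              # training error
    return err - n * sigma ** 2 + 2 * sigma ** 2 * (m + 1)

best_m = min(range(1, 20), key=sure_score)
print(best_m)
```

The training error `err` alone would keep decreasing as `m` grows; the penalty term <math>2\sigma^2(m+1)</math> makes the estimated testing error turn back up for overly complex models.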


====Reference:====

* Automatic basis selection for RBF networks using Stein’s unbiased risk estimator [http://www.google.ca/url?sa=t&source=web&cd=2&sqi=2&ved=0CB4QFjAB&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.5.5344%26rep%3Drep1%26type%3Dpdf&rct=j&q=Stein%27s%20Unbiased%20Risk%20Estimator%29%20RBF&ei=YsHSTKzgDYausAO-4IWrCw&usg=AFQjCNHO9oFBQ6tljsEqdLOjFgtiQz9gxQ&sig2=Cx9Sh0Uk-h8pDgihKkU_HA&cad=rja.pdf]

* J. Moody and C. J. Darken, "Fast learning in networks of locally tuned processing units," Neural Computation, 1, 281-294 (1989). Also see [http://www.ki.inf.tu-dresden.de/~fritzke/FuzzyPaper/node5.html Radial basis function networks according to Moody and Darken]

* T. Poggio and F. Girosi, "Networks for approximation and learning," Proc. IEEE 78(9), 1484-1487 (1990).

* Roger D. Jones, Y. C. Lee, C. W. Barnes, G. W. Flake, K. Lee, P. S. Lewis, and S. Qian, [http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=137644 Function approximation and time series prediction with neural networks], Proceedings of the International Joint Conference on Neural Networks, June 17-21, p. I-649 (1990).

* John R. Davies, Stephen V. Coggeshall, Roger D. Jones, and Daniel Schutzer, "Intelligent Security Systems,"

* S. Chen, C. F. N. Cowan, and P. M. Grant, "Orthogonal Least Squares Learning Algorithm for Radial Basis Function Networks", IEEE Transactions on Neural Networks, Vol 2, No 2 (Mar) 1991.

====Further Reading:====


Estimation of the Mean of a Multivariate Normal Distribution [http://www.jstor.org/pss/2240405]

+ | |||

=====Generalized SURE for Exponential Families=====

As noted above, Stein's unbiased risk estimate (SURE) is limited to the independent, identically distributed (i.i.d.) Gaussian model. However, in recent work, researchers have derived a SURE counterpart for more general models, extending the technique to a wider range of applications. In 2009, Yonina C. Eldar of the Department of Electrical Engineering, Technion - Israel Institute of Technology, published a paper introducing a new method for choosing regularization parameters in penalized LS estimators; the method designs estimates without predefining their structure and can be shown to outperform the conventional generalized cross-validation and discrepancy approaches in the context of image deblurring and deconvolution. For more information, see Yonina C. Eldar, "Generalized SURE for Exponential Families: Applications to Regularization," IEEE Transactions on Signal Processing, Vol. 57, No. 2, February 2009.

== '''Regularization for Neural Network - November 4, 2010'''==


Large weights in a neural network can hurt generalization in two different ways. First, excessively large weights leading to hidden units can cause the output function to be too rough, possibly with near discontinuities. Second, excessively large weights leading to output units can cause wild outputs far beyond the range of the data if the output activation function is not bounded to the same range as the data. To put it another way, large weights can cause excessive variance of the output. The effort to reduce the size of these weights is called regularization.

+ | |||

Training of the weights in a neural network is usually accomplished by iteratively developing them from some regularized set of small initial values. The weights tend to increase in absolute value as training proceeds. When neural networks were first developed, weights were prevented from getting too large by simply stopping the training session early; to determine when to stop training the neural network, a set of test data was used to detect overfitting. Using this method, the stopping point would be determined by finding the length of training time that results in minimal classification error for the test set. However, in this section, a somewhat different regularization method is presented that does not require the training session to be terminated early; rather, this method directly penalizes overfitting in the optimization calculation.

=== ''' Weight decay'''===

Weight decay is a subset of regularization methods, which aim to prevent overfitting in model selection. The penalty term in weight decay, by definition, penalizes large weights. Other regularization methods may involve not only the weights but various derivatives of the output function [http://research.microsoft.com/en-us/um/people/cmbishop/nnpr.htm]. The weight decay penalty term causes the weights to converge to smaller absolute values than they otherwise would.

[[File:figure 2.png|350px|thumb|right|Figure 3: activation function]]

Weight decay training is suggested as a method useful in achieving a robust [http://en.wikipedia.org/wiki/Neural_network neural network] which is insensitive to noise. Since the number of hidden layers in a NN is usually decided by certain domain knowledge, the network may easily run into the problem of overfitting.

The weight–decay method is an effective way to improve the generalization ability of neural networks. In general, the trained weights are constrained to be small when the weight-decay method is applied. Large weights in the output layer can cause outputs that are far beyond the range of the data (when test data is used); in other words, large weights can result in high output variance.

It can be seen from Figure 3 that when the weight is in the vicinity of zero, the operative part of the activation function shows linear behavior. That is, the operative part of a sigmoid function is almost linear for small weights. The NN then collapses to an approximately linear model. Note that a linear model is the simplest model, and we can avoid overfitting by constraining the weights to be small. This gives us a hint on why we initialize the random weights to be close to zero. If the weights are large, the model is more complex and the activation function tends to be nonlinear.

− | Note that it is not necessarily bad to go to the nonlinear section of the activation function. In fact we use nonlinear activation functions to increase the ability of neural networks and make it possible to estimate nonlinear functions. What we must avoid is using the nonlinear section more than required, which would result in overfitting the training data. | + | Note that it is not necessarily bad to go to the nonlinear section of the activation function. In fact, we use nonlinear activation functions to increase the ability of neural networks and make it possible to estimate nonlinear functions. What we must avoid is using the nonlinear section more than required, which would result in overfitting of the training data. To achieve this we add a penalty term to the error function. |

− | + | The usual penalty is the sum of squared weights times a decay constant. In a linear model, this form of weight decay is equivalent to ridge regression [http://komarix.org/ac/papers/thesis/thesis_html/node15.html]. Now the regularized error function becomes: | |

Line 3,016: | Line 3,902: | ||

<math>u^{new} \leftarrow u^{old} - \rho\left(\frac{\partial err}{\partial u} + 2\lambda u^{old}\right)</math> | <math>u^{new} \leftarrow u^{old} - \rho\left(\frac{\partial err}{\partial u} + 2\lambda u^{old}\right)</math> | ||
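As a numerical check of the ridge-regression equivalence claimed above, the following Python sketch (the data set, learning rate, and decay constant are made-up illustrative values, not from the course) runs the update rule <math>u^{new} \leftarrow u^{old} - \rho(\frac{\partial err}{\partial u} + 2\lambda u^{old})</math> on a one-parameter linear model and compares the result with the closed-form ridge solution.

```python
# Gradient descent with a weight-decay penalty on a 1-D linear model y ~ u*x.
# The penalized objective is  sum_i (y_i - u x_i)^2 + lam * u^2, whose
# minimizer is the ridge solution  u* = sum(x*y) / (sum(x^2) + lam).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 8.0]   # roughly y = 2x (hypothetical data)
lam = 0.5                    # weight-decay (ridge) constant lambda
rho = 0.01                   # learning rate

u = 0.0
for _ in range(5000):
    # derivative of the squared error  sum (y_i - u x_i)^2  w.r.t. u
    derr = sum(-2.0 * x * (y - u * x) for x, y in zip(xs, ys))
    # the weight-decay update rule: gradient plus 2*lambda*u
    u = u - rho * (derr + 2.0 * lam * u)

# closed-form ridge-regression solution for comparison
u_ridge = sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)
print(u, u_ridge)   # the two agree after convergence
```

The iterative update and the closed-form ridge estimate coincide, illustrating that quadratic weight decay and ridge regression are the same penalty in a linear model.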


To conclude, the weight decay penalty term leads the weights to converge to smaller absolute values than they otherwise would. Excessively large weights can make the fitted function too rough, possibly with near discontinuities. Excessively large weights leading to output units can cause wild outputs far beyond the range of the data if the output activation function is not bounded to the same range as the data. In other words, large weights can cause large variance of the output [http://portal.acm.org/citation.cfm?id=148062]. According to [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.57.2302], the size (L1-norm) of the weights is more important than the number of weights in determining generalization.

<math>\,\lambda</math> is different for different types of weights in the NN. We can have different <math>\,\lambda</math> for input-to-hidden, hidden-to-hidden, and hidden-to-output weights.

The following Matlab code implements a neural network with weight decay using the backpropagation method:

 %initialize (training_X is m-by-n and training_Y is n-by-1, assumed given)
 m = size(training_X,1);          %number of features
 n = size(training_X,2);          %number of data points
 nodes = 10;                      %number of hidden nodes
 u1 = rand(m,nodes)-0.5;          %input-to-hidden weights
 u2 = rand(nodes,1)-0.5;          %hidden-to-output weights
 ro = 0.05;                       %learning rate
 weightPenalty = 0.001;           %weight decay constant (lambda)
 
 Z_output = zeros(n,1);
 Z = zeros(nodes,n);
 
 %% TRAIN DATA
 for epoch = 1:100
     for i = 1:n
         %% Forward Pass
         %determine inputs to hidden layer
         A = u1'*training_X(:,i);
 
         %apply sigmoid activation to hidden layer weighted inputs
         for j = 1:nodes
             Z(j,i) = 1/(1+exp(-A(j)));
         end
 
         %apply weights to get fitted outputs
         Z_output(i,1) = u2'*Z(:,i);
 
         %% Backward Pass
         %output delta
         delta_O = -2*(training_Y(i)-Z_output(i,1));
 
         %update the hidden-to-output weights; weight decay adds 2*lambda*u2(j)
         for j = 1:nodes
             u2(j) = u2(j)-ro*(delta_O*Z(j,i)+2*weightPenalty*u2(j));
         end
 
         %update the input-to-hidden weights; weight decay adds 2*lambda*u1(:,j)
         for j = 1:nodes
             sigmaPrime = exp(-A(j))/(1+exp(-A(j)))^2;
             delta_H = sigmaPrime*delta_O*u2(j);
             u1(:,j) = u1(:,j)-ro*(delta_H*training_X(:,i)+2*weightPenalty*u1(:,j));
         end
     end
     yhat(:,1) = Z_output(:,1) > 0.5;   %thresholded class predictions
 end

For more reading about the effect of weight decay training for backpropagation on noisy data sets, please refer to [http://www.sciencedirect.com/science?_ob=MImg&_imagekey=B6T08-3TYVWK9-F-P&_cdi=4856&_user=1067412&_pii=S089360809800046X&_origin=search&_coverDate=08%2F31%2F1998&_sk=999889993&view=c&wchp=dGLbVzW-zSkzS&md5=52846ec8e0ba54b28000ef1de34c7bc5&ie=/sdarticle.pdf]; for how weight decay can improve generalization in feed-forward networks, refer to [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.47.4221].

A fundamental problem with weight decay is that different types of weights in the network will usually require different decay constants for good generalization. At the very least, you need three different decay constants for input-to-hidden, hidden-to-hidden, and hidden-to-output weights. Adjusting all these decay constants to produce the best estimated generalization error often requires vast amounts of computation.

Fortunately, there is a superior alternative to weight decay: hierarchical Bayesian learning. Bayesian learning makes it possible to efficiently estimate numerous decay constants.

Weight decay is proposed to reduce the overfitting that often appears in the learning tasks of artificial neural networks. For example, in [http://www.springerlink.com/content/f21781218007l750/fulltext.pdf] weight decay is applied to a well-defined model system based on a single-layer perceptron, which exhibits strong overfitting. Since the optimal non-overfitting solution is known for this system, the authors compared the effect of the weight decay with this solution. A strategy to find the optimal weight decay strength is proposed, which leads to the optimal solution for any number of examples.

====Methods to estimate the weight decay parameter====

One of the biggest problems in weight decay regularization of neural networks is how to estimate its parameter. Many ways have been proposed in the literature to estimate the weight decay parameter.

Typically, the weight decay parameter is set between 0.001 and 0.1, based on network training. An inappropriate estimate of the decay parameter may cause over-fitting or over-smoothing. Determining the correct value of the parameter is a tedious process that requires a lot of trial and error. Typically, the optimal value of the weight decay is determined by training the network many times: network training is performed with the same set of initial weights and the same network configuration (a fixed number of hidden layers), the network is fit with various weight decay parameters, and the value yielding the smallest generalization error is chosen.
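The trial-and-error procedure above can be sketched in a few lines of Python. Everything here is an illustrative assumption: a one-parameter ridge model stands in for the neural network (so each "training run" is a closed-form fit), and the data and candidate grid are made up; only the selection logic mirrors the procedure described.

```python
# Select the weight-decay parameter by refitting with several candidate
# values and keeping the one with the smallest held-out (generalization) error.
train = [(1.0, 2.2), (2.0, 4.3), (3.0, 6.4)]   # (x, y) training pairs
valid = [(4.0, 8.1), (5.0, 9.9)]               # held-out validation pairs

def fit(data, lam):
    # closed-form ridge fit for y ~ u*x :  u = sum(xy) / (sum(x^2) + lam)
    return sum(x * y for x, y in data) / (sum(x * x for x, _ in data) + lam)

def val_error(u, data):
    # squared error on held-out data, the "generalization error" estimate
    return sum((y - u * x) ** 2 for x, y in data)

candidates = [0.001, 0.01, 0.1, 1.0, 10.0]     # candidate decay values
best_lam = min(candidates, key=lambda lam: val_error(fit(train, lam), valid))
print(best_lam)
```

Here the training data over-estimates the slope relative to the validation data, so an intermediate amount of shrinkage wins; too little decay under-smooths and too much over-smooths, exactly the trade-off described above.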

The following papers are a good start for someone looking for further reading.

1- On the selection of weight decay parameter for faulty networks [http://portal.acm.org/citation.cfm?id=1862025 here]

2- A Simple Trick for Estimating the Weight Decay Parameter [http://www.springerlink.com/content/0889d07ufuwgql03/ here]

===Regularization invariant under transformation===

Many approaches have been devised so that, when regularization is used during the training process of a network, the resulting predictions would be invariant under any transformation(s) made to the input variable(s). One such approach is to add a regularization term to the error function that serves to penalize any possible changes to the outputs resulting from any transformation(s) applied to the inputs. A common example of this approach is [http://arts.uwaterloo.ca/~cnrglab/?q=system/files/tangent_prop.pdf tangent propagation], which is described in Sargur Srihari's [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap5/Chap5.5-Regularization.pdf slides] and which is discussed in detail in Simard ''et al.'''s 1998 [http://yann.lecun.com/exdb/publis/pdf/simard-98.pdf paper] regarding transformation invariance. Several other approaches are also described in Sargur Srihari's slides.

=== Other Alternatives for NN Regularization ===

As enumerated before, there are some drawbacks to weight decay, which makes it necessary to consider other regularization methods. Consistent Gaussian priors, early stopping, invariances, tangent propagation, training with transformed data, convolutional networks, and soft weight sharing are some other alternatives for neural network regularization. A zealous reader may find a great deal of information about these topics in Pattern Recognition and Machine Learning by Christopher M. Bishop (Chapter 5, Section 5).

==='''Further reading'''===

The generalization ability of the network can depend crucially on the decay constant, especially with small training sets. One approach to choosing the decay constant is to train several networks with different amounts of decay and estimate the generalization error for each; then choose the decay constant that minimizes the estimated generalization error.

Fortunately, there is a superior alternative to weight decay: hierarchical Bayesian learning. Bayesian learning makes it possible to efficiently estimate numerous decay constants. For information about Bayesian learning, please refer to [http://en.wikipedia.org/wiki/Bayesian_inference Bayesian inference].

[http://books.google.ca/books?id=jFAbzhrDqRcC&pg=PA1125&lpg=PA1125&dq=regularization+in+neural+networks+weight+decay&source=bl&ots=6YX8KIhxyO&sig=Dcwk5Y1_LPvtLhukEx3gDcVNEik&hl=en&ei=b0HzTLbfBYmgnwfv-5mXCg&sa=X&oi=book_result&ct=result&resnum=2&ved=0CCIQ6AEwATgK#v=onepage&q&f=false]

===='''References'''====

4. Sargur Srihari. ''Regularization in Neural Networks'' slides. [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap5/Chap5.5-Regularization.pdf]

5. Neural Network Modeling using SAS Enterprise Miner [http://www.sasenterpriseminer.com/neural_networks.htm]

6. Valentina Corradi, Halbert White. ''Regularized neural networks: some convergence rate results'' [http://portal.acm.org/citation.cfm?id=211706]

7. Bayesian Regularization in a Neural Network Model to Estimate Lines of Code Using Function Points [http://www.scipub.org/fulltext/jcs/jcs14505-509.pdf]

8. A useful pdf introducing regularization of neural networks: [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap5/Chap5.5-Regularization.pdf]

=='''Support Vector Machine - November 09, 2010'''==

===Introduction===

Through the course we have seen different methods for solving linearly separable problems, e.g. linear regression, LDA, and neural networks. In most cases, we can find many linear boundaries for a problem which separate the classes (see Figure 1), all with the same training error. A question arises: which of these boundaries is optimal and has minimum true error? The answer to this question leads to a new type of classifier called the [http://en.wikipedia.org/wiki/Support_vector_machine Support Vector Machine (SVM)].

SVMs are a set of supervised learning methods.

The original algorithm was proposed by Vladimir [http://en.wikipedia.org/wiki/Vapnik Vapnik] and later formulated into its current form by Corinna Cortes and Vapnik. The modern history of SVM can be traced to 1974, when the field of [http://www.econ.upf.edu/~lugosi/mlss_slt.pdf statistical learning theory] was pioneered by [http://en.wikipedia.org/wiki/Vladimir_Vapnik Vladimir Vapnik] and [http://en.wikipedia.org/wiki/Alexey_Chervonenkis Alexey Chervonenkis]. SVM was established in 1979, when Vapnik further developed statistical learning theory and wrote a book documenting his work. Since Vapnik's 1979 book was written in Russian, SVM did not become popular until Vapnik immigrated to the US and, in 1982, translated the book into English. More of SVM's history can be found in this [http://www.svms.org/history.html link].

The current standard incarnation of SVM is known as "soft margin" and was proposed by Corinna Cortes and Vladimir Vapnik [http://en.wikipedia.org/wiki/Vladimir_Vapnik]. In practice the data is not usually linearly separable. Although theoretically we can make the data linearly separable by mapping it into higher dimensions, the issues of how to obtain the mapping and how to avoid overfitting are still of concern. A more practical approach to classifying non-linearly separable data is to add some error tolerance to the separating hyperplane between the two classes, meaning that a data point in class A can cross the separating hyperplane into class B by a certain specified distance. This more generalized version of SVM is the so-called "soft margin" support vector machine and is generally accepted, over the hard margin case, as the standard form of SVM in practice today. [http://en.wikipedia.org/wiki/Support_vector_machine#Soft_margin]

SVM was introduced after neural networks and gathered attention by outperforming neural networks in many applications, e.g. bioinformatics, text, and image recognition. It retained popularity until recently, when deep networks (introduced by Hinton) outperformed SVM in some applications. A support vector machine constructs a hyperplane which can be used as a classification boundary. These linear decision boundaries explicitly try to separate the data into different classes while maximizing the margin of separation. Intuitively, if we are dealing with separable data clusters, a good separation is achieved by the hyperplane that has the largest distance to the nearest training data point(s) from each of the classes, since in general the larger the margin, the lower the generalization error of the classifier, i.e. the lower the probability that a new data point would be misclassified.

The techniques that extend SVM to the non-linearly separable case, where the classes overlap no matter what linear boundary is created, are generalized to what is known as the "kernel support vector machine". Kernel SVM produces a nonlinear boundary by constructing a linear boundary in a higher-dimensional, transformed feature space. This non-linear boundary is a linear boundary in the transformed feature space obtained by applying the kernel, making kernel SVM a linear classifier just as the original form of SVM.

No matter whether the training data are linearly separable or not, the linear boundary produced by any of the versions of SVM is calculated using only a small fraction of the training data rather than all of the training data points. This is much like the difference between the median and the mean. SVM can also be considered a special case of [http://en.wikipedia.org/wiki/Tikhonov_regularization Tikhonov regularization]. A special property is that SVMs simultaneously minimize the empirical classification error and maximize the geometric margin; hence they are also known as maximum margin classifiers. The key features of SVM are the use of kernels, the absence of local minima, the sparseness of the solution (i.e. few training data points are needed to construct the linear decision boundary) and the capacity control obtained by optimizing the margin (Shawe-Taylor and Cristianini, 2004). Another key feature of SVM, as discussed below, is the use of [http://en.wikipedia.org/wiki/Slack_variable slack variables] to control the amount of tolerable misclassification on the training data; these form the soft margin SVM. This key feature can serve to improve the generalization of SVM to new data. SVM has been used successfully in many real-world problems:

- Pattern Recognition (Face Detection [17], Face Verification [18], Object Recognition [19], Handwritten Character/Digit Recognition [20], Speaker/Speech Recognition [21], Image Retrieval [22], Prediction [23])

For a complete list of SVM applications please refer to [http://www.clopinet.com/isabelle/Projects/SVM/applist.html].

===Optimal Separating Hyperplane===

As can be seen in Figure 1, there exists an infinite number of linear hyperplanes between the two classes. A Support Vector Machine (SVM) performs classification by constructing an N-dimensional hyperplane that optimally separates the data into two categories.

The data points which are indicated in the green circles in Figure 2 are the data points that are on the boundary of the margin and are called Support Vectors.

[[File:Yyy.png|250px|thumb|right|Fig. 1 Linear Classifiers]]

[[File:Xxx.png|250px|thumb|right|Fig. 2 Maximum Margin]]

Figure 3 shows the linear algebra of the hyperplane, where <math>\,d_i</math> is the distance between the origin and a point <math>\,x_i</math> on the hyperplane.

Suppose a hyperplane is defined as <math>\displaystyle \beta^{T}x+\beta_0=0</math>, as shown in Figure 3, and suppose that the data is linearly separable with <math>\displaystyle y_i \in \{-1,1 \} </math>, where <math>\displaystyle \beta_0</math> determines the offset of the hyperplane from the origin.

Property 1: <math>\displaystyle \beta </math> is orthogonal to the hyperplane.

Suppose that <math>\displaystyle x_1,x_2</math> are lying on the hyperplane. Then we have

: <math>\displaystyle \beta^{T}x_1+\beta_0=0</math> , and

: <math>\displaystyle \beta^{T}x_2+\beta_0=0</math> .

Therefore,

: <math>\displaystyle \beta^{T}x_1+\beta_0 - (\beta^{T}x_2+\beta_0)=0</math> , and

: <math>\displaystyle \beta^{T}(x_1-x_2)=0</math> .

Hence,

: <math>\displaystyle \beta \bot (x_1 - x_2)</math> .

But <math>\displaystyle x_1-x_2</math> is a vector lying in the hyperplane, since the two points were arbitrary. So, <math>\displaystyle \beta </math> is orthogonal to every vector lying in the hyperplane and is, by definition, orthogonal to the hyperplane.

Property 2:

For any point <math>\displaystyle x_0 </math> on the hyperplane, we can say that

: <math>\displaystyle \beta^{T}x_0+\beta_0=0</math> and

: <math>\displaystyle \beta^{T}x_0=-\beta_0</math> .

That is, for any point on the hyperplane, multiplying by <math>\displaystyle \beta^{T}</math> gives the negative of the intercept of the hyperplane.
<br />

Property 3:

For any point <math>\displaystyle x_i</math>, let the distance of the point to the hyperplane be denoted by <math>\displaystyle d_i</math>, which is the length of the projection of <math>\displaystyle (x_i - x_0)</math> onto <math>\displaystyle\beta</math>. The signed distance for any point <math>\displaystyle x_i </math> to the hyperplane is proportional to <math> \displaystyle \beta^{T}(x_i - x_0)</math>. Since the length of <math>\displaystyle \beta </math> changes the value of this quantity, we normalize it by dividing by the norm of <math>\displaystyle \beta </math>. Thus, we get

: <math>\displaystyle d_i=\frac{\beta^{T}(x_i-x_0)}{\|\beta\|} </math> <math>\displaystyle i=1,2,\dots,N </math> ,

: <math>\displaystyle d_i=\frac{\beta^{T}x_i-\beta^{T}x_0}{\|\beta\|} </math> and, by property 2,

: <math>\displaystyle d_i=\frac{\beta^{T}x_i+\beta_0}{\|\beta\|} </math> .

Therefore, to find the distance of any point to the hyperplane, we simply substitute the point into the above equation.

Property 4:

We use the labels to make the distance positive. Therefore, let <math>\displaystyle Margin=(y_id_i)</math>. Since we would like to maximize the margin, we have

: <math>\displaystyle Margin=max(y_id_i)</math> <math>\displaystyle i=1,2,\dots,N </math> .

Since we now know how to compute <math>\displaystyle d_i </math> by property 3,

: <math>\displaystyle Margin=max\{y_i\frac{\beta^{T}x_i+\beta_0}{\|\beta\|}\} \quad (1)</math> , and

: <math>\displaystyle y_i(\beta^{T}x_i+\beta_0)\ge 0</math> .

Since the margin is a distance, it is always non-negative. If the point is on the hyperplane, it is zero; otherwise, it is greater than zero.

For all training data points <math>\,i</math> that are not on the hyperplane,

: <math>\displaystyle y_i(\beta^{T}x_i+\beta_0)>0 </math> .

Let <math> \displaystyle c>0 </math> be the minimum of <math>\displaystyle y_i(\beta^{T}x_i+\beta_0)</math> over the training data points not on the hyperplane. We have

: <math>\, y_i(\beta^{T}x_i+\beta_0)\ge c </math>

for all training data points <math> \displaystyle i </math> that are not on the hyperplane. Thus,

: <math>\displaystyle y_i(\frac{\beta^{T}x_i}{c}+\frac{\beta_0}{c})\ge 1</math> .

This is known as the canonical representation of the decision hyperplane. For <math>\displaystyle \beta^{T} </math> only the direction is important, so rescaling by <math>\displaystyle \frac{1}{c} </math> does not change the direction and the hyperplane remains the same. Thus,

: <math>\displaystyle y_i(\beta^{T}x_i+\beta_0)\ge 1 \quad (2)</math> ,<br />

equivalently, as we care only about the direction of <math>\displaystyle\beta</math>, we can write:<br />

: <math>\displaystyle y_i\frac{\beta^{T}x_i+\beta_0}{\|\beta\|}\geq1 </math> <br /><br />

Considering (2) and (1), for the closest data points to the hyperplane (those at distance exactly 1 in the canonical representation above), (1) becomes:<br />

: <math>\displaystyle Margin=max\{\frac{1}{\|\beta\|}\} </math>

Therefore, in order to maximize the margin we have to minimize the norm of <math>\,\beta</math>. So, we get

: minimize <math>\displaystyle\|\beta\|^2</math> , or equivalently,

: minimize <math>\displaystyle\frac{1}{2}\|\beta\|^2</math> s.t. <math> \displaystyle y_i(\beta^T x_i + \beta_0) \geq 1 \,\forall\, i</math> ,

where the constraint requires every training point to have (canonical) distance at least one from the hyperplane.
<br />
We choose to minimize the 2-norm of <math>\displaystyle\beta</math> mainly for the sake of a simplified optimization. The factor of <math>\displaystyle\frac{1}{2}</math> is used only for convenience when taking the derivative.
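Properties 3 and 4 can be made concrete with a small Python sketch (the hyperplane and the data points below are hypothetical numbers chosen for illustration): it computes the signed distance <math>\displaystyle \frac{\beta^{T}x_i+\beta_0}{\|\beta\|}</math> of each point, multiplies by the label to make it positive, and reports the margin of this particular hyperplane as the distance to its closest point(s).

```python
# Distance of labelled points to the hyperplane beta^T x + beta_0 = 0.
beta = (0.5, 0.5)
beta0 = 0.0
points = [((1.0, 1.0), 1), ((2.0, 0.5), 1),
          ((-1.0, -1.0), -1), ((-2.0, 0.0), -1)]   # ((x1, x2), label)

norm = (beta[0] ** 2 + beta[1] ** 2) ** 0.5

def signed_distance(x):
    # (beta^T x + beta_0) / ||beta||
    return (beta[0] * x[0] + beta[1] * x[1] + beta0) / norm

# multiply by the label y_i so correctly classified points get positive values
dists = [y * signed_distance(x) for x, y in points]
margin = min(dists)    # this hyperplane's margin: distance to closest point(s)
print(dists, margin)
```

All four points sit on the correct side (every <math>\,y_id_i>0</math>), and the closest points on each side attain the margin; maximizing this quantity over all hyperplanes is what the optimization problem above does.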

===Writing Lagrangian Form of Support Vector Machine===

The Lagrangian form using [http://en.wikipedia.org/wiki/Lagrange_multipliers Lagrange multipliers] and the constraints discussed below is introduced to ensure that the optimization conditions are satisfied, as well as to find an optimal solution (the optimal saddle point of the Lagrangian for the [http://en.wikipedia.org/wiki/Quadratic_programming classic quadratic optimization]). The problem will be solved in dual space by introducing <math>\,\alpha_i</math> as dual constraints; this is in contrast to solving the problem in primal space as a function of the betas. A [http://www.cs.wisc.edu/dmi/lsvm/ simple algorithm] for iteratively solving the Lagrangian has been found to run well on very large data sets, making SVM more usable. Note that this algorithm is intended to solve Support Vector Machines with some tolerance for errors - not all points are necessarily classified correctly. Several papers by Mangasarian explore different algorithms for solving SVM.

Dual form of the optimization problem:

: <math>\,L(\beta,\beta_0,\alpha) = \frac{1}{2}\|\beta\|^2 - \sum_{i=1}^n{\alpha_i\left(y_i(\beta^Tx_i+\beta_0)-1\right)}</math> .

To find the optimal value, we set the derivatives equal to zero:

: <math>\,\frac{\partial L}{\partial \beta} = 0</math> and <math>\,\frac{\partial L}{\partial \beta_0} = 0</math> .

Note that <math>\,\frac{\partial L}{\partial \alpha_i}</math> is equivalent to the constraints <math>\left(y_i(\beta^Tx_i+\beta_0)-1\right) \geq 0, \,\forall\, i</math>.

First, setting <math>\,\frac{\partial L}{\partial \beta} = 0</math>:

: <math>\,\frac{\partial L}{\partial \beta} = \frac{\partial}{\partial \beta}\frac{1}{2}\|\beta\|^2 - \sum_{i=1}^n{\left\{\frac{\partial}{\partial \beta}(\alpha_iy_i\beta^Tx_i)+\frac{\partial}{\partial \beta}\alpha_iy_i\beta_0-\frac{\partial}{\partial \beta}\alpha_iy_i\right\}}</math>, where

: <math>\frac{\partial}{\partial \beta}\frac{1}{2}\|\beta\|^2 = \beta</math>,

: <math>\,\frac{\partial}{\partial \beta}(\alpha_iy_i\beta^Tx_i) = \alpha_iy_ix_i</math>,

: <math>\,\frac{\partial}{\partial \beta}\alpha_iy_i\beta_0 = 0</math>, and

: <math>\,\frac{\partial}{\partial \beta}\alpha_iy_i = 0</math>.

So this simplifies to <math>\,\frac{\partial L}{\partial \beta} = \beta - \sum_{i=1}^n{\alpha_iy_ix_i} = 0</math>. In other words,

: <math>\,\beta = \sum_{i=1}^n{\alpha_iy_ix_i}</math> and <math>\,\beta^T = \sum_{i=1}^n{\alpha_iy_ix_i^T}</math>.

Similarly, <math>\,\frac{\partial L}{\partial \beta_0} = \sum_{i=1}^n{\alpha_iy_i} = 0</math>.

Thus, substituting these back in, our objective function becomes <math>\,\frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}} - \sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}} + \sum_{i=1}^n{\alpha_i}</math>, which is a dual representation of the maximum margin. Since each <math>\,\alpha_i</math> is a Lagrange multiplier, <math>\,\alpha_i \geq 0 \,\forall\, i</math>. Therefore, we have a new optimization problem:

: <math>\underset{\alpha}{\max} \sum_{i=1}^n{\alpha_i}- \,\frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}} </math>, where

: <math>\,\alpha_i \ge 0 \,\forall\, i</math> and

: <math>\,\sum_i{\alpha_i y_i} = 0</math>.

This is a much simpler optimization problem, and we can solve it by [http://en.wikipedia.org/wiki/Quadratic_programming quadratic programming]. Quadratic programming (QP) is a special type of mathematical optimization problem: optimizing (minimizing or maximizing) a quadratic function of several variables subject to linear constraints on these variables.

The general form of such a problem is to minimize, with respect to <math>\,x</math>,

: <math>f(x) = \frac{1}{2}x^TQx + c^Tx</math>

subject to one or more constraints of the form <math>\,Ax\le b</math>, <math>\,Ex=d</math>.

See this [http://www.me.utexas.edu/~jensen/ORMM/supplements/methods/nlpmethod/S2_quadratic.pdf link] for a good description of general QP problem formulation and solution.
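As a minimal numeric sketch of this general form (not from the lecture; the matrices <code>Q</code>, <code>c</code>, <code>A</code>, <code>b</code> are hypothetical toy values, and SciPy's SLSQP solver stands in for a dedicated QP solver):

```python
import numpy as np
from scipy.optimize import minimize

# Toy QP: minimize (1/2) x^T Q x + c^T x  subject to A x <= b.
Q = np.array([[2.0, 0.0], [0.0, 2.0]])
c = np.array([-2.0, -5.0])
A = np.array([[1.0, 1.0]])   # single constraint: x1 + x2 <= 3
b = np.array([3.0])

def f(x):
    return 0.5 * x @ Q @ x + c @ x

# SLSQP expects inequality constraints in the form g(x) >= 0.
cons = [{"type": "ineq", "fun": lambda x: b - A @ x}]
res = minimize(f, x0=np.zeros(2), constraints=cons, method="SLSQP")
print(res.x)  # approximately [0.75, 2.25]
```

Here the unconstrained minimum (1, 2.5) violates the constraint, so the solution lands on the boundary x1 + x2 = 3.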

====Remark====

Using the Lagrangian primal or the Lagrangian dual gives two very different formulations of this problem:

-------------

If we use the Lagrangian primal:

:<math>\,\min L(\beta,\beta_0,\alpha)</math>

where

:<math>\,L(\beta,\beta_0,\alpha) = \frac{1}{2}\|\beta\|^2 - \sum_{i=1}^n{\alpha_i\left(y_i(\beta^Tx_i+\beta_0)-1\right)}</math>

such that

:<math>\,\frac{\partial L}{\partial \beta} = 0</math>, <math>\,\frac{\partial L}{\partial \beta_0} = 0</math> and <math>\alpha_i\left(y_i(\beta^Tx_i+\beta_0)-1\right) = 0, \,\forall\, i</math>

-------------

If we use the Lagrangian dual:

:<math>\,\max L(\beta,\beta_0,\alpha)</math>

where

:<math>\,L(\beta,\beta_0,\alpha) = \frac{1}{2}\|\beta\|^2 - \sum_{i=1}^n{\alpha_i\left(y_i(\beta^Tx_i+\beta_0)-1\right)}</math>

such that

:<math>\,\frac{\partial L}{\partial \beta} = 0</math> and <math>\,\frac{\partial L}{\partial \beta_0} = 0</math> (note that the constraint <math>\alpha_i\left(y_i(\beta^Tx_i+\beta_0)-1\right) = 0, \,\forall\, i</math> no longer appears)

-------------

We do not use the Lagrangian primal here because that formulation generates quadratic constraints, which cannot be solved by <code>quadprog.m</code>.

Click [http://www.icom.nctu.edu.tw/CLSCM/download/Handouts/or706_week13.pdf here] for more details about Lagrangian duality.

===Quadratic Programming Problem of SVMs and Dual Problem===

We have to find <math>\,\beta</math> and <math>\,\beta_0</math> such that <math>\,\frac{1}{2}\|\beta\|^2 </math> is minimized subject to <math> \,y_i (\beta^T x_i + \beta_0) \geq 1 \forall i </math>.

Therefore, we need to optimize a quadratic function subject to linear constraints.

Quadratic optimization problems are a well-known class of mathematical programming problems, and many (rather intricate) algorithms exist for solving them.

Lagrange multipliers that maximize the objective function

: <math>\,Q(\alpha)= \underset{\alpha}{\max} \sum_{i=1}^n{\alpha_i}- \,\frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}} </math>

subject to the constraints

: <math>\,\alpha_i \ge 0 \,\forall\, i</math> and

: <math>\,\sum_i{\alpha_i y_i} = 0</math>.

=====Feasibility of the Primal and Dual Programming Problems=====

For hard-margin SVM, feasibility depends on whether the problem's constraints can be satisfied. The dual problem is always feasible: the constraints <math>\,\alpha_i \ge 0 \,\forall\, i</math> and <math>\,\sum_i{\alpha_i y_i} = 0</math> are satisfied by setting <math>\,\alpha_i = 0 \,\forall\, i</math>, for any values of <math>\, y_i </math> and <math>\, x_i</math>. The primal problem, however, is only feasible when the training data are linearly separable: a single choice of <math>\,\beta</math> and <math>\,\beta_0</math> must satisfy <math> \,y_i (\beta^T x_i + \beta_0) \geq 1 \,\forall\, i </math> simultaneously, which is possible exactly when some hyperplane separates the two classes. (Setting <math>\, \beta = \underline{0} </math> and <math>\, \beta_0 = \frac{1}{y_i} </math> only works when all training labels have the same sign.)

===Implementation===

Another approach is to use an interior point method that uses Newton-like iterations to find a solution of the [http://en.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions Karush-Kuhn-Tucker conditions] of the primal and dual problems [10]. Instead of solving a sequence of broken-down problems, this approach directly solves the problem as a whole. To avoid solving a linear system involving the large kernel matrix, a low rank approximation to the matrix is often used with the kernel trick. Please refer to [http://www.mathworks.ch/help/toolbox/bioinfo/ref/svmtrain.html;jsessionid=q6MgMBHGsKf5hJrBv1H8pZsp4nLjsmnjFhvsGf5Ylnqzqh4fQMpn!2108730516 MATLAB's svmtrain documentation] for a code implementation of SVM.

=== Hard margin SVM Algorithm ===

[[image: H-SVM.png ]]

Source: John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, illustrated edition, June 2004.

===Multiclass SVM===

SVM is only directly applicable to the two-class case, so we want to generalize the algorithm to multi-class tasks. Multiclass SVM aims to assign labels to instances by using support vector machines, where the labels are drawn from a finite set of several elements. The dominating approach is to reduce the single multiclass problem into multiple binary problems. Each of these problems yields a binary classifier, which is assumed to produce an output function that gives relatively large values for examples from the positive class and relatively small values for examples belonging to the negative classes. Two common ways to build such binary classifiers are to have each classifier distinguish (i) one of the labels from the rest (one-versus-all) or (ii) between every pair of classes (one-versus-one). Classification of new instances in the one-versus-all case is done by a winner-takes-all strategy, in which the classifier with the highest output function assigns the class (it is important that the output functions be calibrated to produce comparable scores). For the one-versus-one approach, classification is done by a max-wins voting strategy: every classifier assigns the instance to one of its two classes, the vote for the assigned class is increased by one, and finally the class with the most votes determines the instance's classification.
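The max-wins voting scheme above can be sketched as follows (a hypothetical illustration, not from the lecture; the pairwise "classifiers" are simple hand-written rules on a one-dimensional feature standing in for trained binary SVMs):

```python
from itertools import combinations

# One-versus-one max-wins voting: each pairwise classifier returns +1
# for its first class and -1 for its second; the class with the most
# votes wins.
def ovo_predict(x, classes, classifiers):
    votes = {c: 0 for c in classes}
    for ci, cj in combinations(classes, 2):
        if classifiers[(ci, cj)](x) > 0:
            votes[ci] += 1
        else:
            votes[cj] += 1
    return max(votes, key=votes.get)

# Toy pairwise rules: class 0, then 1, then 2 along the axis.
classifiers = {
    (0, 1): lambda x: 1 if x < 0.5 else -1,
    (0, 2): lambda x: 1 if x < 1.0 else -1,
    (1, 2): lambda x: 1 if x < 1.5 else -1,
}
print(ovo_predict(0.2, [0, 1, 2], classifiers))  # 0
print(ovo_predict(1.2, [0, 1, 2], classifiers))  # 1
print(ovo_predict(2.0, [0, 1, 2], classifiers))  # 2
```

For K classes this trains K(K-1)/2 binary classifiers but each only on the data of its two classes.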

LIBSVM is integrated software for support vector classification, regression and distribution estimation. It supports multi-class classification.

[http://www.csie.ntu.edu.tw/~cjlin/libsvm/ LIBSVM website]

Koby Crammer and Yoram Singer have proposed a direct way to learn multiclass SVMs, described in their paper:

[http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.69.8716 On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines]

==== Software implementing multi-class SVM ====

[http://www.kyb.tuebingen.mpg.de/bs/people/spider/ Spider] is an object-oriented environment for machine learning in MATLAB, for unsupervised, supervised or semi-supervised machine learning problems, and includes training, testing, model selection, cross-validation, and statistical tests. It implements SVM multi-class classification and regression.

===Support Vector Machines vs Artificial Neural Networks===

The development of ANNs followed a heuristic path, with applications and extensive experimentation preceding theory. In contrast, the development of SVMs involved sound theory first, then implementation and experiments. A significant advantage of SVMs is that whilst ANNs can suffer from multiple local minima, the solution to an SVM is global and unique. Two more advantages of SVMs are that they have a simple geometric interpretation and a sparse solution. Unlike ANNs, the computational complexity of SVMs does not depend on the dimensionality of the input space. ANNs use empirical risk minimization, whilst SVMs use structural risk minimization. One reason that SVMs often outperform ANNs in practice is that they address the biggest problem with ANNs: over-fitting. SVMs are less prone to over-fitting since their solution is sparse, and in contrast to neural networks, SVMs automatically select their model size by selecting the support vectors (Rychetsky (2001)). While the weight decay term is an important aspect for obtaining good generalization in the context of neural networks for regression, the gamma parameter (in soft-margin SVM) that is discussed below plays a somewhat similar role in classification problems.

===SVM packages===

A pretty long list of other SVM packages and comparison between all of them in terms of language, execution platform, multiclass and regression capabilities, is available [http://www.cs.ubc.ca/~murphyk/Software/svm.htm here].

The top three SVM software packages are:

1. LIBSVM

2. SVMlight

3. SVMTorch

Two other web pages also introduce and compare SVM software: [http://www.svms.org/software.html svms.org] and [http://www.support-vector-machines.org/SVM_soft.html support-vector-machines.org].

===References===

22. A. Fan and M. Palaniswami, Selecting bankruptcy predictors using a support vector machine approach, vol. 6, pp. 354-359, 2000.

23. Joachims, T. Text categorization with support vector machines. Technical report, LS VIII Number 23, University of Dortmund, 1997. ftp://ftp-ai.informatik.uni-dortmund.de/pub/Reports/report23.ps.Z

==''' Support Vector Machine Cont., Kernel Trick - November 11, 2010'''==

Recall from the previous lecture that instead of solving the primal problem of maximizing the margin, we can solve the dual problem without changing the solution, as long as it satisfies the [http://en.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions Karush-Kuhn-Tucker] (KKT) conditions, the first-order conditions on the gradient at an optimal point. This leads to the following:

<math>\max_{\alpha} L(\alpha) = \sum_{i=1}^n{\alpha_i} - \frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}}</math>

:such that <math>\,\alpha_i \ge 0 \forall i</math>

:and <math>\sum_{i=1}^n{\alpha_i y_i} = 0</math>

We are looking to maximize over <math>\,\alpha</math>, which is our only unknown. Once we know <math>\,\alpha</math>, we can easily find <math>\,\beta</math> and <math>\,\beta_0</math> (see the support vector machine algorithm below for complete details).

If we examine the Lagrangian equation, we can see that <math>\,\alpha</math> is multiplied by itself; that is, the Lagrangian is quadratic with respect to <math>\,\alpha</math>. Our constraints are linear. This is therefore a problem that can be solved through [http://en.wikipedia.org/wiki/Quadratic_programming quadratic programming] techniques.

We can write the Lagrangian equation in matrix form:

<math>\max_{\alpha} L(\alpha) = \underline{\alpha}^T\underline{1} - \frac{1}{2}\underline{\alpha}^TS\underline{\alpha}</math>

:such that <math>\underline{\alpha} \geq \underline{0}</math>

:and <math>\underline{\alpha}^T\underline{y} = 0</math>

Where:

* <math>\underline{\alpha}</math> denotes an <math>\,n \times 1</math> vector; <math>\underline{\alpha}^T = [\alpha_1, ..., \alpha_n]</math>

* Matrix <math>S(i,j) = y_iy_jx_i^Tx_j = (y_ix_i)^T(y_jx_j)</math>

* <math>\,\underline{0}</math> and <math>\,\underline{1}</math> are vectors containing all 0s or all 1s respectively
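As a quick numeric check of the definition of <math>\,S</math> (a hypothetical illustration; the names <code>X</code>, <code>y</code>, <code>Z</code> and the data values are made up), the whole matrix can be formed without loops by stacking the scaled points <math>\,y_ix_i</math> as rows:

```python
import numpy as np

# Rows of X are the training points x_i; y holds the labels y_i = +/-1.
X = np.array([[1.0, 2.0], [3.0, 1.0], [0.0, -1.0]])
y = np.array([1.0, -1.0, 1.0])

Z = y[:, None] * X   # row i is y_i x_i
S = Z @ Z.T          # S(i,j) = (y_i x_i)^T (y_j x_j)

# Agrees with the element-wise definition S(i,j) = y_i y_j x_i^T x_j.
S_loop = np.array([[y[i] * y[j] * X[i] @ X[j] for j in range(3)]
                   for i in range(3)])
print(np.allclose(S, S_loop))  # True
```

Note that <math>\,S = ZZ^T</math> is symmetric and positive semi-definite, which is what makes the QP convex.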

ineqlin: [3x1 double]

=====Upper bound for Hard Margin SVM in MATLAB's Quadprog=====

The theoretical optimization problem for hard-margin SVM provided at the beginning of this lecture gives <math>\, \underline{\alpha}</math> no upper bound. However, when this problem is implemented using MATLAB's <code>quadprog</code>, there are some cases where the correct solution is not produced. It is the experience of students who have taken this course that, for some hard-margin SVM problems, <code>quadprog</code> unexpectedly outputs a solution that has all elements of <math>\, \underline{\alpha}</math> set to zero. Of course, some elements of <math>\, \underline{\alpha}</math> need to be non-zero in order for the support vectors to be determined, so this all-zero result is obviously incorrect. One numerical trick that seems to correct this trouble with <code>quadprog</code> is to set the upper bound on the elements of <math>\, \underline{\alpha}</math> to a large number like <math>\, 10^6</math>. This can be done by setting the eighth parameter of <code>quadprog</code> to a vector with the same dimension as <math>\, \underline{\alpha}</math> and with all elements having a value of <math>\, 10^6</math>.

===Examining K.K.T. conditions===

Points on the margin, with corresponding <math>\,\alpha_i > 0</math>, are called '''''support vectors'''''.

The optimal hyperplane is determined by only a few support vectors. Since it is impossible to know a priori which of the training data points will end up as the support vectors, it is necessary to work with the entire training set to find the optimal hyperplane.

===The support vector machine algorithm===

# Solve the quadratic programming problem: <math>\max_{\alpha} L(\alpha) = \sum_{i=1}^n{\alpha_i} - \frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}}</math> such that <math>\alpha_i \geq 0 \forall{i}</math> and <math>\sum_{i=1}^n{\alpha_i y_i} = 0</math> <br/> (Use Matlab's quadprog to find the optimal <math>\,\underline{\alpha}</math>)

# Find <math>\beta = \sum_{i=1}^n{\alpha_iy_i\underline{x_i}}</math>

# Find <math>\,\beta_0</math> by choosing a support vector (a point with <math>\,\alpha_i > 0</math>) and solving <math>\,y_i(\beta^Tx_i+\beta_0) = 1</math>
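The three steps above can be sketched on a toy separable data set, substituting SciPy's SLSQP solver for MATLAB's <code>quadprog</code> (an assumption; any convex QP solver works, and the data values are made up for illustration):

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data: rows of X are training points x_i.
X = np.array([[0.0, 0.0], [2.0, 2.0], [3.0, 3.0], [-1.0, 0.0]])
y = np.array([-1.0, 1.0, 1.0, -1.0])
n = len(y)

Z = y[:, None] * X
S = Z @ Z.T

# Step 1: solve the dual QP (maximize L(alpha) = minimize -L(alpha)).
def neg_dual(a):
    return 0.5 * a @ S @ a - a.sum()

res = minimize(neg_dual, x0=np.full(n, 0.1),
               bounds=[(0.0, None)] * n,
               constraints=[{"type": "eq", "fun": lambda a: a @ y}],
               method="SLSQP")
alpha = res.x

# Step 2: recover beta from the expansion beta = sum_i alpha_i y_i x_i.
beta = (alpha * y) @ X

# Step 3: recover beta_0 from a support vector (largest alpha_i),
# using y_i (beta^T x_i + beta_0) = 1, i.e. beta_0 = y_i - beta^T x_i.
sv = np.argmax(alpha)
beta0 = y[sv] - beta @ X[sv]
print(beta, beta0)  # roughly [0.5, 0.5] and -1.0
```

For this data only the two closest points, (0,0) and (2,2), end up as support vectors; the resulting boundary is the line <math>\,x_1 + x_2 = 2</math>.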

===Advantages of Support Vector Machines===

* SVMs provide good out-of-sample generalization. By choosing an appropriate generalization grade, SVMs can be robust even when the training sample has some bias. This is mainly due to the selection of the optimal hyperplane.
* SVMs deliver a unique solution, since the optimality problem is convex. This is an advantage compared to neural networks, which have multiple solutions associated with local minima and for this reason may not be robust over different samples.
* SVMs achieve state-of-the-art accuracy on many problems.
* SVMs can handle any data type by changing the kernel.
* The support vector machine algorithm is insensitive to outliers. If <math>\,\alpha_i = 0</math>, the corresponding term contributes nothing to the solution of the SVM problem; only points on the margin — support vectors — contribute. Hence the model given by SVM is entirely defined by the support vectors, which form a very small subset of the entire training set. In this case we have a data-driven or 'nonparametric' model, in which the training set and algorithm determine the support vectors, instead of fitting a set of parameters.

References:

Wang, L, 2005. Support Vector Machines: Theory and Applications, Springer, 3

===Disadvantages of Support Vector Machines [http://www.cse.unr.edu/~bebis/MathMethods/SVM/lecture.pdf]===

*Perhaps the biggest limitation of the support vector approach lies in the choice of the kernel (which we will study later).

*A second limitation is speed and size, both in training and testing (mostly in training; for large training sets, it typically selects a small number of support vectors, thereby minimizing the computational requirements during testing).

*Discrete data presents another problem, although with suitable rescaling excellent results have nevertheless been obtained.

*The optimal design for multiclass SVM classifiers is a further area for research.

*Although SVMs have good generalization performance, they can be abysmally slow in the test phase.

*From a practical point of view, SVMs also have some drawbacks. An important practical question that is not entirely solved is the selection of the kernel function parameters (for Gaussian kernels, the width parameter <math>\,\sigma</math>) and the value of <math>\,\epsilon</math> in the <math>\,\epsilon</math>-insensitive loss function.

*However, from a practical point of view, perhaps the most serious problem with SVMs is the high algorithmic complexity and extensive memory requirements of the required quadratic programming in large-scale tasks.

===Applications of Support Vector Machines===

The following papers describe some of the possible applications of support vector machines:

1. Training support vector machines: an application to face detection [http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=609310 here]

2. Application of support vector machines in financial time series forecasting [http://svms.org/regression/TaCa01.pdf here]

3. Support vector machine active learning with applications to text classification [http://portal.acm.org/citation.cfm?id=944793&dl=GUIDE here]

Note that SVMs start from the goal of separating the data with a hyperplane, and can be extended to non-linear decision boundaries using the kernel trick.

===Kernel Trick===

{{Cleanup|date=November 2010|reason=It would be better to provide a link to an exact proof of the fact that if we project data into a high-dimensional space then the data will become linearly separable.}}

{{Cleanup|date=November 2010|reason=I don't know if such a proof exists; would this not depend on the data, i.e., whether or not a high-dimensional projection would make the data linearly separable?}}

Fig. 1 shows how transforming the data into a higher dimension can make it linearly separable.

[[File:dimensoin.png|350px|thumb|right|Fig 1. Transforming the data can make it linearly separable.]]

We talked about the [http://www.armyconference.org/ACAS00-02/ACAS02ShortCourse/ACASCourse10.pdf curse of dimensionality] at the beginning of this course. However, we now turn to the power of high dimensions in order to find a hyperplane between two classes of data points that can linearly separate the transformed (mapped) data in a space that has a higher dimension than the space in which the training data points reside. To understand this, imagine a two-dimensional prison constraining a two-dimensional person. If we magically give the person a third dimension, they can escape from the prison. In other words, the prison and the person are now linearly separable with respect to the third dimension. The intuition behind the [http://www.cs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec3.pdf kernel trick] is basically to map the data to a higher dimension in which the mapped data are linearly separable by a hyperplane, even if the original data are not linearly separable.

[[File:Point_2d.png|200px|thumb|right|Imagine the point is a person. They're stuck.]]

[[File:Sep2.png|200px|thumb|right|After a simple transformation, a perfect classification plane can be found.]]

The original optimal hyperplane algorithm proposed by [http://en.wikipedia.org/wiki/Vladimir_Vapnik Vladimir Vapnik] in 1963 was a linear classifier. However, in 1992, Bernhard Boser, Isabelle Guyon and Vapnik suggested a way to create non-linear classifiers by applying the kernel trick to maximum-margin hyperplanes. The algorithm is very similar, except that every dot product is replaced by a non-linear kernel function as below. This allows the algorithm to fit the maximum-margin hyperplane in a transformed feature space. We have seen SVM as a linear classification problem that finds the maximum margin hyperplane in the given input space. However, for many real world problems a more complex decision boundary is required. The following simple method was devised in order to solve the same linear classification problem, but in a higher dimensional space, a [http://en.wikipedia.org/wiki/Feature_space feature space], in which the maximum margin hyperplane is better suited.

Let <math>\,\phi</math> be a mapping,

<math>\phi:\mathbb{R}^d \rightarrow \mathbb{R}^D </math>, where <math>\,D > d</math>.<br /><br />

We wish to find a <math>\,\phi</math> such that our data will be suited for separation by a hyperplane. Given this function, we are led to solve the previous constrained quadratic optimization on the transformed dataset,<br /><br />

<math>\max_{\alpha} L(\alpha) = \sum_{i=1}^n{\alpha_i} - \frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_j\phi(x_i)^T\phi(x_j)}}</math> such that <math>\alpha_i \geq 0</math> and <math>\sum_{i=1}^n{\alpha_i y_i} = 0</math><br /><br />

The solution to this optimization problem is now well known; however a workable <math>\,\phi</math> must be determined. Possibly the largest drawback in this method is that we must compute the inner product of two vectors in the high dimensional space. As the number of dimensions in the initial data set increases, the inner product becomes computationally intensive or impossible.

<math>\,\phi(x_i)^T\phi(x_j) = K(x_i,x_j) </math><br /><br />

where K is a ''kernel function'' in the input space satisfying [http://en.wikipedia.org/wiki/Mercer%27s_condition Mercer's condition] (to guarantee that it indeed corresponds to some mapping function <math>\,\phi</math>). As a result, if the objective function depends on inner products but not on coordinates, we can always use a kernel function to calculate implicitly in the feature space without storing the mapped data. Not only does this solve the computation problems, but it no longer requires us to explicitly determine a specific mapping function in order to use this method. In fact, it is now possible to use an infinite dimensional feature space (such as a [http://en.wikipedia.org/wiki/Hilbert_space Hilbert space]) in SVM without even explicitly knowing the function <math>\,\phi</math>.

* one may look at <math>\,x_i^T x_j</math> as a way of measuring similarity, where <math>\,K(\underline{x}_i,\underline{x}_j) </math> is another way of measuring similarity between <math>\,x_i </math> and <math>\,x_j</math>
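The identity <math>\,\phi(x_i)^T\phi(x_j) = K(x_i,x_j)</math> can be checked numerically for the classic degree-2 polynomial map (a standard textbook example, not from the lecture; the vectors <code>a</code>, <code>b</code> are arbitrary):

```python
import numpy as np

# For phi(x) = (x1^2, sqrt(2) x1 x2, x2^2), the 3-d inner product
# equals the kernel K(a, b) = (a^T b)^2, so phi never has to be
# formed explicitly.
def phi(x):
    return np.array([x[0]**2, np.sqrt(2.0) * x[0] * x[1], x[1]**2])

a = np.array([1.0, 3.0])
b = np.array([2.0, -1.0])
print(phi(a) @ phi(b))   # 1.0
print((a @ b) ** 2)      # 1.0
```

The kernel side costs one dot product in <math>\,\mathbb{R}^2</math>; the explicit side requires building vectors in <math>\,\mathbb{R}^3</math>, and the gap grows rapidly with dimension and polynomial degree.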

Available [http://www.youtube.com/watch?v=3liCbRZPrZA here] is a short but informative video by Udi Aharoni that illustrates how kernel SVM uses a kernel to map non-linearly-separable original data to a higher-dimensional space, finds a hyperplane in that space that linearly separates the implicitly mapped data, and how this hyperplane ultimately translates to a non-linear decision boundary in the original space that classifies the original data.

+ | |||

+ | |||

+ | ====Popular kernel choices for SVM==== | ||

There are many types of kernels that can be used in Support Vector Machine models. These include linear, polynomial and radial basis function (RBF) kernels.

linear: <math>\ K(\underline{x}_{i},\underline{x}_{j})= \underline{x}_{i}^T\underline{x}_{j}</math>

polynomial: <math>\ K(\underline{x}_{i},\underline{x}_{j})= (\gamma\underline{x}_{i}^T\underline{x}_{j}+r)^{d}, \gamma > 0</math>

radial basis: <math>\ K(\underline{x}_{i},\underline{x}_{j})= \exp(-\gamma \|\underline{x}_i - \underline{x}_j\|^{2}), \gamma > 0</math>

Gaussian: <math>\ K(x_i,x_j)=\exp\left(\frac{-\|x_i-x_j\|^2}{2\sigma^2 }\right)</math>

hyperbolic tangent: <math>\ K(x_i,x_j)=\tanh(k_1\underline{x}_{i}^T\underline{x}_{j}+k_2)</math>

The RBF kernel is by far the most popular choice of kernel used in Support Vector Machines, mainly because of its localized and finite response across the entire range of the real x-axis. The art of flexible modeling using basis expansions consists of picking an appropriate family of basis functions, and then controlling the complexity of the representation by selection, regularization, or both. Some families of basis functions have elements that are defined locally; for example, <math>\displaystyle B</math>-splines are defined locally in <math>\displaystyle R</math>. If more flexibility is desired in a particular region, then that region needs to be represented by more basis functions (which in the case of <math>\displaystyle B</math>-splines translates to more knots). Kernel methods achieve flexibility by fitting simple models in a region local to the target point <math>\displaystyle x_0</math>. Localization is achieved via a weighting kernel <math>\displaystyle K</math>, and individual observations receive weights <math>\displaystyle K(x_0,x_i)</math>. The RBF kernel combines these ideas by treating the kernel functions as basis functions.
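As a sketch (in Python rather than the Matlab used later on this page), the kernels above can be written directly from their formulas; the particular <math>\,\gamma</math>, <math>\,r</math>, <math>\,d</math> and <math>\,\sigma</math> values are arbitrary illustrative defaults. Note that the radial basis kernel with <math>\,\gamma = 1/(2\sigma^2)</math> coincides with the Gaussian kernel:

```python
import numpy as np

def linear(xi, xj):
    return xi @ xj

def polynomial(xi, xj, gamma=1.0, r=1.0, d=3):
    return (gamma * (xi @ xj) + r) ** d

def rbf(xi, xj, gamma=0.5):
    return np.exp(-gamma * np.sum((xi - xj) ** 2))

def gaussian(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2))

xi = np.array([1.0, 0.0])
xj = np.array([0.0, 1.0])
# rbf with gamma = 1/(2*sigma^2) gives the same value as gaussian
print(rbf(xi, xj, gamma=0.5), gaussian(xi, xj, sigma=1.0))
```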

− | |||

− | + | Kernels can also be constructed from other kernels using the following rules: | |

Let <math>\,a(x,x')</math> and <math>\,b(x,x')</math> both be valid kernel functions. Then each of the following is also a valid kernel:

:<math>\, k(x,x') = c\,a(x,x') \quad \forall c > 0 </math>
:<math>\, k(x,x') = f(x)a(x,x')f(x') \quad \forall</math> functions <math>\,f(x)</math>
:<math>\, k(x,x') = p(a(x,x')) \quad \forall </math> polynomial functions <math>\,p</math> with non-negative coefficients
:<math>\, k(x,x') = e^{a(x,x')} </math>
:<math>\, k(x,x') = a(x,x') + b(x,x') </math>
:<math>\, k(x,x') = a(x,x')b(x,x') </math>
:<math>\, k(x,x') = k_3(\phi(x),\phi(x')) \quad \forall </math> valid kernels <math>\,k_3</math> over the dimension of <math>\,\phi(x)</math>
:<math>\, k(x,x') = x^{T}Ax' \quad \forall A \succeq 0 </math>
:<math>\, k(x,x') = k_c(x_c,x_c') + k_d(x_d,x_d') </math> where <math>\, x_c, x_d </math> are variables with <math>\, x = (x_c,x_d) </math> and where <math>\, k_c, k_d </math> are valid kernel functions
:<math>\, k(x,x') = k_c(x_c,x_c')k_d(x_d,x_d') </math> where <math>\, x_c, x_d </math> are variables with <math>\, x = (x_c,x_d) </math> and where <math>\, k_c, k_d </math> are valid kernel functions

Once we have chosen the kernel function, we don't need to figure out what <math>\,\phi</math> is; we just use <math>\,\phi(\underline{x}_i)^T\phi(\underline{x}_j) = K(\underline{x}_i,\underline{x}_j) </math> to replace <math>\,\underline{x}_i^T\underline{x}_j</math>.
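One can check these closure rules numerically: by Mercer's condition, a valid kernel must produce a positive semi-definite Gram matrix on any finite data set. The sketch below (Python, with randomly generated points) builds a few kernels from a linear kernel <math>\,a</math> and an RBF kernel <math>\,b</math> using the rules above and verifies that each resulting Gram matrix has no significantly negative eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))   # small random data set

def gram(k, X):
    # Gram matrix K_ij = k(x_i, x_j)
    n = len(X)
    return np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])

a = lambda x, z: x @ z                          # linear kernel
b = lambda x, z: np.exp(-np.sum((x - z) ** 2))  # RBF kernel

combined = [
    lambda x, z: 3 * a(x, z),        # scaling by c > 0
    lambda x, z: a(x, z) + b(x, z),  # sum of kernels
    lambda x, z: a(x, z) * b(x, z),  # product of kernels
    lambda x, z: np.exp(a(x, z)),    # exponential of a kernel
]

for k in combined:
    eigvals = np.linalg.eigvalsh(gram(k, X))
    # positive semi-definite up to floating-point round-off
    assert eigvals.min() > -1e-8
print("all combined Gram matrices are positive semi-definite")
```

This is only a spot check on one data set, not a proof, but a single negative eigenvalue would be enough to rule a candidate kernel out.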

Reference for rules: [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap6/Chap6.1-KernelMethods.pdf Rules]

Since the transformation chosen is dependent on the shape of the data, the only automated way to choose an appropriate kernel is by trial and error; otherwise it is chosen manually.

====Kernel Functions for Machine Learning Applications====
Beyond the kernel functions we discussed in class (the linear, polynomial and Gaussian kernels), many more kernel functions can be used in kernel methods for machine learning. Some examples of other kernels are: the exponential, Laplacian, ANOVA, hyperbolic tangent (sigmoid), rational quadratic, multiquadric, inverse multiquadric, circular, spherical, wave, power, log, spline, B-spline, Bessel, Cauchy, chi-square, histogram intersection, generalized histogram intersection, generalized T-Student, Bayesian and wavelet kernels. For more details, see http://crsouza.blogspot.com/2010/03/kernel-functions-for-machine-learning.html#kernel_functions.

===Example in Matlab===

The following code, taken verbatim from the lecture, shows how to use Matlab built-in SVM routines (found in the Bioinformatics toolkit) to do classification through support vector machines.

Note: in Matlab R2009, the svmtrain function returns some support vectors that should not be marked as such, possibly because svmtrain centers and rescales the data automatically. The issue has been fixed in later versions.

load 2_3;


yh = svmclassify(svmStruct, data(test,:), 'showPlot', true, 'Kernel_Function','rbf');

===Support Vector Machines as a Regression Technique===
The idea of support vector machines has also been applied to regression problems, under the name [http://svms.org/regression/ support vector regression]. It retains all the main features that characterize the maximum-margin algorithm: a non-linear function is learned by a linear learning machine mapping into a high-dimensional kernel-induced feature space, and the capacity of the system is controlled by parameters that do not depend on the dimensionality of the feature space. As with the classification approach, there is motivation to seek and optimize the generalization bounds given for regression. These rely on defining a loss function that ignores errors situated within a certain distance of the true value; this type of function is often called an epsilon-insensitive loss function. The figure below shows an example of a one-dimensional linear regression function with an epsilon-insensitive band. The slack variables measure the cost of the errors on the training points; these are zero for all points inside the band (you may want to continue reading about this [http://kernelsvm.tripod.com/ here]).

Here are some papers and works on this matter, by [http://svms.org/regression/SmSc98.pdf A. J. Smola and B. Scholkopf], and [http://www.cmlab.csie.ntu.edu.tw/~cyy/learning/papers/SVR_WellingsNote.pdf M. Welling].

=== 1-norm support vector regression ===

[[image: Norm 1.png]]

Pseudocode for 1-norm support vector regression.

Source: John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, illustrated edition, June 2004.

=== 2-norm support vector regression ===

[[image: Norm 2.png]]

Pseudocode for 2-norm support vector regression.

Source: John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, illustrated edition, June 2004.

===Extension: Support Vector Machines===

==== Pattern Recognition ====
[http://research.microsoft.com/en-us/um/people/cburges/papers/svmtutorial.pdf]

This paper discusses linear Support Vector Machines for separable and non-separable data by working through a non-trivial example in detail. It also describes a mechanical analogy, and explains when SVM solutions are unique and when they are global. From this paper we can learn how support vector training can be practically implemented, and about the kernel mapping technique, which is used to construct SVM solutions that are nonlinear in the data.

Results of some experiments which were inspired by these arguments are also presented.

The writer gives numerous examples and proofs of most of the key theorems; he hopes that readers will find old material cast in a fresh light, since the paper also includes some new material.

==== Emotion Recognition ====
Moreover, the Linear Support Vector Machine (LSVM) has been used for emotion recognition from the facial expressions and voices of subjects. In this approach, features of each subject's different emotional expressions are extracted; then, LSVM is used to classify the extracted feature vectors into different emotion classes.[4]

=== Further reading ===
The following are a few papers in which different approaches to, and further explanations of, support vector machines are given:

1- Least Squares Support Vector Machine Classifiers [http://www.springerlink.com/content/n75178640w32646j/ here]

2- Support vector machine classification and validation of cancer tissue samples using microarray expression data [http://bioinformatics.oxfordjournals.org/content/16/10/906.abstract here]

3- Support vector machine active learning for image retrieval [http://portal.acm.org/citation.cfm?id=500159 here]

4- Support vector machine learning for interdependent and structured output spaces [http://portal.acm.org/citation.cfm?id=1015341&dl=GUIDE here]

===References===

1. The genetic kernel support vector machine: Description and evaluation


3. Classification using intersection kernel support vector machines is efficient

[http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4587630]

4. Das, S.; Halder, A.; Bhowmik, P.; Chakraborty, A.; Konar, A.; Janarthanan, R., [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5393891&isnumber=5393306 A support vector machine classifier of emotion from voice and facial expression data], Nature & Biologically Inspired Computing, 2009 (NaBIC 2009), World Congress on, pp. 1010-1015, 9-11 Dec. 2009.

== ''' Support Vector Machine, Kernel Trick - Cont. Case II - November 16, 2010 ''' ==

==='''Case II: Non-separable data (Soft Margin)'''===

We have seen how SVMs are able to find an optimal separating hyperplane between two separable classes of data, in which case the margin contains no data points. In the real world, however, data of different classes are usually mixed together at the boundary, and it is hard to find a perfect boundary that totally separates them. In this case, one may want to separate the training data set with the minimal number of errors. To address this problem, we slacken the classification rule to allow data to cross the margin, so that each data point can have some error <math>\,\xi_i</math>. However, we only want points to cross the boundary when they have to, making the minimum sacrifice; thus, a penalty term is added to the objective function to constrain the number of points that cross the margin. The optimization problem now becomes:

[[File:non-separable.JPG|350px|thumb|right|Figure: the non-separable case]]

:<math>\min_{\beta,\beta_0,\xi} \frac{1}{2}|\beta|^2+\gamma\sum_{i=1}^n{\xi_i}</math>
:<math>\,s.t.</math> <math>y_i(\beta^Tx_i+\beta_0) \geq 1-\xi_i</math>
:<math>\xi_i \geq 0</math>

<br\>Note that <math>\,\xi_i</math> is not necessarily smaller than one, which means data points can not only enter the margin but can also cross the separating hyperplane.

<br\>Minimizing the objective, one finds some minimal subset of errors. If these error points are excluded from the training data set, then one can separate the remaining part of the training data without errors.

<br\>Soft-margin SVM is a generalization of hard-margin SVM. The hard-margin SVM was presented here before the soft-margin SVM because hard-margin SVM is more intuitive and was historically conceived first. Note that as <math>\,\gamma \rightarrow \infty </math>, all <math>\,\xi_i \rightarrow 0</math>. Intuitively, we would set <math>\,\xi_i = 0</math> if it were known that the classes were separable, that is, if we knew a hard-margin SVM boundary could be used to separate the classes. As shown later in this lecture, the soft-margin classifier with <math>\,\gamma = \infty </math> is, in fact, the same optimization problem as the hard-margin classifier mentioned in the previous lecture. By similar logic, it is wise to set a higher <math>\,\gamma</math> for sets that are more separable.

With the formulation of the primal form for the non-separable case above, we can form the Lagrangian.

===Forming the Lagrangian===
In this case we have two constraints in the [http://en.wikipedia.org/wiki/Lagrangian Lagrangian] primal form, and therefore we optimize with respect to two dual variables <math>\,\alpha</math> and <math>\,\lambda</math>:<br>
:<math>L = \frac{1}{2} |\beta|^2 + \gamma \sum_{i} \xi_i - \sum_{i} \alpha_i[y_i(\beta^T x_i+\beta_0)-1+\xi_i]-\sum_{i} \lambda_i \xi_i</math>
:<math>\alpha_i \geq 0, \lambda_i \geq 0</math>

Now we apply the KKT conditions and come up with a new function to optimize. As we will see, the function that we optimize in the SVM algorithm for non-separable data sets is the same as in the separable case, with slightly different constraints.

===Applying KKT conditions[http://en.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions]===
# <math>\frac{\partial L}{\partial p} = 0</math> at an optimal solution <math>\, \hat p</math>, for each primal variable <math>\,p = \{\beta, \beta_0, \xi\}</math><br><math>\frac{\partial L}{\partial \beta}=\beta - \sum_{i} \alpha_i y_i x_i = 0 \Rightarrow \beta=\sum_{i}\alpha_i y_i x_i</math> <br\><math>\frac{\partial L}{\partial \beta_0}=-\sum_{i} \alpha_i y_i =0 \Rightarrow \sum_{i} \alpha_i y_i =0</math>, since the sign does not make a difference<br><math>\frac{\partial L}{\partial \xi_i}=\gamma - \alpha_i - \lambda_i = 0 \Rightarrow \gamma = \alpha_i+\lambda_i</math>. This is the only new condition added here.
#<math>\,\alpha_i \geq 0, \lambda_i \geq 0</math>, dual feasibility
#<math>\,\alpha_i[y_i(\beta^T x_i+\beta_0)-1+\xi_i]=0</math> and <math>\,\lambda_i\xi_i=0</math>, complementary slackness
#<math>\,y_i( \beta^T x_i+ \beta_0)-1+ \xi_i \geq 0</math>, primal feasibility

=== Objective Function ===
With our KKT conditions and the Lagrangian equation, <math>\,\alpha</math> can be estimated by quadratic programming.

<br\> Similar to what we did for the separable case, after applying the KKT conditions we replace the primal variables in the Lagrangian with expressions in the dual variables and simplify as follows:

:<math>L = \frac{1}{2} |\beta|^2 + \gamma \sum_{i} \xi_i - \beta^T \sum_{i} \alpha_i y_i x_i - \beta_0 \sum_{i} \alpha_i y_i + \sum_{i} \alpha_i - \sum_{i} \alpha_i \xi_i - \sum_{i} \lambda_i \xi_i</math>

From the KKT conditions:
:<math> \beta = \sum_{i} \alpha_i y_i x_i \Rightarrow \beta^T\beta = |\beta|^2</math> and
:<math> \displaystyle \sum_{i} \alpha_i y_i = 0</math>

Rewriting the above equation, we have:

:<math>L = \frac{1}{2} |\beta|^2 - |\beta|^2 + \gamma \sum_{i} \xi_i + \sum_{i} \alpha_i - \sum_{i} \alpha_i \xi_i - \sum_{i} \lambda_i \xi_i</math>

We know that <math>\frac{1}{2} |\beta|^2 - |\beta|^2 = -\frac{1}{2} |\beta|^2 = - \frac{1}{2} \sum_{i} \sum_{j} \alpha_i \alpha_j y_i y_j x_i^T x_j </math>

:<math>\Rightarrow L = - \frac{1}{2} \sum_{i} \sum_{j} \alpha_i\alpha_j y_i y_j x_i^T x_j + \sum_{i} \alpha_i + \sum_{i} \gamma \xi_i - \sum_{i} \alpha_i \xi_i - \sum_{i} \lambda_i \xi_i</math>

:<math>\Rightarrow L = - \frac{1}{2} \sum_{i} \sum_{j} \alpha_i\alpha_j y_i y_j x_i^T x_j + \sum_{i} \alpha_i + \sum_{i} (\gamma - \alpha_i - \lambda_i) \xi_i</math>

We know by the KKT conditions that <math>\displaystyle \gamma - \alpha_i - \lambda_i = 0 </math>.

Finally we have the simplest form of the Lagrangian for the non-separable case:

:<math>L = \sum_{i} \alpha_i - \frac{1}{2} \sum_{i} \sum_{j} \alpha_i\alpha_j y_i y_j x_i^T x_j </math>

You can see that there is no difference between the objective functions of the hard and soft margins. Now let's look at the constraints for the above objective function.

=== Constraints ===
The constraints of the above objective function are:

:<math>\,\alpha_i \geq 0 \quad \forall i</math>
:<math>\,\lambda_i \geq 0 \quad \forall i</math>
:<math>\displaystyle \sum_{i} \alpha_i y_i = 0</math><br />

From the KKT conditions above, we have:<br />
<math>\frac{\partial L}{\partial \xi_i}=\gamma - \alpha_i - \lambda_i = 0 \Rightarrow \gamma = \alpha_i+\lambda_i</math><br />

Therefore, <math>\displaystyle \lambda_i \ge 0 \,\Rightarrow \, \alpha_i \le \gamma</math>; hence, the <math>\,\lambda_i \geq 0 </math> constraint can be replaced by <math>\displaystyle \alpha_i \le \gamma</math>.

===Dual Problem or Quadratic Programming Problem===

We have formalized the dual problem, which is as follows:

:<math>\displaystyle \max_{\alpha_i} \sum_{i}{\alpha_i} - \frac{1}{2}\sum_{i}{\sum_{j}{\alpha_i \alpha_j y_i y_j x_i^T x_j}}</math>

subject to the constraints
:<math> \displaystyle 0 \le \alpha_i \le \gamma </math> and
:<math>\displaystyle \sum_{i}{\alpha_i y_i} = 0</math>

You can see that the only difference between the hard and soft margins is the upper bound on <math>\displaystyle \alpha</math>, i.e. <math>\displaystyle \alpha_i \le \gamma</math>.

As <math>\displaystyle \gamma \rightarrow \infty </math>, the soft margin <math>\displaystyle \rightarrow</math> the hard margin.

=== Recovery of Hyperplane ===

We can easily recover the hyperplane <math>\displaystyle \underline \beta^T \underline x + \beta_0 = 0</math> by finding the values of <math>\displaystyle \underline \beta</math> and <math>\displaystyle \beta_0</math>.

* <math>\displaystyle \underline \beta</math> can be calculated from the first KKT condition, i.e. <math>\displaystyle \underline \beta = \sum_{i} \alpha_i y_i \underline x_i</math>

* <math>\displaystyle \beta_0</math> can be calculated by choosing a point that satisfies <math> \displaystyle 0 < \alpha_i < \gamma </math> (so that <math>\displaystyle \xi_i = 0</math>); then the third KKT condition becomes
:: <math>\displaystyle y_i( \underline \beta^T \underline x_i+ \beta_0)=1</math>, which can be solved for <math>\displaystyle \beta_0</math>

===SVM algorithm for non-separable data sets===

The algorithm for non-separable data sets is:

# Use <code>quadprog</code> (or another quadratic programming technique) to solve the above optimization problem and find <math>\,\alpha</math>
# Find <math>\,\underline{\beta}</math> from <math>\,\underline{\beta} = \sum_{i}{\alpha_i y_i \underline x_i}</math>
# Find <math>\,\beta_0</math> by choosing a point where <math>\,0 < \alpha_i < \gamma</math> and solving <math>\,y_i(\underline{\beta}^T \underline x_i + \beta_0) - 1 = 0</math>
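The three steps above can be traced by hand on a tiny example. The sketch below (in Python rather than Matlab, with made-up data) uses two training points, one per class; the constraint <math>\,\sum_i \alpha_i y_i = 0</math> then forces both multipliers to share one value <math>\,t</math>, so the dual maximum can be found in closed form instead of calling <code>quadprog</code>:

```python
import numpy as np

# Two training points, one per class; sum_i alpha_i y_i = 0
# forces alpha_1 = alpha_2 = t here.
x1, x2 = np.array([2.0, 2.0]), np.array([0.0, 0.0])
y1, y2 = 1.0, -1.0
gamma = 10.0  # soft-margin penalty, i.e. the upper bound on each alpha

# The dual reduces to W(t) = 2t - 0.5 * t^2 * ||x1 - x2||^2,
# maximized at t = 2 / ||x1 - x2||^2, clipped to the box [0, gamma].
t = min(max(2.0 / np.sum((x1 - x2) ** 2), 0.0), gamma)

# Step 2: beta = sum_i alpha_i y_i x_i
beta = t * y1 * x1 + t * y2 * x2
# Step 3: solve y_i (beta^T x_i + beta_0) = 1 at a point with 0 < alpha < gamma
beta0 = y1 - beta @ x1

print(t, beta, beta0)
```

Both points end up exactly on the margin, as expected when both multipliers are strictly between 0 and <math>\,\gamma</math>.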

=== Support Vectors ===

Kernel-based techniques (such as support vector machines, Bayes point machines, kernel principal component analysis, and Gaussian processes) represent a major development in machine learning algorithms. Support vector machines (SVM) are a group of supervised learning methods that can be applied to classification or regression.<ref name="cccc"> Ovidiu Ivanciuc, Review: Applications of Support Vector Machines in Chemistry, Rev. Comput. Chem. 2007, 23, 291-400</ref> Support vectors are the training points that determine the optimal separating hyperplane that we seek. They are also the most difficult points to classify, and at the same time the most informative for classification.

For the non-separable case, the third KKT condition yields: if <math>\displaystyle \alpha_i > 0 \Rightarrow y_i(\underline \beta^T \underline x_i+\beta_0)-1+\xi_i=0</math>. Such points are called support vectors.

* Case 1: Support vectors on the margin
::If <math>\displaystyle \lambda_i > 0 \Rightarrow \xi_i = 0 </math>, then this support vector lies on the margin.

* Case 2: Support vectors inside the margin
::If <math>\displaystyle \alpha_i = \gamma</math>, then this support vector lies inside the margin.

=== Support Vector Machine Demo Tool ===

[[image:SVM_Demo.png]]

This demo tool shows the linear boundary found by SVM and illustrates its behaviour on some 2D data. It is an interactive demonstration providing insight into how SVM finds the classification boundary. [http://www.mathworks.com/matlabcentral/fileexchange/28302-svm-demo File]

=== Relevance Vector Machines ===
Support vector machines have been used in a variety of classification and regression applications. Nevertheless, they suffer from a number of limitations, several of which have been highlighted already in earlier sessions. In particular, the outputs of an SVM represent decisions rather than posterior probabilities. Also, the SVM was originally formulated for two classes, and the extension to K > 2 classes is problematic. There is a complexity parameter C, or ν (as well as a parameter epsilon in the case of regression), that must be found using a hold-out method such as cross-validation. Finally, predictions are expressed as linear combinations of kernel functions that are centred on training data points and that are required to be positive definite.

The relevance vector machine or RVM (Tipping, 2001) is a Bayesian sparse kernel technique for regression and classification that shares many of the characteristics of the SVM whilst avoiding its principal limitations. Additionally, it typically leads to much sparser models, resulting in correspondingly faster performance on test data whilst maintaining comparable generalization error.

This summary is borrowed from Bishop's great book, Pattern Recognition and Machine Learning [http://research.microsoft.com/en-us/um/people/cmbishop/prml/]. For more details on the RVM, you may refer to this book.

=== Further reading on the Kernel Trick ===
1- The kernel trick for distances [http://74.125.155.132/scholar?q=cache:AfKdFY6a1cMJ:scholar.google.com/&hl=en&as_sdt=2000 here]

2- Exploiting the kernel trick to correlate fragment ions for peptide identification via tandem mass spectrometry [http://bioinformatics.oxfordjournals.org/content/20/12/1948.short here]

3- Kernel-based methods and function approximation [http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=939539 here]

4- SVM-KNN: Discriminative Nearest Neighbor Classification for Visual Category Recognition [http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1641014 here]

5- SVM application list [http://www.clopinet.com/isabelle/Projects/SVM/applist.html]

6- Some readings about SVM and the kernel trick [http://www.cs.cmu.edu/~guestrin/Class/10701-S07/Slides/kernels.pdf] and [http://www.cs.cmu.edu/~tom/10601_sp08/slides/svm3-26.ppt]

7- A general overview of SVM and kernel methods in an easy-to-understand presentation [http://www.support-vector.net/icml-tutorial.pdf]

== ''' Naive Bayes, K Nearest Neighbours, Boosting, Bagging and Decision Trees - November 18, 2010 ''' ==

Now that we've covered a number of more advanced classification algorithms, we can look at some of the simpler classification algorithms that are usually discussed at the beginning of a treatment of classification.

=== [http://en.wikipedia.org/wiki/Naive_Bayes_classifier Naive Bayes Classifiers] ===

Recall that one of the major drawbacks of the Bayes classifier was the difficulty of estimating a joint density in a multidimensional space. Naive Bayes classifiers are one possible solution to this problem. They are especially popular for problems with high-dimensional features.

A naive Bayes classifier applies a strong independence assumption to the conditional probability <math>\ P(X|Y) = P(x_1,x_2,...,x_d |Y)</math>: it assumes that the input dimensions within each class are conditionally independent. This dramatically reduces the number of parameters to be estimated when modeling <math>\ P(X|Y)</math>.

Under the conditional independence assumption:

<math>\ P(X|Y) = P(x_1,x_2,...,x_d |Y) =\prod_{i=1}^{d}P(X_i = x_i | Y)</math>.
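The payoff of this factorization is the reduction in the number of parameters. For binary features and a binary class label, modeling the full conditional joint requires <math>\,2(2^d-1)</math> free parameters, while the naive Bayes factorization requires only <math>\,2d</math>; the short computation below illustrates the gap:

```python
# Parameter counts for modeling P(x_1, ..., x_d | Y) with d binary
# features and a binary class label: the full conditional joint has
# 2^d outcomes per class (2^d - 1 free parameters each), while the
# naive Bayes factorization needs one Bernoulli parameter per
# feature per class.
for d in (5, 10, 20):
    full = 2 * (2 ** d - 1)
    naive = 2 * d
    print(d, full, naive)
```

Already at d = 20 the full joint needs over two million parameters, far more than any realistic training set could support, while naive Bayes needs only 40.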

====Naive Bayes is Equivalent to a Generalized Additive Model====

Continuing with the notation used above, consider a binary classification problem <math>\, y \in \{0,1\}</math>. Beginning with the corresponding Bayes classifier, we have

:<math>\, h^*(x)= \left\{\begin{matrix} 1 &\text{if } P(Y=1|X=x)>P(Y=0|X=x) \\ 0 &\mathrm{otherwise} \end{matrix}\right.</math>

:<math>\, = \left\{\begin{matrix} 1 &\text{if } \log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)>0 \\ 0 &\mathrm{otherwise} \end{matrix}\right.</math>

:<math>\, = \frac{1}{2}\text{sign}\left[\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)\right] + \frac{1}{2}</math>

:<math>\, = \frac{1}{2}\text{sign}\left[\log\left(\frac{P(Y=1)P(X=x|Y=1)}{P(Y=0)P(X=x|Y=0)}\right)\right] + \frac{1}{2}</math>

Then, by the conditional independence assumption of naive Bayes,

:<math>\, = \frac{1}{2}\text{sign}\left[\log\left(\frac{P(Y=1)}{P(Y=0)} \prod_{i=1}^d \frac{P(X_i=x_i|Y=1)}{P(X_i=x_i|Y=0)}\right)\right] + \frac{1}{2}</math>

:<math>\, = \frac{1}{2}\text{sign}\left[\log\left(\frac{P(Y=1)}{P(Y=0)}\right) + \sum_{i=1}^d \log\left(\frac{P(X_i=x_i|Y=1)}{P(X_i=x_i|Y=0)}\right)\right] + \frac{1}{2}</math>

which is a generalized additive model.

==== Naive Bayes for Continuous Inputs ====

A naive Bayes classifier applies a strong independence assumption to the class density <math>\,f_{k}(x)</math>.

Recall that the Bayes rule is:

<math>\ h(x) = \arg\max_{ k} \pi_{k}f_{k}(x). </math>

Although the Bayes classifier is the best classifier, in practice it is difficult to estimate the multivariate class densities required to carry out the classification. Therefore, by assuming independence between the features, we can transform one <math>\,d</math>-variable distribution into <math>\,d</math> independent one-variable distributions, which are easier to handle, and then apply the Bayes classification.

Under the independence assumption, the density function of the inputs can be written as:

<math>\ f_{k}(x) = f_{k}(x_1 ,x_2,...,x_d) = \prod_{j=1}^d f_{kj}(x_{j})</math>

Each of the <math>\,d</math> marginal densities can be estimated separately using one-dimensional density estimates. If one of the components <math>\,x_{j}</math> is discrete, then its density can be estimated using a histogram. We can thus mix discrete and continuous variables in a naive Bayes classifier.
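The sketch below (in Python, with synthetic two-class data) illustrates this recipe for continuous inputs: each class density is modeled as a product of per-dimension univariate Gaussians fitted separately, and a point is assigned to the class with the larger <math>\,\log \pi_k + \log f_k(x)</math>:

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy 2-class data: class 0 centred at (0, 0), class 1 at (3, 3)
X0 = rng.normal(0.0, 1.0, size=(50, 2))
X1 = rng.normal(3.0, 1.0, size=(50, 2))

def fit(X):
    # One univariate Gaussian per dimension: the independence assumption
    return X.mean(axis=0), X.std(axis=0)

def log_density(x, mu, sd):
    # log f_k(x) = sum_j log f_kj(x_j) for Gaussian marginals
    return np.sum(-0.5 * ((x - mu) / sd) ** 2
                  - np.log(sd * np.sqrt(2 * np.pi)))

params = [fit(X0), fit(X1)]
priors = [0.5, 0.5]   # pi_k, assumed equal here

def classify(x):
    scores = [np.log(p) + log_density(x, mu, sd)
              for p, (mu, sd) in zip(priors, params)]
    return int(np.argmax(scores))

print(classify(np.array([0.2, -0.1])), classify(np.array([2.8, 3.1])))
```

Working in log space avoids underflow when the product runs over many dimensions; a discrete feature could be accommodated by swapping its Gaussian marginal for a histogram estimate.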

Naive Bayes classifiers often perform extremely well in practice despite these 'naive' and seemingly optimistic assumptions. This is because, while the individual class density estimates may be biased, the bias does not necessarily carry through to the posterior probabilities.

It is also possible to train naive Bayes classifiers using maximum likelihood estimation.

An interesting example by Jose M. Vidal that shows how the naive Bayes classifier can be used to solve a real-world classification task is available [http://jmvidal.cse.sc.edu/talks/bayesianlearning/nbex.xml here].

==== Naive Bayes for Discrete Inputs ====

Naive Bayes with discrete inputs is very similar to the continuous-input case. The major difference is that, instead of using a probability distribution to characterize the likelihood, we use feature frequencies: the proportion of cases in which a variable takes a given value within class <math>\,C</math>, out of the total number of cases in class <math>\,C</math>. The following example shows how this works:

You are running a scientific study meant to find the optimal conditions under which a girl you encounter will wear her glasses. The data you collect represent the setting of your encounter (library, park, bar), whether she is a student (yes, no), and her eye colour (blue, green, brown).

{|
|-
! scope="col" | Case
! scope="col" | Setting
! scope="col" | Student
! scope="col" | Eye colour
! scope="col" | Wears glasses?
|-
! scope="row" | 1
| Bar || yes || Blue || no
|-
! scope="row" | 2
| Park || yes || Brown || yes
|-
! scope="row" | 3
| Library || no || Green || yes
|-
! scope="row" | 4
| Library || no || Blue || no
|-
! scope="row" | 5
| Bar || no || Brown || yes
|-
! scope="row" | 6
| Park || yes || Green || yes
|-
! scope="row" | 7
| Bar || no || Brown || yes
|-
! scope="row" | 8
| Library || yes || Brown || yes
|-
! scope="row" | 9
| Bar || yes || Green || no
|-
! scope="row" | 10
| Park || yes || Blue || no
|}

+ | |||

+ | |||

+ | From this, we extract the following feature frequencies: | ||

+ | |||

+ | {| | ||

+ | |- | ||

+ | ! scope="col" | Eye Colour | ||

+ | ! scope="col" | Wearing glasses | ||

+ | ! scope="col" | Not wearing glasses | ||

+ | |- | ||

+ | ! scope="row" | Blue | ||

+ | | 0 || 3 | ||

+ | |- | ||

+ | ! scope="row" | Brown | ||

+ | | 4 || 0 | ||

+ | |- | ||

+ | ! scope="row" | Green | ||

+ | | 2 || 1 | ||

+ | |} | ||

+ | |||

+ | {| | ||

+ | |- | ||

+ | ! scope="col" | Student? | ||

+ | ! scope="col" | Wearing glasses | ||

+ | ! scope="col" | Not wearing glasses | ||

+ | |- | ||

+ | ! scope="row" | Not a student | ||

+ | | 3 || 1 | ||

+ | |- | ||

+ | ! scope="row" | Student | ||

+ | | 3 || 3 | ||

+ | |} | ||

+ | |||

+ | {| | ||

+ | |- | ||

+ | ! scope="col" | Setting | ||

+ | ! scope="col" | Wearing glasses | ||

+ | ! scope="col" | Not wearing glasses | ||

+ | |- | ||

+ | ! scope="row" | Bar | ||

+ | | 2 || 2 | ||

+ | |- | ||

+ | ! scope="row" | Library | ||

+ | | 2 || 1 | ||

+ | |- | ||

+ | ! scope="row" | Park | ||

+ | | 2 || 1 | ||

+ | |} | ||

+ | |||

+ | You also note that of the 10 girls you saw, 6 were wearing their glasses and 4 weren't. Therefore, given the new case of a green eyed student in a bar, we calculate the probabilities of her wearing vs. not wearing her glasses as such: | ||

+ | |||

+ | P(Wearing glasses | green-eyed student in a bar) ∝ P(Wearing glasses)*P(Student | Wearing glasses)*P(Green eyes | Wearing glasses)*P(Bar | Wearing glasses) = 6/10 * 3/6 * 2/6 * 2/6 ≈ 0.0333<br> | ||

+ | P(Not wearing glasses | green-eyed student in a bar) ∝ P(Not wearing glasses)*P(Student | Not wearing glasses)*P(Green eyes | Not wearing glasses)*P(Bar | Not wearing glasses) = 4/10 * 3/4 * 1/4 * 2/4 = 0.0375<br> | ||

+ | |||

+ | Since 0.0333 < 0.0375, the naive Bayes classifier predicts that a green-eyed student in a bar will not be wearing her glasses. | ||
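+ | The calculation above can be sketched in a few lines of Python; the score below multiplies the class prior by the class-conditional feature frequencies, which is the naive Bayes posterior up to normalization: | ||

```python
# Naive Bayes with discrete inputs on the glasses data above.
# No smoothing is applied, so the numbers match the hand calculation.

# (setting, student, eye_colour, wears_glasses) for the 10 cases
data = [
    ("bar", "yes", "blue", "no"),
    ("park", "yes", "brown", "yes"),
    ("library", "no", "green", "yes"),
    ("library", "no", "blue", "no"),
    ("bar", "no", "brown", "yes"),
    ("park", "yes", "green", "yes"),
    ("bar", "no", "brown", "yes"),
    ("library", "yes", "brown", "yes"),
    ("bar", "yes", "green", "no"),
    ("park", "yes", "blue", "no"),
]

def score(label, features):
    """Unnormalized posterior: prior * product of per-feature frequencies."""
    rows = [r for r in data if r[3] == label]
    prior = len(rows) / len(data)
    s = prior
    for j, value in enumerate(features):
        s *= sum(1 for r in rows if r[j] == value) / len(rows)
    return s

query = ("bar", "yes", "green")   # green-eyed student in a bar
yes = score("yes", query)         # 6/10 * 3/6 * 2/6 * 2/6 = 0.0333...
no = score("no", query)           # 4/10 * 3/4 * 1/4 * 2/4 = 0.0375
print("wears glasses" if yes > no else "no glasses")  # -> no glasses
```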

+ | |||

+ | ==== Further reading Naive Bayes ==== | ||

+ | |||

+ | The following papers show how naive Bayes is used in different classification settings. | ||

+ | |||

+ | 1- An empirical study of the naive Bayes classifier [http://www.cc.gatech.edu/home/isbell/classes/reading/papers/Rish.pdf here] | ||

+ | |||

+ | 2- Naive (Bayes) at forty: The independence assumption in information retrieval [http://www.springerlink.com/content/wu3g458834583125/ here] | ||

+ | |||

+ | 3- Emotion Recognition Using a Cauchy Naive Bayes Classifier [http://www.computer.org/portal/web/csdl/doi/10.1109/ICPR.2002.1044578 here] | ||

+ | |||

+ | === References === | ||

+ | |||

+ | 1. Scaling up the accuracy of naive-Bayes classifiers: A decision-tree hybrid | ||

+ | [http://www.cs.ust.hk/~qyang/537/Papers/kohavi96scaling.pdf] | ||

+ | |||

+ | 2. A comparative study of discretization methods for naive-bayes classifiers | ||

+ | [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.15.298&rep=rep1&type=pdf] | ||

+ | |||

+ | 3. Semi-naive Bayesian classifier | ||

+ | [http://www.springerlink.com/content/m4p7863g61502515/] | ||

+ | |||

+ | === [http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm K-Nearest Neighbors Classification] === | ||

+ | |||

+ | |||

+ | <math>\,K</math>-Nearest Neighbors is a very simple algorithm that classifies points based on a majority vote of the <math>\ k</math> nearest points in the feature space, with the object being assigned to the class most common among its <math>\ k</math> nearest neighbors. <math>\ k</math> is a positive integer, typically small, which can be chosen using cross-validation. If <math>\ k=1</math>, then the object is simply assigned to the class of its nearest neighbor. | ||

+ | |||

+ | 1. Ties are broken at random. | ||

+ | |||

+ | 2. If we assume the features are real, we can use the Euclidean distance in feature space. More complex distance measures such as an adaptive [http://en.wikipedia.org/wiki/Mahalanobis_distance Mahalanobis distance] that is detailed in Verdier ''et al.'''s [http://www.emse.fr/~verdier/ENSMSE%20CMP%20WP2009_14.pdf paper] can be used as well. | ||

+ | |||

+ | 3. If the features are measured in different units, we can standardize them to have mean zero and variance 1. | ||

+ | |||

+ | 4. K can be chosen by cross-validation. | ||
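+ | Points 2 and 3 above can be sketched as follows; the two-feature data (height in cm, weight in kg) are illustrative: | ||

```python
# Standardize each feature to mean 0 and variance 1, then use Euclidean
# distance in the standardized feature space. Toy data are illustrative.
import math

X = [[180.0, 70.0], [160.0, 60.0], [170.0, 65.0]]  # e.g. height (cm), weight (kg)

cols = list(zip(*X))
means = [sum(c) / len(c) for c in cols]
stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) for c, m in zip(cols, means)]
Z = [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]

def euclid(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# After standardization each feature contributes on a comparable scale.
print(euclid(Z[0], Z[1]))
```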

+ | |||

+ | ==== Advantages ==== | ||

+ | Robust to noisy training data (especially if neighbors are weighted by the inverse square of their distance).[http://people.revoledu.com/kardi/tutorial/KNN/What-is-K-Nearest-Neighbor-Algorithm.html] | ||

+ | |||

+ | Effective if the training data is large.[http://people.revoledu.com/kardi/tutorial/KNN/What-is-K-Nearest-Neighbor-Algorithm.html] | ||

+ | |||

+ | ==== Disadvantages ==== | ||

+ | |||

+ | Need to determine value of parameter K (number of nearest neighbors)[http://people.revoledu.com/kardi/tutorial/KNN/What-is-K-Nearest-Neighbor-Algorithm.html] | ||

+ | |||

+ | With distance-based learning, it is not clear which type of distance and which attributes should be used to produce the best results.[http://people.revoledu.com/kardi/tutorial/KNN/What-is-K-Nearest-Neighbor-Algorithm.html] | ||

+ | |||

+ | Misclassification rate is large when training data is small. | ||

+ | |||

+ | A major drawback is that if the frequency of one class is significantly greater than the others, the samples from the most frequent class tend to dominate the prediction of a new point. One approach to overcome this is to attach weights to the samples, for instance giving larger weights to neighbors that are closer to the new point than to those further away. | ||

+ | |||

+ | ====Note of Interest:==== | ||

+ | In k-nearest neighbours, overfitting may occur when a small k is used. In contrast to other methods, in k-nearest neighbours k = 1 is the most complex case. | ||

+ | |||

+ | ====Property[http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm#Properties]==== | ||

+ | |||

+ | The k-nearest neighbor algorithm has some strong theoretical properties. As the number of data points goes to infinity, the 1-nearest-neighbor rule is guaranteed to yield an error rate no worse than twice the Bayes error rate (the minimum achievable error rate given the distribution of the data), and the k-nearest neighbor rule is guaranteed to approach the Bayes error rate for some value of k (where k increases as a function of the number of data points). | ||

+ | See ''Nearest Neighbour Pattern Classification'', T.M. Cover and P.E. Hart, for interesting theoretical results about the algorithm, including proof of the above properties. | ||

+ | |||

+ | Effect of K on the bias-variance trade-off for KNN: | ||

+ | |||

+ | Large K: high bias, low variance.<br /> | ||

+ | Small K: low bias, high variance. | ||

+ | |||

+ | ==== Algorithm ==== | ||

+ | Here is a step-by-step description of the K-nearest neighbors (KNN) algorithm: | ||

+ | |||

+ | 1. Determine number of nearest neighbors (K-parameter). | ||

+ | |||

+ | 2. Calculate the distance between the query-instance and all the training samples. | ||

+ | |||

+ | 3. Sort the distance and determine nearest neighbors based on the 'K-th' minimum distance. | ||

+ | |||

+ | 4. Gather the category of the nearest neighbors. | ||

+ | |||

+ | 5. Use simple majority of the category of nearest neighbors as the prediction value of | ||

+ | the query instance. A random tie-break is used if each class results in the same number of neighbors. | ||

+ | |||

+ | ==== Working Example ==== | ||

+ | |||

+ | We have laboratory examination data with two attributes (a flu-severity score and a temperature score) used to classify whether a person is in good or bad condition. The next table shows the four training samples we have: | ||

+ | |||

+ | {| class="wikitable" | ||

+ | |- | ||

+ | ! X1 = having Flu | ||

+ | ! X2= having high temperature | ||

+ | ! Y = Classification | ||

+ | |- | ||

+ | | 7 | ||

+ | | 7 | ||

+ | | Bad - Condition | ||

+ | |- | ||

+ | | 7 | ||

+ | | 4 | ||

+ | | Bad - Condition | ||

+ | |- | ||

+ | | 3 | ||

+ | | 4 | ||

+ | | Good - Condition | ||

+ | |- | ||

+ | | 1 | ||

+ | | 4 | ||

+ | | Good - Condition | ||

+ | |} | ||

+ | |||

+ | |||

+ | Now we have a new patient who passes the laboratory tests with X1 = 3 and X2 = 7. Without another expensive survey, can we guess the condition (classification) of this new patient? | ||

+ | |||

+ | ==== Applying K-NN ==== | ||

+ | |||

+ | 1. Determine the parameter K = number of nearest neighbors. Let us assume that K = 3. | ||

+ | |||

+ | 2. Calculate the distance between the query-instance and all the training samples: | ||

+ | The coordinate of the query instance is (3, 7). Instead of calculating the distance, we compute the squared distance, which is faster to calculate (it avoids the square root). | ||

+ | |||

+ | {| class="wikitable" | ||

+ | |- | ||

+ | ! X1 | ||

+ | ! X2 | ||

+ | ! Square Distance to query instance (3, 7) | ||

+ | ! Rank minimum distance | ||

+ | ! Is it included in 3-Nearest neighbors? | ||

+ | |- | ||

+ | | 7 | ||

+ | | 7 | ||

+ | | (7-3)^2+(7-7)^2=16 | ||

+ | | 3 | ||

+ | | Yes | ||

+ | |- | ||

+ | | 7 | ||

+ | | 4 | ||

+ | | (7-3)^2+(4-7)^2=25 | ||

+ | | 4 | ||

+ | | No | ||

+ | |- | ||

+ | | 3 | ||

+ | | 4 | ||

+ | | (3-3)^2+(4-7)^2=9 | ||

+ | | 1 | ||

+ | | Yes | ||

+ | |- | ||

+ | | 1 | ||

+ | | 4 | ||

+ | | (1-3)^2+(4-7)^2=13 | ||

+ | | 2 | ||

+ | | Yes | ||

+ | |} | ||

+ | |||

+ | |||

+ | 3. and 4. Sort the distances (see the rank column above) and gather the category of the nearest neighbors. Notice in the second row's last column that the category of the nearest neighbor (Y) is not included, because the rank of this observation is more than 3 (= K). | ||

+ | {| class="wikitable" | ||

+ | |- | ||

+ | ! X1 | ||

+ | ! X2 | ||

+ | ! Square Distance to query instance (3, 7) | ||

+ | ! Rank minimum distance | ||

+ | ! Is it included in 3-Nearest neighbors? | ||

+ | ! Y = Category of nearest Neighbor | ||

+ | |- | ||

+ | | 7 | ||

+ | | 7 | ||

+ | | (7-3)^2+(7-7)^2=16 | ||

+ | | 3 | ||

+ | | Yes | ||

+ | | Bad | ||

+ | |- | ||

+ | | 7 | ||

+ | | 4 | ||

+ | | (7-3)^2+(4-7)^2=25 | ||

+ | | 4 | ||

+ | | No | ||

+ | | - | ||

+ | |- | ||

+ | | 3 | ||

+ | | 4 | ||

+ | | (3-3)^2+(4-7)^2=9 | ||

+ | | 1 | ||

+ | | Yes | ||

+ | | Good | ||

+ | |- | ||

+ | | 1 | ||

+ | | 4 | ||

+ | | (1-3)^2+(4-7)^2=13 | ||

+ | | 2 | ||

+ | | Yes | ||

+ | | Good | ||

+ | |} | ||

+ | |||

+ | |||

+ | 5. Use simple majority of the category of nearest neighbors as the prediction value of the query instance. | ||

+ | |||

+ | We have 2 Good and 1 Bad. Since 2 > 1, we conclude that the new patient who passes the laboratory tests with X1 = 3 and X2 = 7 falls into the Good Condition category. | ||
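+ | The worked example above can be reproduced with a minimal 3-NN sketch in Python: | ||

```python
# A minimal 3-NN sketch that reproduces the worked example above
# (squared Euclidean distances, then a majority vote among the 3 closest).
from collections import Counter

train = [((7, 7), "Bad"), ((7, 4), "Bad"), ((3, 4), "Good"), ((1, 4), "Good")]
query = (3, 7)
k = 3

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

# Steps 2-3: compute squared distances and sort
ranked = sorted(train, key=lambda t: sq_dist(t[0], query))
# Step 4: categories of the k nearest neighbours
votes = [label for _, label in ranked[:k]]
# Step 5: majority vote
prediction = Counter(votes).most_common(1)[0][0]
print(prediction)  # -> Good
```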

+ | |||

+ | ====Example in Matlab==== | ||

+ | |||

+ | % Three query points and three training points, one per class | ||

+ | sample = [.9 .8;.1 .3;.2 .6] | ||

+ | training = [0 0;.5 .5;1 1] | ||

+ | group = [1;2;3] % class labels of the training rows | ||

+ | % knnclassify (Bioinformatics Toolbox) uses k = 1 by default | ||

+ | class = knnclassify(sample, training, group) | ||

+ | |||

+ | === Boosting === | ||

+ | |||

+ | [http://en.wikipedia.org/wiki/Boosting Boosting] algorithms are a class of machine learning meta-algorithms that can improve weak classifiers. The idea is to incorporate unequal weights into the learning process, giving higher weights to misclassified points. If we have weak classifiers that do slightly better than random classification, then by assigning larger weights to the misclassified points and minimizing the new cost function with an optimal weak classifier, we can update the weights in a way related to the minimum value of the new cost function. This procedure is repeated a finite number of times, and a weighted aggregation of the generated classifiers is used as the boosted classifier. The better a generated classifier is, the larger its weight in the final classifier. | ||

+ | |||

+ | [http://www.site.uottawa.ca/~stan/csi5387/boost-tut-ppr.pdf Paper about Boosting]: | ||

+ | Boosting is a general method for improving the accuracy of any given learning algorithm. | ||

+ | This paper introduces the boosting algorithm AdaBoost, and explains the underlying theory of boosting, including an explanation of why boosting often does not suffer | ||

+ | from overfitting as well as boosting’s relationship to support-vector machines. Finally, this paper gives some examples of recent applications of boosting. | ||

+ | |||

+ | Boosting is a general method of producing a very accurate prediction rule by combining rough and moderately inaccurate "rules of thumb." Much recent work has been on the "AdaBoost" boosting algorithm and its extensions. | ||

+ | [http://www.cs.princeton.edu/~schapire/boost.html] | ||

+ | |||

+ | ==== [http://en.wikipedia.org/wiki/AdaBoost AdaBoost] ==== | ||

+ | AdaBoost (short for adaptive boosting) produces a linear combination of weak classifiers, with all the desirable properties that brings. Its output converges to the logarithm of the likelihood ratio. | ||

+ | It has good generalization properties and acts as a feature selector with a principled strategy (minimization of an upper | ||

+ | bound on the empirical error). | ||

+ | AdaBoost produces a sequence of gradually more complex classifiers. | ||

+ | |||

+ | Advantages | ||

+ | |||

+ | *Very simple to implement | ||

+ | *Feature selection on very large sets of features | ||

+ | *Fairly good generalization | ||

+ | |||

+ | Disadvantages | ||

+ | |||

+ | *Suboptimal solution for <math>\,\alpha</math> | ||

+ | *Can overfit in presence of noise | ||

+ | |||

+ | [[File:1111.JPG|200px|thumb|right|Fig1.j=1]] | ||

+ | [[File:2222.JPG|200px|thumb|right|Fig2.j=2]] | ||

+ | [[File:3333.JPG|200px|thumb|right|Fig3.j=3]] | ||

+ | [[File:4444.JPG|200px|thumb|right|Fig4.j=4]] | ||

+ | [[File:5555.JPG|200px|thumb|right|Fig5.j=5]] | ||

+ | [[File:6666.JPG|200px|thumb|right|Fig6.j=6]] | ||

+ | [[File:7777.JPG|200px|thumb|right|Fig7.j=7]] | ||

+ | [[File:8888.JPG|200px|thumb|right|Fig8.j=J]] | ||

+ | |||

+ | ==== AdaBoost Algorithm ==== | ||

+ | |||

+ | |||

+ | |||

+ | Let's first look at the adaptive boosting algorithm: | ||

+ | #Set all the weights of all points equal <math>w_i\leftarrow \frac{1}{n}</math> where we have <math>\,n</math> points. | ||

+ | #For <math>j=1,\dots, J</math> | ||

+ | ## Find <math>h_j:X\rightarrow \{-1,+1\}</math> that minimizes the weighted error <math>\,L_j</math><br><math>h_j=\mbox{argmin}_{h_j \in H} L_j </math> where <math>L_j=\frac{\sum_{i=1}^n w_i I[y_i\neq h_j(x_i)]}{\sum_{i=1}^n w_i} </math>.<br /><math>\ H </math> is a set of classifiers which need to be improved and <math>I</math> is:<br /> | ||

+ | :<math>\, I= \left\{\begin{matrix} | ||

+ | 1 & for \quad y_i\neq h_j(x_i) \\ | ||

+ | 0 & for \quad y_i = h_j(x_i) \end{matrix}\right.</math><br /> | ||

+ | ## Let <math>\alpha_j\leftarrow\log(\frac{1-L_j}{L_j})</math> | ||

+ | ## Update the weights: <math>w_i\leftarrow w_i e^{\alpha_j I[y_i\neq h_j(x_i)]}</math> | ||

+ | #The final hypothesis is <math>h(x)=\mbox{sign}\left(\sum_{j=1}^J \alpha_j h_j(x)\right)</math><br /> | ||

+ | |||

+ | The final hypothesis <math>h(x)</math> can be completely nonlinear. <br /> | ||
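+ | The algorithm above can be sketched in Python. The decision-stump base classifiers and the toy data below are illustrative assumptions, not from the notes: | ||

```python
# A compact AdaBoost sketch with one-dimensional threshold ("stump")
# base classifiers, following the algorithm above. Toy data, chosen so
# that no single stump can classify everything.
import math

X = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [1, 1, -1, -1, 1, 1]
n = len(X)

def stumps():
    """Weak classifiers h(x) = s*sign(x - t) for a grid of thresholds."""
    for t in (0.5, 1.5, 2.5, 3.5, 4.5):
        for s in (1, -1):
            yield lambda x, t=t, s=s: s * (1 if x > t else -1)

w = [1.0 / n] * n                     # step 1: equal weights
ensemble = []                         # pairs (alpha_j, h_j)
for j in range(10):                   # step 2, J = 10 rounds
    def werr(h):                      # weighted error L_j
        return sum(wi for wi, xi, yi in zip(w, X, y) if h(xi) != yi) / sum(w)
    h = min(stumps(), key=werr)       # step 2.1
    L = werr(h)
    if L == 0 or L >= 0.5:            # stop if perfect or no better than chance
        break
    alpha = math.log((1 - L) / L)     # step 2.2
    ensemble.append((alpha, h))
    w = [wi * math.exp(alpha) if h(xi) != yi else wi   # step 2.3
         for wi, xi, yi in zip(w, X, y)]

def H(x):
    """Final hypothesis: sign of the weighted vote."""
    return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1

print(sum(1 for xi, yi in zip(X, y) if H(xi) != yi), "training errors")
```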

+ | |||

+ | |||

+ | * If we have a classifier that is no better than random, <math> {L_j} = \frac{1}{2} \Rightarrow \alpha_j = 0</math>, whereas if the classifier is a little better than chance, <math> \alpha_j > 0 </math> | ||

+ | * If we have a good classifier (large <math>\,\alpha_j</math>) that misclassifies <math>{x_i}</math>, then <math>{w_i}</math> is increased heavily | ||

+ | |||

+ | When applying AdaBoost to different classifiers, step 2.1 may differ, since we can define the most appropriate misclassification error according to the problem. However, the major idea, giving higher weight to misclassified examples, does not change across classifiers. | ||

+ | |||

+ | AdaBoost works very well in practice, and there is a lot of published research on why it performs so well. One possible explanation is that it maximizes the margin of the classifiers. | ||

+ | |||

+ | We can see that in AdaBoost, if training points are accurately classified, their weights are kept unchanged for the next classifier, while if points are misclassified, their weights are raised. As a result, easier examples get classified in the first few rounds and hard examples are learned later with increasing emphasis. Finally, all the classifiers are combined through a majority vote, weighted by their accuracy, taking both the easy and the hard points into consideration. In other words, boosting focuses on the more informative or difficult points. | ||

+ | |||

+ | A short but interesting video by Kai O. Arras that shows how AdaBoost can create a strong classifier of a toy problem is available [http://www.youtube.com/watch?v=k4G2VCuOMMg here]. | ||

+ | |||

+ | ==== Training and Test Error of Boosting ==== | ||

+ | |||

+ | The most basic theoretical property of AdaBoost concerns its ability to reduce the training error. Suppose that each weighted error satisfies <math>\ L_j = \frac{1}{2}- \gamma_{j}, \gamma_{j}>0 </math>. Freund and Schapire[http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.32.8918] prove that the training error of the final hypothesis h is at most | ||

+ | <math>\ \prod_{j} 2 \sqrt{L_j(1-L_j)}= \prod_{j} \sqrt{1-4 \gamma_j^2} \leq | ||

+ | e^{-2 \sum_{j} \gamma_j^2} </math> . | ||

+ | |||

+ | Thus, if each weak classifier is slightly better than random, which means <math>\ \gamma_j > 0 </math>, the training error drops exponentially fast. | ||
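+ | The bound can be checked numerically for an illustrative sequence of edge values <math>\,\gamma_j</math> (the values are assumptions for the sketch, not from the notes): | ||

```python
# Numeric sanity check of the training-error bound above, for an
# illustrative sequence of edge values gamma_j.
import math

gammas = [0.05, 0.10, 0.15, 0.20]
L = [0.5 - g for g in gammas]

prod1 = math.prod(2 * math.sqrt(l * (1 - l)) for l in L)
prod2 = math.prod(math.sqrt(1 - 4 * g * g) for g in gammas)
bound = math.exp(-2 * sum(g * g for g in gammas))

assert abs(prod1 - prod2) < 1e-12   # the two product forms agree
assert prod1 <= bound               # and both lie below the exponential bound
print(round(prod1, 4), round(bound, 4))
```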

+ | |||

+ | |||

+ | Freund and Schapire[http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.32.8918] show that the true error, with high probability, is at most | ||

+ | |||

+ | <math>\ \hat{Pr}[H(x) \neq y]+ \tilde{O} \left(\sqrt{\frac{Td}{m}}\right) </math> | ||

+ | |||

+ | where <math>\ T </math> is the number of boosting rounds, <math>\ d </math> is the VC dimension of the base classifier class, <math>\ m </math> is the number of training samples, and <math>\ \hat{Pr}[\cdot] </math> | ||

+ | denotes the empirical probability on the training sample. | ||

+ | |||

+ | This bound suggests that AdaBoost will overfit if run for too many rounds. In fact, this sometimes happens. However, in early experiments several authors observed empirically that boosting often does not overfit even when run for thousands of rounds. Moreover, it was observed that AdaBoost would sometimes continue to drive down the true error after the training error had reached zero. | ||

+ | Therefore boosting often does not suffer from overfitting.[http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.107.3285] | ||

+ | |||

+ | ==== AnyBoost ==== | ||

+ | |||

+ | Many boosting algorithms belong to a class called AnyBoost which are gradient descent algorithms for choosing linear combinations of elements of an inner product space in order to minimize some cost function. | ||

+ | |||

+ | We are primarily interested in weighted combinations of classifiers <math>H(x) = sgn(\sum_{j=1}^J \alpha_j h_j(x))</math> | ||

+ | |||

+ | We want to find H such that the cost functional <math>C(F) = \frac{1}{m}\sum_{i=1}^m c(y_i F(x_i))</math> is minimized for a suitable cost function <math>c</math> | ||

+ | |||

+ | <math>h_j:X\rightarrow \{-1,+1\}</math> are weak base classifiers from some class <math>\ H</math> and <math> \alpha_j</math> are classifier weights. The margin of an example <math>(x_i,y_i)</math> is defined by <math>y_i H(x_i)</math>. | ||

+ | |||

+ | The base hypotheses h and their linear combinations H can be considered to be elements of an inner product function space <math>(S,\langle,\rangle)</math>. | ||

+ | |||

+ | We define the inner product as <math>\langle F,G \rangle = \frac{1}{m}\sum_{i=1}^m F(x_i) G(x_i)</math> but the AnyBoost algorithm is valid for any cost function and inner product. We have a function <math>H</math> as a linear combination of base classifiers and wish to add a base classifier h to H so that cost <math>\ C(H + \epsilon h)</math> decreases for arbitrarily small <math> \epsilon</math>. The direction we seek is found by maximizing <math>-\langle\nabla C(H),h\rangle</math> | ||

+ | |||

+ | |||

+ | AnyBoost algorithm: | ||

+ | |||

+ | #<math>\ H_0(x) = 0</math> | ||

+ | #For <math>j=0,\dots, J</math> | ||

+ | ## Find <math>h_{j+1}:X\rightarrow \{-1,+1\}</math> that maximizes the inner product <math>-\langle\nabla C(H),h_{j+1}\rangle</math> | ||

+ | ## If <math>-\langle\nabla C(H),h_{j+1}\rangle \leq 0 </math> then | ||

+ | ### Return <math>\ H_j</math> | ||

+ | ## Choose step size <math>\ \alpha_{j+1}</math> | ||

+ | ## <math>\ H_{j+1} = H_j + \alpha_{j+1} h_{j+1}</math> | ||

+ | #The final classifier is <math>\ H_{J+1}</math> | ||

+ | |||

+ | Figures 1 to 8 show how the AnyBoost algorithm works to classify the data. | ||

+ | |||

+ | |||

+ | Other voting methods, including AdaBoost, can be viewed as special cases of this algorithm. | ||

+ | |||

+ | ==== Connection between Boosting and Support Vector Machines ==== | ||

+ | |||

+ | There are some relationships between boosting and support vector machines. Freund and Schapire[http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.107.3285] show that AdaBoost and SVMs can be described in a way that reveals a similar goal of maximizing a minimal margin, but with different norms. | ||

+ | |||

+ | |||

+ | Combining boosting and SVMs has proved to be beneficial[http://www.springerlink.com/content/bg1xcjbn86349y2e/]. One method is to boost SVMs with different norms, such as the <math>\ l_1 </math> and <math>\ l_{\infty} </math> norms. While the <math>\ l_2 </math> norm SVM is the most widely used, other norms are useful in some special cases. Here are some papers which provide methods to combine boosting and SVMs: | ||

+ | |||

+ | A Method to Boost Support Vector Machines.[http://www.springerlink.com/content/bg1xcjbn86349y2e/ here] | ||

+ | |||

+ | Adaptive Boosting of Support Vector Machine Component Classifiers Applied in Face Detection.[http://www.ece.rice.edu/~sv4/papers/EBC_86_607.pdf here] | ||

+ | |||

+ | ===Boosting k-Nearest Neighbor Classifier=== | ||

+ | As the author states, although the k-nearest neighbours classifier is one of the most widely used classification methods due to several interesting features, no successful method of applying boosting to k-NN had been reported. As boosting methods have proved very effective in improving the generalization capabilities of many classification algorithms, an appropriate application of boosting to k-nearest neighbours is of great interest. In the article http://cib.uco.es/documents/TR-2008-03.pdf, Nicolás García-Pedrajas gives more details about how to combine boosting methods with the KNN method, along with a brief summary of related work on KNN and boosting; finally, an experimental comparison of the methods is given. | ||

+ | |||

+ | === References === | ||

+ | |||

+ | The Elements of Statistical Learning, Second Edition. Trevor Hastie, Robert Tibshirani, Jerome Friedman. | ||

+ | |||

+ | K-Nearest Neighbors Tutorial.[http://people.revoledu.com/kardi/tutorial/KNN/What-is-K-Nearest-Neighbor-Algorithm.html] | ||

+ | |||

+ | A Method to Boost Support Vector Machines.[http://www.springerlink.com/content/bg1xcjbn86349y2e/] | ||

+ | |||

+ | === Bagging === | ||

+ | |||

+ | ==== History ==== | ||

+ | |||

+ | Bagging ('''B'''ootstrap '''agg'''regat'''ing''') was proposed by [[Leo Breiman]] in 1994 to improve the classification by combining classifications of randomly generated training sets. See Breiman, 1994. Technical Report No. 421. | ||

+ | |||

+ | Bagging, or [http://en.wikipedia.org/wiki/Bootstrap_aggregating bootstrap aggregating], is another technique used to reduce the variance of classifiers with high variability. It exploits the fact that a bootstrap mean is approximately equal to the posterior average. It is most effective for highly nonlinear classifiers such as decision trees: because of their highly unstable nature, these classifiers stand to benefit the most from bagging. | ||

+ | |||

+ | Bagging is one of the most effective computationally intensive procedures for improving unstable estimators or classifiers, and is especially useful for high-dimensional data problems. Hard decisions create instability, and bagging has been shown to smooth such hard decisions, yielding smaller variance and mean squared error. | ||

+ | |||

+ | ==== Bagging Classifier ==== | ||

+ | The idea is to train classifiers <math>\ h_{1}(x)</math> to <math>\ h_{B}(x)</math> using B bootstrap samples from the data set. The final classification is obtained using an average or 'plurality vote' of the B classifiers as follows: | ||

+ | |||

+ | :<math>\, h(x)= \left\{\begin{matrix} | ||

+ | 1 & \frac{1}{B} \sum_{b=1}^{B} h_{b}(x) \geq \frac{1}{2} \\ | ||

+ | 0 & \mathrm{otherwise} \end{matrix}\right.</math> | ||

+ | |||

+ | Many classifiers, such as trees, already have underlying functions that estimate the class probabilities at <math>\,x</math>. An alternative strategy is to average these class probabilities instead of the final classifiers. This approach can produce bagged estimates with lower variance and usually better performance. | ||
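+ | The bagging rule above can be sketched in Python; the threshold ("stump") base learner and the toy data are illustrative assumptions, not from the notes: | ||

```python
# A minimal bagging sketch: B bootstrap samples, a weak "stump" learner
# trained on each, and a majority vote, as in the rule above.
import random

random.seed(0)
data = [(x / 10.0, 1 if x >= 5 else 0) for x in range(10)]  # separable toy data

def train_stump(sample):
    """Pick the threshold minimizing training error on the bootstrap sample."""
    best_t, best_err = 0.0, float("inf")
    for t in [p[0] for p in sample]:
        err = sum(1 for x, y in sample if (x >= t) != (y == 1))
        if err < best_err:
            best_t, best_err = t, err
    return best_t

B = 25
thresholds = []
for _ in range(B):
    boot = [random.choice(data) for _ in range(len(data))]  # sample with replacement
    thresholds.append(train_stump(boot))

def h(x):
    """Bagged classifier: average the B stump votes, predict 1 if >= 1/2."""
    votes = sum(1 for t in thresholds if x >= t)
    return 1 if votes / B >= 0.5 else 0

print([h(x / 10.0) for x in range(10)])
```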

+ | |||

+ | ==== Example: Ozone data ==== | ||

+ | This example illustrates the basic principles of bagging.[http://en.wikipedia.org/wiki/Bootstrap_aggregating Ozone Data] | ||

+ | |||

+ | === Boosting vs. Bagging === | ||

+ | |||

+ | • Bagging doesn’t work so well with stable models. Boosting might still help. | ||

+ | |||

+ | • Boosting might hurt performance on noisy datasets. Bagging doesn’t have this problem. | ||

+ | |||

+ | • In practice bagging almost always helps. | ||

+ | |||

+ | • On average, boosting usually helps more than bagging, but it is also more common for boosting to hurt performance. | ||

+ | |||

+ | • In boosting, the weights of misclassified points grow exponentially. | ||

+ | |||

+ | • Bagging is easier to parallelize. | ||

+ | |||

+ | ==== Reference ==== | ||

+ | |||

+ | 1. CS578 Computer Science Dept., Cornell University, Fall 2004 | ||

+ | |||

+ | 2. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants | ||

+ | [http://www.springerlink.com/content/l006m1614w023752/] | ||

+ | |||

+ | 3. Bagging predictors | ||

+ | [http://www.springerlink.com/content/l4780124w2874025/] | ||

+ | |||

+ | ====Example==== | ||

+ | An example comparing the bagging and boosting methods is given at http://www.doiserbia.nb.rs/ft.aspx?id=1820-02140602057M | ||

+ | |||

+ | ===Decision Trees=== | ||

+ | |||

+ | A "decision tree" is used as a visual and analytical decision support tool, in which the expected values of competing alternatives are calculated. It uses the principle of divide and conquer for classification. Decision trees have traditionally been created manually. Trees can be used for classification, regression, or both; they map the features of a decision problem onto a conclusion, or label. | ||

+ | We fit a tree model by minimizing some measure of impurity. For a single covariate <math>\,X_{1}</math> we choose a point t on the real line that splits it into two disjoint sets R1 = <math>(-\infty,t]</math>, R2 = <math>(t,\infty)</math> in a way that minimizes impurity. | ||

+ | |||

+ | We denote by <math> \hat p_{s}(j) </math> the proportion of observations in <math>\ R_{s}</math> for which <math>\ Y_{i} = j</math>. | ||

+ | |||

+ | |||

+ | <math> \hat p_{s}(j) = \frac{\sum_{i = 1}^{n} I(Y_{i} = j,X_{i} \in R_{s})}{\sum_{i = 1}^{n} I(X_{i} \in R_{s})}</math> | ||
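+ | The split selection above can be sketched in Python; Gini impurity and the toy data are illustrative choices (the notes leave the impurity measure unspecified): | ||

```python
# A sketch of the single-covariate split above: compute the class
# proportions p_s(j) in each half-line and pick the threshold t that
# minimizes a weighted Gini impurity.

data = [(0.5, 0), (1.0, 0), (1.5, 0), (2.0, 1), (3.0, 1), (4.0, 1), (2.5, 0)]

def proportions(region):
    """p_s(j): proportion of observations in the region with Y_i = j."""
    labels = [y for _, y in region]
    return {j: labels.count(j) / len(labels) for j in set(labels)}

def gini(region):
    return 1.0 - sum(p * p for p in proportions(region).values())

def split_cost(t):
    r1 = [(x, y) for x, y in data if x <= t]   # R1 = (-inf, t]
    r2 = [(x, y) for x, y in data if x > t]    # R2 = (t, inf)
    n = len(data)
    return len(r1) / n * gini(r1) + len(r2) / n * gini(r2)

candidates = sorted(set(x for x, _ in data))[:-1]   # keep R2 non-empty
best_t = min(candidates, key=split_cost)
print(best_t)  # -> 1.5
```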

+ | |||

+ | ==== CART ==== | ||

+ | Classification and regression trees (CART) is a non-parametric Decision tree learning technique that produces either classification or regression trees, depending on whether the dependent variable is categorical or numeric, respectively. (Wikipedia) | ||

+ | |||

+ | Classification and Regression Trees is a classification method which uses historical data to construct so-called decision trees, which are then used to classify new data. In order to use CART we need to know the number of classes a priori. ([http://edoc.hu-berlin.de/master/timofeev-roman-2004-12-20/PDF/timofeev.pdf]) | ||

+ | |||

+ | CART methodology was developed in the 1980s by Breiman, Friedman, Olshen and Stone in their monograph "Classification and Regression Trees" (1984). For building decision trees, CART uses a so-called learning sample - a set of historical data with pre-assigned classes for all observations. For example, the learning sample for a credit scoring system would be fundamental information about previous borrowers (variables) matched with actual | ||

+ | payoff results (classes). ([http://edoc.hu-berlin.de/master/timofeev-roman-2004-12-20/PDF/timofeev.pdf]) | ||

+ | |||

+ | The official Statistics Toolbox of Matlab provides CART. Here is simple code for training and evaluating a CART: | ||

+ | |||

+ | % Tree Construction - Learning Phase - Statistics Toolbox Built-in Function | ||

+ | tree = classregtree(data_train,labels_train,'method','classification'); | ||

+ | % Tree in Action - Recalling Phase - Statistics Toolbox Built-in Function | ||

+ | labels_test_hat = tree.eval(data_test); | ||

+ | % Confusion Matrix Estimation - Statistics Toolbox Built-in Function | ||

+ | C = confusionmat(labels_test,labels_test_hat); | ||

+ | CCR = sum(diag(C))/sum(sum(C)); % correct classification rate | ||

+ | |||

+ | Here are some advantages of CART (from [http://edoc.hu-berlin.de/master/timofeev-roman-2004-12-20/PDF/timofeev.pdf]): | ||

+ | |||

+ | 1. CART is nonparametric. Therefore this method does not require specification of any functional form. | ||

+ | |||

+ | 2. CART does not require variables to be selected in advance. CART algorithm will itself identify the most significant variables and eliminate non-significant ones. | ||

+ | |||

+ | 3. CART results are invariant to monotone transformations of its independent variables. Changing one or several variables to its logarithm or square root will not change the structure of the tree. Only the splitting values (but not variables) in the questions will be different. | ||

+ | |||

+ | 4. CART can easily handle outliers. Outliers can negatively affect the results of some statistical models, like Principal Component Analysis (PCA) and linear regression, but the splitting algorithm of CART easily handles noisy data: CART isolates the outliers in a separate node. This property is very important, because financial data very often have outliers due to financial crises or defaults. | ||

+ | |||

+ | ==== Examples==== | ||

+ | [[image:Decision_trees.GIF]] | ||

+ | |||

+ | In the classification tree above, we classify the samples using two features, <math>\ x_1 </math> and <math>\ x_2 </math>. First, we split the data according to the <math>\ x_1 </math> feature. Then we make a more accurate classification using the <math>\ x_{2} </math> feature. | ||

+ | |||

+ | [[image:Decision_Square.GIF]] | ||

+ | |||

+ | A classification tree can also be viewed as a partition of the feature space into squares, as above. The classification rules can be made more and more complex to drive the training error rate to zero. | ||

+ | |||

+ | |||

+ | Extension: | ||

+ | [http://www.mindtools.com/dectree.html Decision Tree Analysis from Mind Tools] | ||

+ | |||

+ | ''useful link'': | ||

+ | |||

+ | Algorithm, Overfitting, Examples:[http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/mlbook/ch3.pdf],[http://robotics.stanford.edu/people/nilsson/MLDraftBook/ch6-ml.pdf],[http://www.autonlab.org/tutorials/dtree18.pdf] | ||

+ | |||

+ | A decision tree consists of three types of nodes: | ||

+ | |||

+ | 1. Decision nodes - commonly represented by squares<br /> | ||

+ | 2. Chance nodes - represented by circles<br /> | ||

+ | 3. End nodes - represented by triangles | ||

+ | |||

+ | ====Reference articles on decision tree method==== | ||

+ | (Based on S. Appavu alias Balamurugan & R. Rajaram, "Effective solution for unhandled exception in decision tree induction algorithms") | ||

+ | |||

+ | =====Various improvements over the original decision tree algorithm===== | ||

+ | |||

+ | 1. ID3 algorithm: Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81–106.<br /> | ||

+ | 2. ID4 algorithm: Utgoff, P. E. (1989). Incremental induction of decision trees. Machine Learning, 4, 161–186.<br /> | ||

+ | 3. ID5 algorithm: Utgoff, P. E. (1988). ID5: An Incremental ID3. Proceedings of the fifth international conference on machine learning. San Mateo, CA: Morgan Kaufmann Publishers. pp. 107–120.<br /> | ||

+ | 4. ITI algorithm: Utgoff, P. E. (1994). An improved algorithm for incremental induction of decision trees. In Proceedings of the 11th international conference on machine learning, pp. 318–325.<br /> | ||

+ | 5. C4.5 algorithm: Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers.<br /> | ||

+ | 6. CART algorithm: Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and regression trees. Monterey, CA: Wadsworth and Brooks.<br /> | ||

+ | |||

+ | =====Various strategies for decision tree improvements===== | ||

+ | |||

+ | 1. Buntine, W. (1992). Learning classification trees. Statistics and Computing, 2, 63–73.<br /> | ||

+ | 2. Hartmann, C. R. P., Varshney, P. K., Mehrotra, K. G., & Gerberich, C. L. (1982). Application of information theory to the construction of efficient decision trees. IEEE Transactions on Information Theory, 28, 565–577.<br /> | ||

+ | 3. Kohavi, R., & Kunz, C. (1997). Option decision trees with majority votes. In Proceedings of the 14th international conference on machine learning, Morgan Kaufmann.<br /> | ||

+ | 4. Mickens, J., Szummer, M., & Narayanan, D. (2007). Snitch: Interactive decision trees for troubleshooting misconfigurations. In Proceedings of the second international workshop on tackling computer systems problems with machine learning techniques.<br /> | ||

+ | 5. Quinlan, J. R. (1987). Simplifying decision trees. International Journal of Man–Machine Studies, 27, 221–234.<br /> | ||

+ | 6. Utgoff, P. E. (2004). Decision tree induction based on efficient tree restructuring. International Journal of Machine Learning, Springer, pp. 5–44.<br /> | ||

+ | |||

+ | ==== Common Node Impurity Measures ==== | ||

+ | |||

+ | Some common node impurity measures are: | ||

+ | |||

+ | * Misclassification error: | ||

+ | |||

+ | <math> 1 - \max_{j} \hat p_{s}(j) </math> | ||

+ | |||

+ | * Gini Index: | ||

+ | |||

+ | <math> \sum_{j \neq i} \hat p_{s}(j)\hat p_{s}(i)</math> | ||

+ | |||

+ | * Cross-entropy: | ||

+ | |||

+ | <math> - \sum_{j = 1}^{K} \hat p_{s}(j) \log \hat p_{s}(j)</math> | ||

+ | |||
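The three impurity measures above can be sketched in a few lines of code (an illustrative Python sketch; the page's other examples use MATLAB, and the function names here are our own):

```python
import math

def misclassification(p):
    # p: list of estimated class proportions p_hat_s(j) at node s
    return 1.0 - max(p)

def gini(p):
    # sum over pairs j != i of p(j) * p(i), which equals 1 - sum of p(j)^2
    return 1.0 - sum(pj * pj for pj in p)

def cross_entropy(p):
    # -sum of p(j) * log p(j), skipping empty classes
    return -sum(pj * math.log(pj) for pj in p if pj > 0)

# A pure node has zero impurity; a 50/50 node is maximally impure.
print(misclassification([0.5, 0.5]), gini([0.5, 0.5]), cross_entropy([0.5, 0.5]))
```

All three measures vanish at a pure node and are maximized at a uniform class distribution, which is why any of them can drive the splitting rule.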

+ | ====Advantages==== | ||

+ | |||

+ | Amongst decision support tools, decision trees (and [[influence diagrams]]) have several advantages: | ||

+ | |||

+ | Decision trees: | ||

+ | * Are simple to understand and interpret. People are able to understand decision tree models after a brief explanation. | ||

+ | * Have value even with little hard data. Important insights can be generated based on experts describing a situation (its alternatives, probabilities, and costs) and their preferences for outcomes. | ||

+ | * Use a [[white box (software engineering)|white box]] model. If a given result is provided by a model, the explanation for the result is easily replicated by simple math. | ||

+ | * Can be combined with other decision techniques, such as Net Present Value calculations, PERT 3-point estimations, and linear distributions of expected outcomes. | ||

+ | |||

+ | |||

+ | ====References==== | ||

+ | |||

+ | 1. SVM Soft Margin Classifiers: Linear Programming versus Quadratic Programming | ||

+ | [http://www.mitpressjournals.org/doi/abs/10.1162/0899766053491896] | ||

+ | |||

+ | 2. On the generalization of soft margin algorithms | ||

+ | [http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1035123] | ||

+ | |||

+ | 3. Support Vector Machine Soft Margin Classifiers: Error Analysis | ||

+ | [http://portal.acm.org/citation.cfm?id=1005332.1044698] | ||

+ | |||

+ | == ''' Project Presentations - November 23, 2010 ''' == | ||

+ | |||

+ | === Project 14 - V-C Dimension, Mistake Bounds, and Littlestone Dimension === | ||

+ | |||

+ | To summarize, the goal of this presentation is to shed light on the VC dimension (vcdim), mistake bounds, and the Littlestone dimension (ldim). Walking through each, we see why they are useful for classification, why they are very difficult to compute, and why we might therefore want to consider other approaches. | ||

+ | |||

+ | ==== Introduction ==== | ||

+ | |||

+ | We begin by defining what we mean by learning. Let X be a fixed set. For the sake of simplicity, we will assume that X is a finite set or an n-dimensional Euclidean space. A concept class is a non-empty set <math>C \subseteq 2^X</math>. We call an element of C a concept. Let <math>c \in C</math>; then <math>I_c(x) = \begin{cases} 1 & \text{if } x \in c \\ 0 & \text{otherwise} \end{cases}</math>. We call <math>sam(x) = \{(x_1, I_c(x_1)), \dots , (x_m, I_c(x_m))\}</math> the m-sample of a concept <math>c \in C</math> generated by <math>x \subseteq X</math>. The sample space <math>S_C</math> is the set of all m-samples, over all <math>m</math>, all <math>c \in C</math>, and all <math>x \subseteq X</math>. | ||

+ | |||

+ | Let <math>A_{C,H}</math> denote the set of all functions <math>A:S_C \rightarrow H</math>, where H is the hypothesis space. We call <math>h \in H</math> a hypothesis. <math>A \in A_{C,H}</math> is consistent if its hypothesis always agrees with the sample. Let P be the probability distribution of X; then the error of A for c is given by <math>err_{A,C,P}(x) = P(c \neq h)</math>, the probability that the concept and the hypothesis disagree. | ||

+ | |||

+ | For example, our data over the real numbers would be classified as 1 if it is in the concept class, and 0 otherwise. Our hypothesis space might be the set of all intervals over the real number line. | ||

+ | |||

+ | An obvious way of defining learning is that we want our algorithm (<math>A_{C,H}</math>) to achieve lower error with higher probability as we increase the number of elements in our sample. For example, each class 0 and class 1 sample from the real number line should give us a better half space separating the classes. Such an algorithm is called probably approximately correct, or uniformly learnable. More formally, let <math>m(\epsilon, \delta)</math> be an integer valued function. We say that <math>A \in A_{C,H}</math> is a learning function with respect to a probability distribution P over X with sample size <math>m(\epsilon, \delta), 0 \le \epsilon, \delta \le 1</math>, if <math>P(\{x \subseteq X : err_{A,C,P}(x) > \epsilon\}) < \delta</math>. In this case we say that C is uniformly learnable by H under P. If A is a learning function for all probability distributions P, then A is called a learning function and C is uniformly learnable by H. | ||

+ | |||

+ | An example of this definition is the use of rectangles to bound the area classified as 1 in <math>R^2</math>. The edges of the rectangle are determined by the minimum and maximum values of the points labelled 1. We can show that rectangles satisfy our definition of uniformly learnable with <math>m(\epsilon,\delta) = \frac{4}{\epsilon} \ln \frac{4}{\delta}</math>. The proof will be left as an exercise (Hint: use rectangles around the edges of our first rectangle to estimate the error). | ||

+ | |||
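To get a feel for the sample-size bound above, here is a small illustrative computation (a Python sketch; the helper name is our own):

```python
import math

def sample_size(eps, delta):
    # m(eps, delta) = (4 / eps) * ln(4 / delta), rounded up to an integer
    return math.ceil((4.0 / eps) * math.log(4.0 / delta))

# Error at most 0.1 with failure probability at most 0.05:
print(sample_size(0.1, 0.05))  # -> 176
```

Note how the required sample size grows only logarithmically in <math>1/\delta</math> but linearly in <math>1/\epsilon</math>.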

+ | ==== VC Dimension ==== | ||

+ | |||

+ | With formalities aside, we can now begin discussion of the Vapnik-Chervonenkis dimension (vcdim). Let H be a family of subsets of some universe X. The vcdim of H, vcdim(H), is the size of the largest subset S of X such that <math>\forall T \subseteq S, \exists c[T] \in H</math> such that <math>S \cap c[T] = T</math>. The vcdim is essentially the size of the largest set that our hypothesis class can break up into every possible separation of labels 0 and 1. | ||

+ | |||

+ | Example 1. | ||

+ | |||

+ | Problem: Let X be the real number line, and H be the set of intervals over the real number line. What is the vcdim(H)? | ||

+ | |||

+ | Solution: To find a lower bound for the vcdim, all we need is an example. Consider two points, a and b, on the real number line, <math>a < b</math>. We can create four intervals, [a,a], [b,b], [a,b], and the empty interval, to include a alone, b alone, both a and b, and no points, respectively. Thus, the lower bound for the vcdim is 2. What about an upper bound? We have to make a more general argument. Let <math>S \subseteq X</math>, and let a, b, <math>c \in S, a < b < c</math>. Notice that no interval can cover a and c without also covering b. Thus, <math>vcdim(H) \le 2</math>, and so vcdim(H) = 2. | ||

+ | |||
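The interval example can also be checked by brute force: enumerate a family of intervals and test whether every labelling of a point set is realized (a Python sketch under our own naming; the endpoint grid is an arbitrary choice):

```python
def shatters(points, hypotheses):
    """Return True if the hypothesis class realizes every 0/1 labelling
    of the given points."""
    labellings = {tuple(h(x) for x in points) for h in hypotheses}
    return len(labellings) == 2 ** len(points)

# Hypotheses: closed intervals [a, b] with endpoints on a small grid,
# plus one empty interval (a > b), each represented as a 0/1 labelling rule.
grid = [i / 10 for i in range(-10, 31)]
intervals = [(a, b) for a in grid for b in grid if a <= b] + [(1.0, 0.0)]
hyps = [lambda x, a=a, b=b: 1 if a <= x <= b else 0 for (a, b) in intervals]

print(shatters([0.0, 1.0], hyps))        # True: two points are shattered
print(shatters([0.0, 1.0, 2.0], hyps))   # False: (1, 0, 1) is unrealizable
```

The failure on three points is exactly the "no interval covers a and c but not b" argument above, confirming vcdim = 2.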

+ | Example 2. | ||

+ | |||

+ | Problem: Let <math>X = R^2</math>, H be the set of half spaces on X. What is the vcdim(H)? | ||

+ | |||

+ | Solution: Take three points a, b, and c in general position. Any labelling that puts exactly one point in class 1 (or exactly one in class 0) can be realized by a half space whose boundary separates that point from the other two, and labelling all three points the same is realized by a half space containing all of them or none of them. Hence three points can be shattered and <math>vcdim(H) \ge 3</math>. For the upper bound, consider any four points: either one lies inside the triangle formed by the other three, or the four form a convex quadrilateral; in both cases some labelling cannot be realized by a half space. The details are left as an exercise. | ||

+ | |||

+ | Example 3. | ||

+ | |||

+ | Problem: We wish to generalize the above problem to R^n. | ||

+ | |||

+ | Solution: The answer turns out to be n+1, generalizing Example 2. We can construct the lower bound by taking the n unit vectors together with the origin as our points. For any labelling, a suitable half space can be chosen to contain exactly the points labelled 1, handling the cases where the origin is or is not labelled 1 separately. To prove the upper bound, we need Radon's Theorem from geometry: | ||

+ | |||

+ | Radon's Theorem: Any set <math>A \subseteq R^n</math> of size <math>\ge n + 2</math> can be partitioned into B and <math>A \setminus B</math> such that <math>CH(B) \cap CH(A \setminus B) \neq \emptyset</math>, where CH(X) denotes the convex hull of X. | ||

+ | |||

+ | We can see how this is applicable by noticing that half spaces are convex, so the convex hull of any set of points within a half space lies in that half space. Given any n+2 points, Radon's Theorem yields a partition into B and <math>A \setminus B</math> whose convex hulls intersect; hence no half space can label all of B as 1 and all of <math>A \setminus B</math> as 0, contradicting shattering. Thus, vcdim(H) = n+1. | ||

+ | |||

+ | So, now that we understand the vc dimension, why is it useful? Here are some example results: | ||

+ | |||

+ | Theorem: H is uniformly learnable if and only if the vcdim(H) is finite. | ||

+ | |||

+ | That's a pretty strong theorem. The proof is contained in "Learnability and the Vapnik-Chervonenkis Dimension." The vc dimension also gives us a lot of nice theorems about error bounds. Looking at Wikipedia, | ||

+ | http://en.wikipedia.org/wiki/Vcdim, we find one such bound: | ||

+ | |||

+ | <math>\text{Test Error} \le \text{Training Error} + \sqrt{\frac{d(\log(2n/d) + 1) - \log(\eta/4)}{n}}</math>, which holds with probability <math>1-\eta</math>, where <math>d</math> is the VC dimension and <math>n</math> the number of training samples. | ||

+ | |||

+ | However, the vcdim does have a very large flaw: | ||

+ | |||

+ | Theorem: The vc dimension problem is LOGNP-complete. | ||

+ | |||

+ | Proof Sketch: We use the characterization of NP-complete problems to characterize LOGNP-complete problems. Then using this, we show a polynomial-time reduction from the characterization to the vc dimension problem. | ||

+ | |||

+ | This basically tells us that it is very hard to compute the vc dimension. So, now that we have all these nice results but cannot really use them, what do we do? | ||

+ | |||

+ | ==== Mistake Bounds ==== | ||

+ | |||

+ | The mistake bound of an algorithm A on a hypothesis class H is: | ||

+ | |||

+ | <math>\sup_{x_1, \dots , x_n} \; \sup_{h \in H} \; (\text{number of errors } A \text{ makes on } (x_1, h(x_1)), \dots , (x_n, h(x_n)))</math> | ||

+ | |||

+ | Example: | ||

+ | |||

+ | Problem: The adversary chooses a number between 1 and n, and after each wrong guess we are told whether the guess was too high or too low. What is an algorithm to defeat the adversary, and what is its mistake bound? | ||

+ | |||

+ | Solution: We can use binary search to obtain a mistake bound of <math>\lceil \log_2 n \rceil</math>. | ||

+ | |||
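The binary-search strategy for the guessing game can be sketched as follows (an illustrative Python sketch; names are our own). The adversary's feedback after each wrong guess is "too high" or "too low":

```python
def guess_number(secret, n):
    """Binary search for the adversary's number in {1, ..., n}.
    Returns the number of wrong guesses (mistakes) made."""
    lo, hi, mistakes = 1, n, 0
    while True:
        guess = (lo + hi) // 2
        if guess == secret:
            return mistakes
        mistakes += 1
        if guess < secret:   # feedback: "too low"
            lo = guess + 1
        else:                # feedback: "too high"
            hi = guess - 1

n = 1000
worst = max(guess_number(s, n) for s in range(1, n + 1))
print(worst)  # the worst case stays within ceil(log2 n) = 10 mistakes
```

Checking every possible secret confirms the logarithmic mistake bound empirically.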

+ | The mistake bound has a relatively natural meaning: given a sequence of points, how many mistakes will our algorithm make? In fact, we can find a nice bound on the mistake bound. We say an algorithm is realizable if there exists a hypothesis in the class which is consistent with the data. If the algorithm is realizable, then we get the following result. | ||

+ | |||

+ | Theorem: For every finite domain X and finite hypothesis class H, the mistake bound is bounded above by <math>\log_2 |H|</math>. | ||

+ | |||

+ | Proof Sketch: Each time we receive a point, we label it according to the majority vote of the hypotheses remaining. If the label is incorrect, we remove every hypothesis that voted with the majority — at least half of those remaining. We can halve the hypothesis set at most <math>\log_2 |H|</math> times before a single consistent hypothesis remains. This algorithm is called the majority (or halving) algorithm. | ||

+ | |||
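The majority (halving) algorithm from the proof sketch can be illustrated on a small finite class (a Python sketch; the threshold class and all names are our own choices):

```python
def halving_algorithm(hypotheses, xs, target):
    """Predict by majority vote of the remaining hypotheses; on each
    mistake, discard every hypothesis that voted with the majority."""
    H = list(hypotheses)
    mistakes = 0
    for x in xs:
        votes = sum(h(x) for h in H)
        prediction = 1 if 2 * votes >= len(H) else 0
        truth = target(x)
        if prediction != truth:
            mistakes += 1
            H = [h for h in H if h(x) == truth]
    return mistakes

# A small finite class: 16 threshold rules on {0, ..., 15}; the target
# (the adversary's consistent hypothesis) is one of them.
hyps = [lambda x, t=t: 1 if x >= t else 0 for t in range(16)]
target = hyps[11]
m = halving_algorithm(hyps, range(16), target)
print(m)  # at most log2(16) = 4 mistakes, since each mistake halves H
```

Each mistake removes at least half of the surviving hypotheses while the target always survives, which is exactly the <math>\log_2 |H|</math> bound.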

+ | This result almost extends to the unrealizable case using the weighted majority algorithm of Littlestone and Warmuth. | ||

+ | |||

+ | Though on the surface the mistake bound seems to be a completely different problem from the vc dimension, it turns out that they are related, as the following theorem shows: | ||

+ | |||

+ | Theorem: <math>vcdim(H) \le \text{mistake bound}(H)</math>. | ||

+ | |||

+ | Proof Sketch: Let vcdim(H) = k, and let <math>\{v_1, \dots , v_k\}</math> be a set of points shattered by H. Since H realizes every labelling of these k points, an adversary can present them one at a time and choose the opposite of the algorithm's prediction each time, forcing k mistakes. | ||

+ | |||

+ | Unfortunately, it turns out that finding the mistake bound is just as hard as finding the vc dimension. But it did give us a nice upper bound on the vc dimension. There exist approximation algorithms which estimate the mistake bound, but they are dependent on the vc dimension. So, let us consider a final option. | ||

+ | |||

+ | ==== Littlestone Dimension ==== | ||

+ | |||

+ | An instance-labelled tree is a binary tree whose nodes are labelled by instances; the edge to the left child is labelled 0, and the edge to the right child is labelled 1. An instance-labelled tree is shattered by a class H if, for every root-to-leaf path <math>(x_1, y_1), \dots , (x_d, y_d)</math>, there is some <math>h \in H</math> with <math>h(x_i) = y_i</math> for all i. | ||

+ | |||

+ | Example: A tree with only left paths and one right edge for each root to leaf node path is an instance-labelled tree which can be shattered by the single point hypothesis set (labelling only a single point 1). | ||

+ | |||

+ | For a non-empty class H, Ldim(H) is the largest integer d such that there exists a full binary instance-labelled tree of depth d that is shattered by H. | ||

+ | |||

+ | Example: | ||

+ | |||

+ | Problem: What is ldim(<math>H_{sing}</math>)? | ||

+ | |||

+ | Solution: <math>ldim(H_{sing}) = 1</math>. A depth-1 tree whose root is labelled by a single point x can be shattered: the hypothesis labelling x as 1 handles the path labelled 1, and a hypothesis centred on a different point handles the path labelled 0. No tree of depth 2 can be shattered, since some root-to-leaf path would then require a hypothesis labelling two distinct points 1, or assigning both labels to the same point. | ||

+ | |||

+ | Theorem: The optimal mistake bound equals the Littlestone dimension. | ||

+ | |||

+ | Proof Sketch: An adversary can force any algorithm to make ldim(H) mistakes by walking down a shattered tree of maximal depth, always branching opposite to the algorithm's prediction; conversely, Littlestone's Standard Optimal Algorithm makes at most ldim(H) mistakes. | ||

+ | |||

+ | Since ldim is equal to the optimal mistake bound, results that apply to ldim also apply to the mistake bound, and thus bound the vcdim. In "Agnostic Online Learning," Ben-David et al. show that there exists a set of experts, of size governed by ldim, which can be run with their Expert algorithm to find a hypothesis that makes at most as many errors as the best hypothesis in the hypothesis class. Thus, ldim has many uses. Unfortunately, ldim is also very hard to compute. As far as my research has shown, there currently exist no approximation algorithms for ldim. Thus, continuing to research ldim's complexity is a natural next direction. | ||

+ | |||

+ | ==== Citations and Further Reading ==== | ||

+ | |||

+ | 1. Ben-David, Shai, et al. "Agnostic Online Learning." | ||

+ | |||

+ | 2. Blumer, Anselm, et al. "Learnability and the Vapnik-Chervonenkis Dimension." ACM 0004-5411. pp. 929-965 (1989). | ||

+ | |||

+ | 3. Littlestone, Nick. "Learning Quickly When Irrelevant Attributes Abound: A New Linear-threshold Algorithm." Machine Learning, 2. pp. 285-318. Kluwer Academic Publishers, Boston (1988). | ||

+ | |||

+ | 4. Papadimitriou, Christos H. and Mihalis Yannakakis. "On Limited Nondeterminism and the Complexity of the V-C Dimension." Journal of Computer and System Sciences, 53. pp. 161-170 (1996). | ||

+ | |||

+ | |||

+ | == Supervised PCA - December 3, 2010== | ||

+ | As discussed in our very last (unofficial) meeting, we can briefly describe a possible approach for turning PCA into a supervised dimensionality reduction method. This approach is based on the Hilbert-Schmidt Independence Criterion, or HSIC for short. | ||

+ | |||

+ | Let's assume that we want to map from a <math>\ D</math> dimensional space to a <math>\ d</math> dimensional one, using the following linear mapping: | ||

+ | <math>\begin{align}Z=uX \end{align}</math> | ||

+ | |||

+ | Where <math>\ X</math> is a <math>D\times n</math> matrix of the data points in the primary space, <math>\ Z</math> is a <math>d\times n</math> matrix of the same data points in the reduced-dimension space, and <math>\ u</math> is the <math>d\times D</math> mapping matrix. <math>\ n</math> is the total number of available data samples. | ||

+ | |||

+ | Here is the dual optimization problem we would like to solve: (you may find details on the primal problem in this paper: Zhang, Y., Zhi-Hua, Z., "Multi-label dimensionality reduction via dependence maximization", | ||

+ | Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence (2008)) | ||

+ | |||

+ | <math>\begin{align} | ||

+ | \max~&tr(u^T X H B H X^T u)\\ | ||

+ | s.t.~&u^Tu=I | ||

+ | \end{align}</math> | ||

+ | |||

+ | Where <math>\ H</math> is a centering matrix defined as <math>H=I-\frac{1}{n}ee^T</math>, <math>\ e</math> is an <math>n\times 1</math> vector of all ones, and <math>\ B</math> is the matrix of transformed target labels (class labels) under an arbitrarily chosen kernel. | ||

+ | |||

+ | If one considers <math>\ S</math> as a <math>1\times D</math> vector of the eigenvalues of <math>\ XHBHX^T</math> in descending order, so that <math>s_1>s_2>\ldots>s_D</math>, where <math>\ s_i</math> is the ith element of <math>\ S</math>, then the optimal solution of this optimization problem is the matrix whose columns are the <math>\ d</math> eigenvectors corresponding to the first <math>\ d</math> eigenvalues. | ||

+ | |||

+ | [[File:012DR-PCA.jpg|300px|thumb|right|Dimensionality Reduction of the 0-1-2 Data, Using PCA]] | ||

+ | [[File:012DR-SPCA.jpg|300px|thumb|right|Dimensionality Reduction of the 0-1-2 Data, Using Supervised PCA]] | ||

+ | |||

+ | And here is a Matlab function for supervised PCA, based on HSIC. | ||

+ | function [Z,u] = HSICPCA(X,Y,k) | ||

+ | %---------- Supervised Principal Component Analysis | ||

+ | %- X: samples, q*p | ||

+ | %- Y: class labels, q*1 and \in{1,2,...,C} | ||

+ | [q,p] = size(X); | ||

+ | C = max(Y); | ||

+ | X = sortrows([X,Y],p+1); | ||

+ | Y = X(:,p+1); | ||

+ | X = X(:,1:p); | ||

+ | B = zeros(q,q); | ||

+ | Q = zeros(1,C); | ||

+ | for i = 1:C | ||

+ | Q(i) = sum(Y==i); | ||

+ | B(sum(Q(1:i-1))+1:sum(Q(1:i)),sum(Q(1:i-1))+1:sum(Q(1:i))) = ones(Q(i),Q(i)); | ||

+ | end | ||

+ | H = eye(q) - ones(q,q)/q; | ||

+ | gamma = X'*H*B*H*X; | ||

+ | [V,D] = eig(gamma); | ||

+ | D = diag(abs(D)); | ||

+ | D = [D,(1:p)']; | ||

+ | D = sortrows(D,-1); | ||

+ | ind = zeros(1,p); | ||

+ | ind(D(1:k,2)) = 1; | ||

+ | ind = logical(ind); | ||

+ | u = V(:,ind); | ||

+ | Z = X*u; | ||

+ | |||

+ | and PCA | ||

+ | |||

+ | function [Y,X_h,w] = PCA(X,d) | ||

+ | %---------- Principal Component Analysis | ||

+ | %- X: p*q, Matrix of Samples (p: dimension of the space, q: no. of samples) | ||

+ | %- d: 1*1, Dimension of the New Space | ||

+ | %- Y: d*q, Mapped Data into the New Space | ||

+ | %- w: p*d, Matrix of Mapping | ||

+ | %- X_h: p*q, Reconstructed Data, Using the d Largest Eigen Values | ||

+ | q = length(X(1,:)); | ||

+ | mu = mean(X,2); | ||

+ | X_ao = X - mu*ones(1,q); | ||

+ | [U,S,V] = svd(X_ao); | ||

+ | X_h = U(:,1:d)*S(1:d,1:d)*V(:,1:d)'+mu*ones(1,q); | ||

+ | w = U(:,1:d); | ||

+ | Y = w'*X_ao; |

## Latest revision as of 09:45, 30 August 2017

## Contents

- 1 Schedule of Project Presentations
- 2 Proposal Fall 2010
- 3 Mark your contribution here
- 4 Editor sign up
- 5 Digest
- 6 **Reference Textbook**
- 7 **Classification - September 21, 2010**
- 8 **Linear and Quadratic Discriminant Analysis**
- 9 Further reading
- 10 **Linear and Quadratic Discriminant Analysis cont'd - September 23, 2010**
- 11 Trick: Using LDA to do QDA - September 28, 2010
- 12 **Reference**
- 13 Principal Component Analysis - September 30, 2010
- 14 Fisher's (Linear) Discriminant Analysis (FDA) - Two Class Problem - October 5, 2010
- 15 Fisher's (Linear) Discriminant Analysis (FDA) - Multi-Class Problem - October 7, 2010
- 16 Random Projection
- 17 Linear and Logistic Regression - October 12, 2010
- 18 Lecture summary
- 19 Logistic Regression Cont. - October 14, 2010
- 20 **Multi-Class Logistic Regression & Perceptron - October 19, 2010**
- 21 Perceptron Learning Algorithm and Feed Forward Neural Networks - October 21, 2010
- 21.1 Lecture Summary
- 21.2 Perceptron
- 21.3 An example of the determination on learning rate
- 21.4 Universal Function Approximator
- 21.5 Feed-Forward Neural Network
- 21.6 The Neural Network Toolbox in Matlab
- 21.7 Deep Neural Network
- 21.8 Neural Networks in Practice
- 21.9 Issues with Neural Network
- 21.10 Business Applications of Neural Networks
- 21.11 Further readings
- 21.12 References

- 22 Complexity Control - October 26, 2010
- 22.1 Lecture Summary
- 22.2 Over-fitting and Under-fitting
- 22.3 **How do we choose a good classifier?**
- 22.4 References
- 22.5 Avoid Overfitting
- 22.6 Cross-Validation
- 22.7 K-Fold Cross-Validation
- 22.8 Leave-One-Out Cross-Validation - October 28, 2010
- 22.9 Matlab Code for Cross Validation
- 22.10 Further Reading
- 22.11 References

- 23 Radial Basis Function (RBF) Network - October 28, 2010
- 24 **Model Selection for RBF Network (Stein's Unbiased Risk Estimator) - November 2nd, 2010**
- 25 **Regularization for Neural Network - November 4, 2010**
- 26 **Support Vector Machine - November 09, 2010**
- 26.1 Introduction
- 26.2 Optimal Separating Hyperplane
- 26.3 Some facts about the geometry of hyperplane
- 26.4 Writing Lagrangian Form of Support Vector Machine
- 26.5 Quadratic Programming Problem of SVMs and Dual Problem
- 26.6 Implementation
- 26.7 Hard margin SVM Algorithm
- 26.8 Multiclass SVM
- 26.9 Support Vector Machines vs Artificial Neural Networks
- 26.10 SVM packages
- 26.11 References

- 27 **Support Vector Machine Cont., Kernel Trick - November 11, 2010**
- 27.1 Upper bound for Hard Margin SVM in MATLAB's Quadprog
- 27.2 Examining K.K.T. conditions
- 27.3 Support Vectors
- 27.4 The support vector machine algorithm
- 27.5 Advantages of Support Vector Machines
- 27.6 Disadvantages of Support Vector Machines [109]
- 27.7 Applications of Support Vector Machines
- 27.8 Kernel Trick
- 27.9 Example in Matlab
- 27.10 Support Vector Machines as a Regression Technique
- 27.11 1-norm support vector regression
- 27.12 2-norm support vector regression
- 27.13 Extension:Support Vector Machines
- 27.14 Further reading
- 27.15 References

- 28 **Support Vector Machine, Kernel Trick - Cont. Case II - November 16, 2010**
- 28.1 **Case II: Non-separable data (Soft Margin)**
- 28.2 Forming the Lagrangian
- 28.3 Applying KKT conditions[114]
- 28.4 Objective Function
- 28.5 Constraints
- 28.6 Dual Problem or Quadratic Programming Problem
- 28.7 Recovery of Hyperplane
- 28.8 SVM algorithm for non-separable data sets
- 28.9 Support Vectors
- 28.10 Support Vectors Machine Demo Tool
- 28.11 Relevance Vector Machines
- 28.12 Further reading on the Kernel Trick

- 29 **Naive Bayes, K Nearest Neighbours, Boosting, Bagging and Decision Trees - November 18, 2010**
- 30 **Project Presentations - November 23, 2010**
- 31 Supervised PCA - December 3, 2010

## Schedule of Project Presentations

## Proposal Fall 2010

## Mark your contribution here

## Editor sign up


## Digest

**Reference Textbook**

The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition (February 2009), by Trevor Hastie, Robert Tibshirani, and Jerome Friedman (a 3rd edition is available).

**Classification - September 21, 2010**

### Classification

**Statistical classification**, or simply known as classification, is an area of supervised learning that addresses the problem of how to systematically assign unlabeled (classes unknown) novel data to their labels (classes or groups or types) by using knowledge of their features (characteristics or attributes) that are obtained from observation and/or measurement. A classifier is a specific technique or method for performing classification.
To classify new data, a classifier first uses labeled (classes are known) training data to train a model, and then it uses a function known as its classification rule to assign a label to each new data input after feeding the input's known feature values into the model to determine how much the input belongs to each class.

Classification has been an important task for people and society since the beginnings of history. According to this link, the earliest application of classification in human society was probably done by prehistoric peoples to recognize which wild animals were beneficial to people and which ones were harmful, and the earliest systematic use of classification was by the famous Greek philosopher Aristotle (384 BC - 322 BC) when he, for example, grouped all living things into the two groups of plants and animals. Classification is generally regarded as one of four major areas of statistics, with the other three major areas being regression, clustering, and dimensionality reduction (feature extraction or manifold learning). Note that some people consider classification to be a broad area that consists of both supervised and unsupervised methods of classifying data. In this view, as can be seen in this link, clustering is simply a special case of classification and it may be called **unsupervised classification**.

In **classical statistics**, classification techniques were developed to learn useful information from small data sets, where there is usually not enough data. When machine learning developed after the application of computers to statistics, classification techniques were developed to work with very large data sets, where there is usually too much data. A major challenge facing data mining using machine learning is how to efficiently find useful patterns in very large amounts of data. An interesting quote that describes this problem quite well is the following one, made by the retired Yale University librarian Rutherford D. Rogers (a source of which can be found here).

"We are drowning in information and starving for knowledge."- Rutherford D. Rogers

In the Information Age, machine learning when it is combined with efficient classification techniques can be very useful for data mining using very large data sets. This is most useful when the structure of the data is not well understood but the data nevertheless exhibit strong statistical regularity. Areas in which machine learning and classification have been successfully used together include search and recommendation (e.g. Google, Amazon), automatic speech recognition and speaker verification, medical diagnosis, analysis of gene expression, drug discovery etc.

The formal mathematical definition of classification is as follows:

**Definition**: Classification is the prediction of a discrete random variable [math] \mathcal{Y} [/math] from another random variable [math] \mathcal{X} [/math], where [math] \mathcal{Y} [/math] represents the label assigned to a new data input and [math] \mathcal{X} [/math] represents the known feature values of the input.

A set of training data used by a classifier to train its model consists of [math]\,n[/math] independently and identically distributed (i.i.d.) ordered pairs [math]\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}[/math], where the [math]\,ith[/math] training input's feature vector [math]\,X_{i} = (\,X_{i1}, \dots , X_{id}) \in \mathcal{X} \subset \mathbb{R}^{d}[/math] is a *d*-dimensional vector and the label of the [math]\, ith[/math] training input is [math]\,Y_{i} \in \mathcal{Y} [/math], which can take only a finite number of values. The classification rule used by a classifier has the form [math]\,h: \mathcal{X} \mapsto \mathcal{Y} [/math]. After the model is trained, each new data input whose feature values are [math]\,x[/math] is given the label [math]\,\hat{Y}=h(x)[/math].

As an example, if we would like to classify some vegetables and fruits, then our training data might look something like the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides.

After we have selected a classifier and then built our model using our training data, we could use the classifier's classification rule [math]\ h [/math] to classify any newly-given vegetable or fruit such as the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides after first obtaining its feature values.

As another example, suppose we wish to classify newly-given fruits into apples and oranges by considering three features of a fruit that comprise its color, its diameter, and its weight. After selecting a classifier and constructing a model using training data [math]\,\{(X_{color, 1}, X_{diameter, 1}, X_{weight, 1}, Y_{1}), \dots , (X_{color, n}, X_{diameter, n}, X_{weight, n}, Y_{n})\}[/math], we could then use the classifier's classification rule [math]\,h[/math] to assign any newly-given fruit having known feature values [math]\,x = (\,x_{color}, x_{diameter} , x_{weight})[/math] the label [math]\, \hat{Y}=h(x) \in \mathcal{Y}= \{apple,orange\}[/math].
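To make the classification rule [math]\,h[/math] concrete, here is a toy rule for the fruit example (a Python sketch; the thresholds and feature encoding are invented purely for illustration, not learned from data):

```python
def h(x):
    """A toy classification rule for the fruit example; x is a dict of
    feature values (colour, diameter in cm, weight in g). The thresholds
    below are made up for illustration, not learned from training data."""
    if x["color"] == "orange" and x["diameter"] > 6:
        return "orange"
    return "apple"

print(h({"color": "red", "diameter": 7, "weight": 150}))     # apple
print(h({"color": "orange", "diameter": 8, "weight": 180}))  # orange
```

A real classifier would of course fit such a rule to the labelled training pairs rather than hard-code it.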

### Examples of Classification

• Email spam filtering (spam vs not spam).

• Detecting credit card fraud (fraudulent or legitimate).

• Face detection in images (face or background).

• Web page classification (sports vs politics vs entertainment etc).

• Steering an autonomous car across the US (turn left, right, or go straight).

• Medical diagnosis (classification of disease based on observed symptoms).

### Independent and Identically Distributed (iid) Data Assumption

Suppose that we have training data [math]\,X[/math] containing [math]\,n[/math] data points. The independent and identically distributed (i.i.d.) assumption states that the data points are drawn independently from identical distributions. This assumption implies that the ordering of the data points does not matter, and it is used in many classification problems. For an example of data that is not i.i.d., consider daily temperature: today's temperature is not independent of yesterday's temperature; rather, the temperatures of the two days are strongly correlated.

### Error rate

The **empirical error rate** (or **training error rate**) of a classifier having classification rule [math]\,h[/math] is defined as the frequency at which [math]\,h[/math] does not correctly classify the data inputs in the training set, i.e., it is defined as
[math]\,\hat{L}_{n} = \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})[/math], where [math]\,I[/math] is the indicator function and [math]\,I = \left\{\begin{matrix} 1 &\text{if } h(X_i) \neq Y_i \\ 0 &\text{if } h(X_i) = Y_i \end{matrix}\right.[/math]. Here,
[math]\,X_{i} \in \mathcal{X}[/math] and [math]\,Y_{i} \in \mathcal{Y}[/math] are the known feature values and the true class of the [math]\,i[/math]th training input, respectively.
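The formula for [math]\,\hat{L}_{n}[/math] above translates directly into code. The threshold rule and the small labelled sample below are hypothetical, chosen only to exercise the computation.

```python
# Empirical error rate: L_hat = (1/n) * sum I(h(X_i) != Y_i),
# computed for a hypothetical rule on a tiny labelled sample.

def empirical_error_rate(h, X, Y):
    n = len(X)
    return sum(1 for x, y in zip(X, Y) if h(x) != y) / n

# Hypothetical rule: classify x as 1 when x > 0.
h = lambda x: 1 if x > 0 else 0
X = [-2.0, -0.5, 0.3, 1.2, -1.1]
Y = [0, 1, 1, 1, 0]            # one point (x = -0.5) is misclassified

print(empirical_error_rate(h, X, Y))  # -> 0.2
```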

The **true error rate** [math]\,L(h)[/math] of a classifier having classification rule [math]\,h[/math] is defined as the probability that [math]\,h[/math] does not correctly classify any new data input, i.e., it is defined as [math]\,L(h)=P(h(X) \neq Y)[/math]. Here, [math]\,X \in \mathcal{X}[/math] and [math]\,Y \in \mathcal{Y}[/math] are the known feature values and the true class of that input, respectively.

In practice, the empirical error rate is computed to estimate the true error rate, whose exact value cannot be known because the underlying distribution of the data is unknown and can only be estimated from available data. The empirical error rate estimates the true error rate quite well in that, when it is computed on data independent of the training set, it is an unbiased estimator of the true error rate; computed on the training data itself, it tends to be optimistically biased.

An Error Rate Comparison of Classification Methods [1]

### Decision Theory

We can identify three distinct approaches to solving decision problems, all of which have been used in practical applications. In decreasing order of complexity, they are:

a. First solve the inference problem of determining the class-conditional densities [math]\ p(x|C_k)[/math] for each class [math]\ C_k[/math] individually. Also separately infer the prior class probabilities [math]\ p(C_k)[/math]. Then use Bayes’ theorem in the form

[math]\begin{align}p(C_k|x)=\frac{p(x|C_k)p(C_k)}{p(x)} \end{align}[/math]

to find the posterior class probabilities [math]\ p(C_k|x)[/math]. As usual, the denominator in Bayes’ theorem can be found in terms of the quantities appearing in the numerator, because

[math]\begin{align}p(x)=\sum_{k} p(x|C_k)p(C_k) \end{align}[/math]

Equivalently, we can model the joint distribution [math]\ p(x, C_k)[/math] directly and then normalize to obtain the posterior probabilities. Having found the posterior probabilities, we use decision theory to determine class membership for each new input [math]\ x[/math]. Approaches that explicitly or implicitly model the distribution of inputs as well as outputs are known as "generative models", because by sampling from them it is possible to generate synthetic data points in the input space.

b. First solve the inference problem of determining the posterior class probabilities [math]\ p(C_k|x)[/math], and then subsequently use decision theory to assign each new [math]\ x[/math] to one of the classes. Approaches that model the posterior probabilities directly are called "discriminative models".

c. Find a function [math]\ f(x)[/math], called a discriminant function, which maps each input [math]\ x[/math] directly onto a class label. For instance, in the case of two-class problems, [math]\ f(.)[/math] might be binary valued and such that [math]\ f = 0[/math] represents class [math]\ C_1[/math] and [math]\ f = 1[/math] represents class [math]\ C_2[/math]. In this case, probabilities play no role.

This topic is drawn from *Pattern Recognition and Machine Learning* by Christopher M. Bishop (Chapter 1).

### Bayes Classifier

A Bayes classifier is a simple probabilistic classifier based on applying Bayes' Theorem (from Bayesian statistics) with strong (naive) independence assumptions. A more descriptive term for the underlying probability model would be "independent feature model".

In simple terms, a Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or upon the existence of the other features, a Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.

Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a supervised learning setting. In many practical applications, parameter estimation for Bayes models uses the method of maximum likelihood; in other words, one can work with the naive Bayes model without believing in Bayesian probability or using any Bayesian methods.

In spite of their naive design and apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. In 2004, analysis of the Bayesian classification problem showed that there are some theoretical reasons for the apparently unreasonable efficacy of Bayes classifiers [1]. Still, a comprehensive comparison with other classification methods in 2006 showed that Bayes classification is outperformed by more recent approaches, such as boosted trees or random forests [2].

An advantage of the naive Bayes classifier is that it requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because independent variables are assumed, only the variances of the variables for each class need to be determined and not the entire covariance matrix.

After training its model using training data, the **Bayes classifier** classifies any new data input in two steps. First, it uses the input's known feature values and the Bayes formula to calculate the input's posterior probability of belonging to each class. Then, it uses its classification rule to place the input into the most-probable class, which is the one associated with the input's largest posterior probability.

In mathematical terms, for a new data input having feature values [math]\,(X = x)\in \mathcal{X}[/math], the Bayes classifier labels the input as [math](Y = y) \in \mathcal{Y}[/math], such that the input's posterior probability [math]\,P(Y = y|X = x)[/math] is maximum over all of the members of [math]\mathcal{Y}[/math].

Suppose there are [math]\,k[/math] classes and we are given a new data input having feature values [math]\,x[/math]. The following derivation shows how the Bayes classifier finds the input's posterior probability [math]\,P(Y = y|X = x)[/math] of belonging to each class [math] y \in \mathcal{Y} [/math].

- [math] \begin{align} P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\ &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)} \end{align} [/math]

Here, [math]\,P(Y=y|X=x)[/math] is known as the posterior probability as mentioned above, [math]\,P(Y=y)[/math] is known as the prior probability, [math]\,P(X=x|Y=y)[/math] is known as the likelihood, and [math]\,P(X=x)[/math] is known as the evidence.

In the special case where there are two classes, i.e., [math]\, \mathcal{Y}=\{0, 1\}[/math], the Bayes classifier makes use of the function [math]\,r(x)=P\{Y=1|X=x\}[/math] which is the posterior probability of a new data input having feature values [math]\,x[/math] belonging to the class [math]\,Y = 1[/math]. Following the above derivation for the posterior probabilities of a new data input, the Bayes classifier calculates [math]\,r(x)[/math] as follows:

- [math] \begin{align} r(x)&=P(Y=1|X=x) \\ &=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\ &=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)} \end{align} [/math]

The Bayes classifier's classification rule [math]\,h^*: \mathcal{X} \mapsto \mathcal{Y}[/math], then, is

- [math]\, h^*(x)= \left\{\begin{matrix} 1 &\text{if } \hat r(x)\gt \frac{1}{2} \\ 0 &\mathrm{otherwise} \end{matrix}\right.[/math].

Here, [math]\,x[/math] is the feature values of a new data input and [math]\hat r(x)[/math] is the estimated value of the function [math]\,r(x)[/math] given by the Bayes classifier's model after feeding [math]\,x[/math] into the model. Still in this special case of two classes, the Bayes classifier's decision boundary is defined as the set [math]\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}[/math]. The decision boundary [math]\,D(h)[/math] essentially combines together the trained model and the decision function [math]\,h^*[/math], and it is used by the Bayes classifier to assign any new data input to a label of either [math]\,Y = 0[/math] or [math]\,Y = 1[/math] depending on which side of the decision boundary the input lies in. From this decision boundary, it is easy to see that, in the case where there are two classes, the Bayes classifier's classification rule can be re-expressed as

- [math]\, h^*(x)= \left\{\begin{matrix} 1 &\text{if } P(Y=1|X=x)\gt P(Y=0|X=x) \\ 0 &\mathrm{otherwise} \end{matrix}\right.[/math].
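The two-class rule above can be sketched numerically. The one-dimensional Gaussian likelihoods, their means and standard deviations, and the equal priors below are illustrative assumptions, not quantities from the lecture.

```python
# Two-class Bayes rule: label 1 iff r(x) = P(Y=1 | X=x) > 1/2.
# The Gaussian class-conditional densities and the priors are assumptions.

from math import exp, pi, sqrt

def gaussian_pdf(x, mu, sigma):
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

def r(x, pi1=0.5, pi0=0.5):
    """Posterior P(Y=1 | X=x) computed via Bayes' formula."""
    f1 = gaussian_pdf(x, mu=2.0, sigma=1.0)   # P(X=x | Y=1)
    f0 = gaussian_pdf(x, mu=-2.0, sigma=1.0)  # P(X=x | Y=0)
    return f1 * pi1 / (f1 * pi1 + f0 * pi0)

def h_star(x):
    return 1 if r(x) > 0.5 else 0

print(h_star(1.7))   # -> 1 (closer to the class-1 mean)
print(h_star(-0.4))  # -> 0
```

With equal priors and symmetric means, the decision boundary sits at [math]\,x = 0[/math], where [math]\,r(0) = \frac{1}{2}[/math].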

**Bayes Classification Rule Optimality Theorem**
The Bayes classifier is the optimal classifier in that it results in the least possible true probability of misclassification for any given new data input, i.e., for any generic classifier having classification rule [math]\,h[/math], it is always true that [math]\,L(h^*(x)) \le L(h(x))[/math]. Here, [math]\,L[/math] represents the true error rate, [math]\,h^*[/math] is the Bayes classifier's classification rule, and [math]\,x[/math] is any given data input's feature values.

Although the Bayes classifier is optimal in the theoretical sense, other classifiers may nevertheless outperform it in practice. The reason is that the various components making up the Bayes classifier's model, such as the likelihood and prior probabilities, must either be estimated using training data or be guessed with a certain degree of belief. As a result, the estimated values of the components in the trained model may deviate considerably from their true population values, which can ultimately cause the calculated posterior probabilities of inputs to deviate considerably from their true values. Moreover, estimating all of these probability functions (the likelihood, the prior probability, and the evidence) is computationally very expensive, which also makes some other classifiers more favorable than the Bayes classifier.

A detailed proof of this theorem is available here.

**Defining the classification rule:**

In the special case of two classes, the Bayes classifier can use three main approaches to define its classification rule [math]\,h^*[/math]:

- 1) Empirical Risk Minimization: Choose a set of classifiers [math]\mathcal{H}[/math] and find [math]\,h^*\in \mathcal{H}[/math] that minimizes some estimate of the true error rate [math]\,L(h^*)[/math].

- 2) Regression: Find an estimate [math] \hat r [/math] of the function [math]\,r(x)[/math] and define
- [math]\, h^*(x)= \left\{\begin{matrix} 1 &\text{if } \hat r(x)\gt \frac{1}{2} \\ 0 &\mathrm{otherwise} \end{matrix}\right.[/math].

- 3) Density Estimation: Estimate [math]\,P(X=x|Y=0)[/math] from the [math]\,X_{i}[/math]'s for which [math]\,Y_{i} = 0[/math], estimate [math]\,P(X=x|Y=1)[/math] from the [math]\,X_{i}[/math]'s for which [math]\,Y_{i} = 1[/math], and estimate [math]\,P(Y = 1)[/math] as [math]\,\frac{1}{n} \sum_{i=1}^{n} Y_{i}[/math]. Then, calculate [math]\,\hat r(x) = \hat P(Y=1|X=x)[/math] and define
- [math]\, h^*(x)= \left\{\begin{matrix} 1 &\text{if } \hat r(x)\gt \frac{1}{2} \\ 0 &\mathrm{otherwise} \end{matrix}\right.[/math].

Typically, the Bayes classifier uses approach 3 to define its classification rule. These three approaches can easily be generalized to the case where the number of classes exceeds two.

**Multi-class classification:**

Suppose there are [math]\,k[/math] classes, where [math]\,k \ge 2[/math].

In the above discussion, we introduced the *Bayes formula* for this general case:

- [math] \begin{align} P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)} \end{align} [/math]

which can be rewritten as:

- [math] \begin{align} P(Y=y|X=x) &=\frac{f_y(x)\pi_y}{\Sigma_{\forall i \in \mathcal{Y}} f_i(x)\pi_i} \end{align} [/math]

Here, [math]\,f_y(x) = P(X=x|Y=y)[/math] is known as the likelihood function and [math]\,\pi_y = P(Y=y)[/math] is known as the prior probability.

In the general case where there are at least two classes, the Bayes classifier uses the following theorem to assign any new data input having feature values [math]\,x[/math] into one of the [math]\,k[/math] classes.

**Theorem**

- Suppose that [math] \mathcal{Y}= \{1, \dots, k\}[/math], where [math]\,k \ge 2[/math]. Then, the optimal classification rule is [math]\,h^*(x) = \arg\max_{i} P(Y=i|X=x)[/math], where [math]\,i \in \{1, \dots, k\}[/math].
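A sketch of this argmax rule: since the evidence [math]\,P(X=x)[/math] is common to every class, it suffices to maximize [math]\,f_i(x)\pi_i[/math] over the classes. The likelihood functions and priors below are made-up numbers for three hypothetical classes.

```python
# Multi-class Bayes rule: h*(x) = argmax_i P(Y=i | X=x).
# The evidence P(X=x) is common to all classes, so maximizing the
# posterior is equivalent to maximizing f_i(x) * pi_i.
# The likelihoods and priors here are illustrative assumptions.

def bayes_classify(x, likelihoods, priors):
    """likelihoods: dict class -> f_i(x); priors: dict class -> pi_i."""
    return max(likelihoods, key=lambda i: likelihoods[i](x) * priors[i])

likelihoods = {1: lambda x: 0.6 if x == "a" else 0.1,
               2: lambda x: 0.3 if x == "a" else 0.5,
               3: lambda x: 0.1 if x == "a" else 0.4}
priors = {1: 0.2, 2: 0.5, 3: 0.3}

# f_1 * pi_1 = 0.12, f_2 * pi_2 = 0.15, f_3 * pi_3 = 0.03
print(bayes_classify("a", likelihoods, priors))  # -> 2
```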

**Example:**
We are going to predict if a particular student will pass STAT 441/841. There are two classes represented by [math]\, \mathcal{Y}\in \{ 0,1 \} [/math], where 1 refers to *pass* and 0 refers to *fail*. Suppose that the prior probabilities are estimated or guessed to be [math]\,\hat P(Y = 1) = \hat P(Y = 0) = 0.5[/math]. We have data on past student performances, which we shall use to train the model. For each student, we know the following:

- Whether or not the student’s GPA was greater than 3.0 (G).
- Whether or not the student had a strong math background (M).
- Whether or not the student was a hard worker (H).
- Whether or not the student passed or failed the course.
*Note: these are the known y values in the training data.*

These known data are summarized in the following tables:

For each student, his/her feature values is [math]\, x = \{G, M, H\} [/math] and his or her class is [math]\, y \in \{0, 1\} [/math].

Suppose there is a new student having feature values [math]\, x = \{0, 1, 0\}[/math], and we would like to predict whether he/she would pass the course. [math]\,\hat r(x)[/math] is found as follows:

[math]\, \hat r(x) = P(Y=1|X =(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=0)P(Y=0)+P(X=(0,1,0)|Y=1)P(Y=1)}=\frac{0.05*0.5}{0.05*0.5+0.2*0.5}=\frac{0.025}{0.125}=\frac{1}{5}\lt \frac{1}{2}.[/math]

The Bayes classifier assigns the new student into the class [math]\, h^*(x)=0 [/math]. Therefore, we predict that the new student would fail the course.
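The arithmetic of this example can be checked directly; the likelihood values are the ones stated in the text.

```python
# Reproducing the STAT 441/841 pass/fail example numerically.
# The likelihoods P(X=(0,1,0) | Y=y) below are taken from the text.

p_x_given_pass = 0.05   # P(X=(0,1,0) | Y=1)
p_x_given_fail = 0.20   # P(X=(0,1,0) | Y=0)
p_pass = p_fail = 0.5   # prior probabilities

r = (p_x_given_pass * p_pass) / (p_x_given_pass * p_pass
                                 + p_x_given_fail * p_fail)
print(r)                    # -> 0.2, i.e. 1/5 < 1/2
print(1 if r > 0.5 else 0)  # -> 0: predict the student fails
```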

**Naive Bayes Classifier:**

The naive Bayes classifier is a special (simpler) case of the Bayes classifier. It uses an extra assumption: that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. This assumption allows for an easier likelihood function [math]\,f_y(x)[/math] in the equation:

- [math] \begin{align} P(Y=y|X=x) &=\frac{f_y(x)\pi_y}{\Sigma_{\forall i \in \mathcal{Y}} f_i(x)\pi_i} \end{align} [/math]

The simpler form of the likelihood function used by the naive Bayes classifier is:

- [math] \begin{align} f_y(x) = P(X=x|Y=y) = {\prod_{i=1}^{d} P(X_{i}=x_{i}|Y=y)} \end{align} [/math]

The Bayes classifier taught in class was not the naive Bayes classifier.
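The factorized likelihood is a simple product over the [math]\,d[/math] features. The per-feature probability tables below, for two binary features of one hypothetical class, are made-up values.

```python
# Naive Bayes: the class-conditional likelihood factorizes over the
# d features, f_y(x) = prod_i P(X_i = x_i | Y = y).
# The per-feature probability tables are illustrative assumptions.

from functools import reduce

def naive_likelihood(x, per_feature_probs):
    """per_feature_probs[i][v] = P(X_i = v | Y = y) for one class y."""
    return reduce(lambda acc, p: acc * p,
                  (per_feature_probs[i][xi] for i, xi in enumerate(x)),
                  1.0)

# Two binary features; conditional probabilities for one class y.
probs_y = [{0: 0.7, 1: 0.3},
           {0: 0.4, 1: 0.6}]

print(naive_likelihood((1, 0), probs_y))  # 0.3 * 0.4
```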

### Bayesian vs. Frequentist

The Bayesian view of probability and the frequentist view of probability are the two major schools of thought in the field of statistics regarding how to interpret the probability of an event.

The Bayesian view of probability states that, for any event E, there is a prior probability that represents how believable event E's occurrence is before anything is known about other events that could affect it. Theoretically, this prior probability is a *belief* that represents the baseline probability of event E's occurrence. In practice, however, event E's prior probability is unknown, so it must either be guessed at or estimated from a sample of available data. After obtaining a guessed or estimated value of event E's prior probability, the Bayesian view holds that the probability, that is, the believability of event E's occurrence, can always be made more accurate as new information regarding events relevant to event E becomes available, and that the accuracy of this estimate increases as more such information becomes available. The Bayesian view therefore holds that there is no *intrinsic* probability of occurrence associated with any event. If one adheres to the Bayesian view, one can, for instance, predict tomorrow's weather as having a probability of, say, [math]\,50\%[/math] for rain. The Bayes classifier described above is a good example of a classifier developed from the Bayesian view of probability. The earliest works that laid the framework for the Bayesian view of probability are credited to Thomas Bayes (1702–1761).

In contrast to the Bayesian view, the frequentist view of probability holds that there is an *intrinsic* probability of occurrence associated with every event to which one can carry out many, if not an infinite number of, well-defined independent random trials. In each trial, the event either occurs or does not occur. Suppose [math]n_x[/math] denotes the number of times that an event occurs during its trials and [math]n_t[/math] denotes the total number of trials carried out for the event. The frequentist view holds that, in the *long run*, as the number of trials approaches infinity, one could theoretically approach the intrinsic value of the event's probability of occurrence to any arbitrary degree of accuracy, i.e., [math]P(x) = \lim_{n_t\rightarrow \infty}\frac{n_x}{n_t}[/math]. In practice, however, one can only carry out a finite number of trials, so the probability of the event's occurrence can only be approximated as [math]P(x) \approx \frac{n_x}{n_t}[/math]. If one adheres to the frequentist view, one cannot, for instance, predict the probability that there will be rain tomorrow, because one cannot possibly carry out trials for an event set in the future. The founder of the frequentist school of thought is arguably the famous Greek philosopher Aristotle. In his work *Rhetoric*, Aristotle gave the famous line "*the probable is that which for the most part happens*".
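The long-run frequency idea can be sketched with a small simulation; the fair-coin probability [math]\,p = 0.5[/math] and the fixed seed are arbitrary assumptions made so the run is reproducible.

```python
# Frequentist view: the relative frequency n_x / n_t approaches the
# intrinsic probability as the number of trials grows.
# A simulated fair coin (p = 0.5) with a fixed seed for reproducibility.

import random

def relative_frequency(p, n_t, seed=0):
    rng = random.Random(seed)
    n_x = sum(rng.random() < p for _ in range(n_t))
    return n_x / n_t

for n_t in (100, 10_000, 1_000_000):
    print(n_t, relative_frequency(0.5, n_t))  # the ratio settles near 0.5
```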

More information regarding the Bayesian and the frequentist schools of thought are available here. Furthermore, an interesting and informative youtube video that explains the Bayesian and frequentist views of probability is available here.

There is useful information about machine learning, neural and statistical classification in this link [2] Machine Learning, Neural and Statistical Classification: it contains a description of classification in chapter 2, classical statistical methods in chapter 3, and modern statistical techniques in chapter 4.

### Extension: Statistical Classification Framework

In statistical classification, each object is represented by a measurement vector of [math]\,d[/math] features, and the goal of the classifier becomes finding compact and disjoint regions for the classes in the [math]\,d[/math]-dimensional feature space. Such decision regions are defined by decision rules that are known or can be trained. The simplest configuration of a classifier consists of a decision rule and multiple membership functions, with each membership function representing a class. The following figures illustrate this general framework.

Simple Conceptual Classifier.

Statistical Classification Framework

The classification process can be described as follows:

- A measurement vector is input to each membership function.
- Each membership function feeds its membership score to the decision rule.
- The decision rule compares the membership scores and returns a class label.

**Linear and Quadratic Discriminant Analysis**

### Introduction

**Linear discriminant analysis** (LDA) and the related **Fisher's linear discriminant** are methods used in statistics, pattern recognition and machine learning to find a linear combination of features which characterize or separate two or more classes of objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification.

LDA is also closely related to principal component analysis (PCA) and factor analysis in that both look for linear combinations of variables which best explain the data. LDA explicitly attempts to model the difference between the classes of data. PCA on the other hand does not take into account any difference in class, and factor analysis builds the feature combinations based on differences rather than similarities. Discriminant analysis is also different from factor analysis in that it is not an interdependence technique: a distinction between independent variables and dependent variables (also called criterion variables) must be made.

LDA works when the measurements made on independent variables for each observation are continuous quantities. When dealing with categorical independent variables, the equivalent technique is **discriminant correspondence analysis**.

### Content

First, we shall limit ourselves to the case where there are two classes, i.e. [math]\, \mathcal{Y}=\{0, 1\}[/math]. In the above discussion, we introduced the Bayes classifier's *decision boundary* [math]\,D(h^*)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}[/math], which represents a hyperplane that determines the class of any new data input depending on which side of the hyperplane the input lies in. Now, we shall look at how to derive the Bayes classifier's decision boundary under certain assumptions of the data. Linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA) are two of the most well-known ways for deriving the Bayes classifier's decision boundary, and we shall look at each of them in turn.

Let us denote the likelihood [math]\ P(X=x|Y=y) [/math] as [math]\ f_y(x) [/math] and the prior probability [math]\ P(Y=y) [/math] as [math]\ \pi_y [/math].

First, we shall examine LDA. As explained above, the Bayes classifier is optimal. However, in practice, the prior and conditional densities are not known. Under LDA, one gets around this problem by assuming that both classes have multivariate normal (Gaussian) distributions and that the two classes share the same covariance matrix [math]\, \Sigma[/math]. Under the assumptions of LDA, we have: [math]\ P(X=x|Y=y) = f_y(x) = \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_y)^\top \Sigma^{-1} (x - \mu_y) \right)[/math]. Now, to derive the Bayes classifier's decision boundary using LDA, we equate [math]\, P(Y=1|X=x) [/math] to [math]\, P(Y=0|X=x) [/math] and proceed from there. The derivation of [math]\,D(h^*)[/math] is as follows:

- [math]\,Pr(Y=1|X=x)=Pr(Y=0|X=x)[/math]
- [math]\,\Rightarrow \frac{Pr(X=x|Y=1)Pr(Y=1)}{Pr(X=x)}=\frac{Pr(X=x|Y=0)Pr(Y=0)}{Pr(X=x)}[/math] (using Bayes' Theorem)
- [math]\,\Rightarrow Pr(X=x|Y=1)Pr(Y=1)=Pr(X=x|Y=0)Pr(Y=0)[/math] (canceling the denominators)
- [math]\,\Rightarrow f_1(x)\pi_1=f_0(x)\pi_0[/math]
- [math]\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0[/math]
- [math]\,\Rightarrow \exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0[/math]
- [math]\,\Rightarrow -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) + \log(\pi_1)=-\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) +\log(\pi_0)[/math] (taking the log of both sides).
- [math]\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_1^\top\Sigma^{-1}\mu_1 - 2x^\top\Sigma^{-1}\mu_1 - x^\top\Sigma^{-1}x - \mu_0^\top\Sigma^{-1}\mu_0 + 2x^\top\Sigma^{-1}\mu_0 \right)=0[/math] (expanding out)

- [math]\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( \mu_1^\top\Sigma^{-1} \mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0) \right)=0[/math] (canceling out alike terms and factoring).

It is easy to see that, under LDA, the Bayes's classifier's decision boundary [math]\,D(h^*)[/math] has the form [math]\,ax+b=0[/math] and it is linear in [math]\,x[/math]. This is where the word *linear* in linear discriminant analysis comes from.
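The final line of the derivation gives the boundary [math]\,a^\top x + b = 0[/math] with [math]\,a = \Sigma^{-1}(\mu_1-\mu_0)[/math] and [math]\,b = \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}(\mu_1^\top\Sigma^{-1}\mu_1 - \mu_0^\top\Sigma^{-1}\mu_0)[/math]. A sketch with illustrative means, shared covariance, and priors:

```python
# LDA decision boundary a^T x + b = 0 from the derivation above:
#   a = Sigma^{-1} (mu_1 - mu_0)
#   b = log(pi_1/pi_0) - (1/2)(mu_1^T Sigma^{-1} mu_1 - mu_0^T Sigma^{-1} mu_0)
# The means, shared covariance, and priors are illustrative assumptions.

import numpy as np

mu0 = np.array([0.0, 0.0])
mu1 = np.array([2.0, 2.0])
Sigma = np.array([[1.0, 0.0],
                  [0.0, 1.0]])   # shared covariance matrix
pi0 = pi1 = 0.5

Sinv = np.linalg.inv(Sigma)
a = Sinv @ (mu1 - mu0)
b = np.log(pi1 / pi0) - 0.5 * (mu1 @ Sinv @ mu1 - mu0 @ Sinv @ mu0)

def classify(x):
    # Positive side of the hyperplane -> class 1, otherwise class 0.
    return 1 if a @ x + b > 0 else 0

print(a, b)  # boundary 2*x1 + 2*x2 - 4 = 0, i.e. the line x1 + x2 = 2
print(classify(np.array([2.0, 1.5])))  # -> 1
print(classify(np.array([0.5, 0.5])))  # -> 0
```

Note that with equal priors the midpoint of the two means, [math]\,(1, 1)[/math], lies exactly on the boundary, as expected.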

LDA under the two-classes case can easily be generalized to the general case where there are [math]\,k \ge 2[/math] classes. In the general case, suppose we wish to find the Bayes classifier's decision boundary between two classes [math]\,m[/math] and [math]\,n[/math]; then all we need to do is follow a derivation very similar to the one shown above, with the classes [math]\,1[/math] and [math]\,0[/math] replaced by the classes [math]\,m[/math] and [math]\,n[/math]. Doing so, one obtains the Bayes classifier's decision boundary [math]\,D(h^*)[/math] between classes [math]\,m[/math] and [math]\,n[/math] to be [math]\,\log(\frac{\pi_m}{\pi_n})-\frac{1}{2}\left( \mu_m^\top\Sigma^{-1}\mu_m-\mu_n^\top\Sigma^{-1}\mu_n - 2x^\top\Sigma^{-1}(\mu_m-\mu_n) \right)=0[/math]. In addition, for any two classes [math]\,m[/math] and [math]\,n[/math] between which we would like to find the decision boundary using LDA, if [math]\,m[/math] and [math]\,n[/math] have the same number of data points (and hence equal estimated priors), then the resulting decision boundary lies exactly halfway between the centers (means) of [math]\,m[/math] and [math]\,n[/math].

The Bayes classifier's decision boundary for any two classes as derived using LDA looks something like the one that can be found in this link:

Although the assumptions under LDA may not hold for most real-world data, LDA nevertheless usually performs quite well in practice, often providing near-optimal classifications. For instance, the Z-Score credit risk model designed by Edward Altman in 1968 (and revisited in 2000) is essentially a weighted LDA. This model has demonstrated an 85-90% success rate in predicting bankruptcy, and for this reason it is still in use today.

According to this link, some of the limitations of LDA include:

- LDA implicitly assumes that the data in each class has a Gaussian distribution.
- LDA implicitly assumes that the mean rather than the variance is the discriminating factor.
- LDA may over-fit the training data.

The following link provides a comparison of discriminant analysis and artificial neural networks [3]

#### Different Approaches to LDA

Data sets can be transformed and test vectors can be classified in the transformed space by two different approaches.

- Class-dependent transformation: This approach involves maximizing the ratio of between-class variance to within-class variance. The main objective is to maximize this ratio so that adequate class separability is obtained. This class-specific approach involves using two optimizing criteria for transforming the data sets independently.

- Class-independent transformation: This approach involves maximizing the ratio of overall variance to within-class variance. It uses only one optimizing criterion to transform the data sets, and hence all data points, irrespective of their class identity, are transformed using the same transform. In this type of LDA, each class is considered as a separate class against all other classes.

## Further reading

The following are some applications that use LDA and QDA:

1- Linear discriminant analysis for improved large vocabulary continuous speech recognition here

2- 2D-LDA: A statistical linear discriminant analysis for image matrix here

3- Regularization studies of linear discriminant analysis in small sample size scenarios with application to face recognition here

4- Sparse discriminant vectors are useful for supervised dimension reduction of high-dimensional data. Naive application of classical Fisher's LDA to high-dimensional, low-sample-size settings suffers from the data piling problem. In [4], a sparse LDA method is used that selects important variables for discriminant analysis and thereby yields improved classification. Introducing sparsity in the discriminant vectors is very effective in eliminating data piling and the associated overfitting problem.

**Linear and Quadratic Discriminant Analysis cont'd - September 23, 2010**

### LDA x QDA

Linear discriminant analysis[5] is a statistical method used to find the *linear combination* of features which best separate two or more classes of objects or events. It is widely applied in classifying diseases, positioning, product management, and marketing research. LDA assumes that the different classes have the same covariance matrix [math]\, \Sigma[/math].

Quadratic Discriminant Analysis[6], on the other hand, aims to find the *quadratic combination* of features. It is more general than linear discriminant analysis. Unlike LDA, QDA does not make the assumption that the different classes have the same covariance matrix [math]\, \Sigma[/math]. Instead, QDA makes the assumption that each class [math]\, k[/math] has its own covariance matrix [math]\, \Sigma_k[/math].

The derivation of the Bayes classifier's decision boundary [math]\,D(h^*)[/math] under QDA is similar to that under LDA. Again, let us first consider the two-classes case where [math]\, \mathcal{Y}=\{0, 1\}[/math]. This derivation is given as follows:

- [math]\,Pr(Y=1|X=x)=Pr(Y=0|X=x)[/math]
- [math]\,\Rightarrow \frac{Pr(X=x|Y=1)Pr(Y=1)}{Pr(X=x)}=\frac{Pr(X=x|Y=0)Pr(Y=0)}{Pr(X=x)}[/math] (using Bayes' Theorem)
- [math]\,\Rightarrow Pr(X=x|Y=1)Pr(Y=1)=Pr(X=x|Y=0)Pr(Y=0)[/math] (canceling the denominators)
- [math]\,\Rightarrow f_1(x)\pi_1=f_0(x)\pi_0[/math]
- [math]\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma_1|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma_1^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{ (2\pi)^{d/2}|\Sigma_0|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma_0^{-1} (x - \mu_0) \right)\pi_0[/math]
- [math]\,\Rightarrow \frac{1}{|\Sigma_1|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma_1^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{|\Sigma_0|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma_0^{-1} (x - \mu_0) \right)\pi_0[/math] (by cancellation)
- [math]\,\Rightarrow -\frac{1}{2}\log(|\Sigma_1|)-\frac{1}{2} (x - \mu_1)^\top \Sigma_1^{-1} (x - \mu_1)+\log(\pi_1)=-\frac{1}{2}\log(|\Sigma_0|)-\frac{1}{2} (x - \mu_0)^\top \Sigma_0^{-1} (x - \mu_0)+\log(\pi_0)[/math] (by taking the log of both sides)
- [math]\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\log(\frac{|\Sigma_1|}{|\Sigma_0|})-\frac{1}{2}\left( x^\top\Sigma_1^{-1}x + \mu_1^\top\Sigma_1^{-1}\mu_1 - 2x^\top\Sigma_1^{-1}\mu_1 - x^\top\Sigma_0^{-1}x - \mu_0^\top\Sigma_0^{-1}\mu_0 + 2x^\top\Sigma_0^{-1}\mu_0 \right)=0[/math] (by expanding out)
- [math]\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\log(\frac{|\Sigma_1|}{|\Sigma_0|})-\frac{1}{2}\left( x^\top(\Sigma_1^{-1}-\Sigma_0^{-1})x + \mu_1^\top\Sigma_1^{-1}\mu_1 - \mu_0^\top\Sigma_0^{-1}\mu_0 - 2x^\top(\Sigma_1^{-1}\mu_1-\Sigma_0^{-1}\mu_0) \right)=0[/math]

It is easy to see that, under QDA, the decision boundary [math]\,D(h^*)[/math] has the form [math]\,ax^2+bx+c=0[/math] and it is quadratic in [math]\,x[/math]. This is where the word *quadratic* in quadratic discriminant analysis comes from.
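The class-wise discriminant implied by the derivation, [math]\,\delta_y(x) = -\frac{1}{2}\log(|\Sigma_y|) -\frac{1}{2}(x-\mu_y)^\top\Sigma_y^{-1}(x-\mu_y) + \log(\pi_y)[/math], can be sketched as follows; the means and per-class covariances are illustrative assumptions.

```python
# QDA sketch: each class keeps its own covariance matrix, so the
# boundary delta_1(x) = delta_0(x) is quadratic in x.
# The class parameters below are illustrative assumptions.

import numpy as np

def qda_discriminant(x, mu, Sigma, prior):
    Sinv = np.linalg.inv(Sigma)
    return (-0.5 * np.log(np.linalg.det(Sigma))
            - 0.5 * (x - mu) @ Sinv @ (x - mu)
            + np.log(prior))

mu0, Sigma0 = np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]])
mu1, Sigma1 = np.array([2.0, 0.0]), np.array([[0.25, 0.0], [0.0, 0.25]])

def classify(x):
    d0 = qda_discriminant(x, mu0, Sigma0, prior=0.5)
    d1 = qda_discriminant(x, mu1, Sigma1, prior=0.5)
    return 1 if d1 > d0 else 0

print(classify(np.array([2.0, 0.0])))   # -> 1 (at the class-1 mean)
print(classify(np.array([-1.0, 0.0])))  # -> 0
```

Because [math]\,\Sigma_1 \neq \Sigma_0[/math], the term [math]\,x^\top(\Sigma_1^{-1}-\Sigma_0^{-1})x[/math] does not cancel and the boundary is a curve rather than a hyperplane.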

As is the case with LDA, QDA under the two-classes case can easily be generalized to the general case wher