Difference between revisions of "stat841f10"

(Linear Regression)
m (Conversion script moved page Stat841f10 to stat841f10: Converting page titles to lowercase)
 
Line 1: Line 1:
 +
==[[Schedule of Project Presentations]] ==
 +
==[[Proposal Fall 2010]] ==
 +
 +
==[[Mark your contribution here]]==
 +
 
==[[statf10841Scribe|Editor sign up]] ==
 
 
{{Cleanup|date=October 8 2010|reason=Provide a summary for each topic here.}}
 
Line 6: Line 11:
 
The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, February 2009 Trevor Hastie, Robert Tibshirani, Jerome Friedman [http://www-stat.stanford.edu/~tibs/ElemStatLearn/ (3rd Edition is available)]
 
  
== ''' Classfication-2010.09.21''' ==
+
== ''' Classification - September 21, 2010''' ==
 
 
===Lecture Summary ===
 
 
 
* Classification is an area of [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] that systematically assigns labels to new, unlabeled data based on characteristics and attributes obtained from observation.
 
* Classification is the prediction of a discrete [http://en.wikipedia.org/wiki/Random_variable random variable] <math> \mathcal{Y} </math> from another random variable <math> \mathcal{X} </math>, where <math> \mathcal{Y} </math> represents the label assigned to a new data input and <math> \mathcal{X} </math> represents the known feature values of the input. The classification rule used by a classifier has the form <math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math>.
 
* True error rate is the probability that the classification rule <math>\,h</math> misclassifies a new data input. Empirical error rate is the frequency with which the classification rule <math>\,h</math> misclassifies the data inputs in the training set. In experimental tasks the true error rate cannot be measured, so the empirical error rate is used as its estimate.
 
* Bayes Classifier is a probabilistic classifier by applying Bayes Theorem with strong (naive) independence assumptions. It has the advantage of requiring small training data to estimate the parameters needed for classification. Under this classifier an input <math>\,x</math> is classified to class <math>\,y</math> where the posterior probability for <math>\,y</math> is the largest for input <math>\,x</math>.
 
* Bayes Classification Rule Optimality Theorem states that the Bayes classifier is the optimal classifier; in other words, the true error rate of the Bayes classification rule is always smaller than or equal to that of any other classification rule.
 
 
 
* Bayes Decision Boundary is the hyperplane boundary that separates the two classes <math>\,m, n</math> obtained by setting the posterior probability for the two classes equal, <math>\,D(h)=\{x: P(Y=m|X=x)=P(Y=n|X=x)\}</math>.
 
* Linear Discriminant Analysis (LDA) for the Bayes classifier decision boundary between two classes makes the assumption that both are generated from Gaussian distribution and have the same covariance matrix.
 
* PCA is an appropriate method when you have obtained measures on a number of observed variables and wish to develop a smaller number of artificial variables (called principal components) that will account for most of the variance in the observed variables. This is a powerful technique for dimensionality reduction. It has applications in data visualization and data mining, among others. It is mostly used for data analysis and for making predictive models.
 
  
 
=== Classification ===
 
Line 24: Line 17:
 
To classify new data, a classifier first uses labeled (classes are known) [http://en.wikipedia.org/wiki/Training_set training data] to [http://en.wikipedia.org/wiki/Mathematical_model#Training train] a model, and then it uses a function known as its [http://en.wikipedia.org/wiki/Decision_rule classification rule] to assign a label to each new data input after feeding the input's known feature values into the model to determine how much the input belongs to each class.
 
  
Classification has been an important task for people and society since the beginnings of history. According to [http://www.schools.utah.gov/curr/science/sciber00/7th/classify/sciber/history.htm this link], the earliest application of classification in human society was probably done by prehistory peoples for recognizing which wild animals were beneficial to people and which one were harmful, and the earliest systematic use of classification was done by the famous Greek philosopher Aristotle (384BC - 322 BC) when he, for example, grouped all living things into the two groups of plants and animals. Classification is generally regarded as one of four major areas of statistics, with the other three major areas being [http://en.wikipedia.org/wiki/Regression_analysis regression regression], [http://en.wikipedia.or/wiki/Regression_analysis clustering], and [http://en.wikipedia.org/wiki/Regression_analysis dimensionality reduction] (feature extraction or manifold learning). Please be noted that some people consider classification to be a broad area that consists supervised and unsupervised methods of classifying data. In this view, as can be seen in [http://www.yale.edu/ceo/Projects/swap/landcover/Unsupervised_classification.htm this link], clustering is simply a special case of classification and it may be called '''unsupervised classification'''.
+
Classification has been an important task for people and society since the beginnings of history. According to [http://www.schools.utah.gov/curr/science/sciber00/7th/classify/sciber/history.htm this link], the earliest application of classification in human society was probably made by prehistoric peoples to recognize which wild animals were beneficial to people and which ones were harmful, and the earliest systematic use of classification was made by the famous Greek philosopher Aristotle (384 BC - 322 BC) when he, for example, grouped all living things into the two groups of plants and animals. Classification is generally regarded as one of four major areas of statistics, with the other three major areas being [http://en.wikipedia.org/wiki/Regression_analysis regression], [http://en.wikipedia.org/wiki/Cluster_analysis clustering], and [http://en.wikipedia.org/wiki/Dimension_reduction dimensionality reduction] (feature extraction or manifold learning). Note that some people consider classification to be a broad area that consists of both supervised and unsupervised methods of classifying data. In this view, as can be seen in [http://www.yale.edu/ceo/Projects/swap/landcover/Unsupervised_classification.htm this link], clustering is simply a special case of classification and it may be called '''unsupervised classification'''.
 +
 
 +
In '''classical statistics''', classification techniques were developed to learn useful information from small data sets, where there is usually not enough data. When [http://en.wikipedia.org/wiki/Machine_learning machine learning] was developed after the application of computers to statistics, classification techniques were developed to work with very large data sets, where there is usually too much data. A major challenge facing data mining using machine learning is how to efficiently find useful patterns in very large amounts of data. An interesting quote that describes this problem quite well is the following one made by the retired Yale University Librarian Rutherford D. Rogers, a link to a source of which can be found [http://www.e-knowledge.ca/quotes.php?topic=Knowledge here].
  
In '''classical statistics''', classification techniques were developed to learn useful information from small data sets, where there is usually not enough data. When [http://en.wikipedia.org/wiki/Machine_learning machine learning] was developed after the application of computers to statistics, classification techniques were developed to work with very large data sets, where there is usually too much data. A major challenge facing data mining using machine learning is how to efficiently find useful patterns in very large amounts of data. An interesting quote that describes this problem quite well is the following one made by the retired Yale University Librarian Rutherford D. Rogers.
 
{{Cleanup|date=October 7th 2010|reason=We need a source for the following quote}}
 
 
         ''"We are drowning in information and starving for knowledge."''   
 
         ''"We are drowning in information and starving for knowledge."''   
                                                           - Rutherford D. Rogers
+
                                                           - Rutherford D. Rogers                        
        ''"We are drowning in information but starved for knowledge. This level of information is clearly impossible to be handled by present means. Uncontrolled and unorganized information is no longer a resource in an information society, instead it becomes the enemy."''
 
                                              -Megatrends 2000, John Naisbitt & Patricia Aburdene - Information Society - 1982, [http://www.naisbitt.com/bibliography/megatrends-2000.html],[http://www.nwlink.com/~donclark/history_knowledge/naisbitt.html]
 
  
 
In the Information Age, machine learning when it is combined with efficient classification techniques can be very useful for data mining using very large data sets. This is most useful when the structure of the data is not well understood but the data nevertheless exhibit strong statistical regularity. Areas in which machine learning and classification have been successfully used together include search and recommendation (e.g. Google, Amazon), automatic speech recognition and speaker verification, medical diagnosis, analysis of gene expression, drug discovery etc.
 
Line 39: Line 30:
 
'''Definition''': Classification is the prediction of a discrete [http://en.wikipedia.org/wiki/Random_variable random variable] <math> \mathcal{Y} </math> from another random variable <math> \mathcal{X} </math>, where <math> \mathcal{Y} </math> represents the label assigned to a new data input and <math> \mathcal{X} </math> represents the known feature values of the input.  
 
  
A set of training data used by a classifier to train its model consists of <math>\,n</math> [http://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables independently and identically distributed (i.i.d)] ordered pairs <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math>, where the values of the <math>\,ith</math> training input's feature values <math>\,X_{i} = (\,X_{i1}, \dots , X_{id}) \in \mathcal{X} \subset \mathbb{R}^{d}</math> is a ''d''-dimensional vector and the label of the <math>\, ith</math> training input is <math>\,Y_{i} \in \mathcal{Y} </math> that takes a finite number of values. The classification rule used by a classifier has the form <math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math>. After the model is trained, each new data input whose feature values is <math>\,x</math> is given the label <math>\,\hat{Y}=h(x)</math>.
+
A set of training data used by a classifier to train its model consists of <math>\,n</math> [http://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables independently and identically distributed (i.i.d)] ordered pairs <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math>, where the feature values of the <math>\,i</math>th training input, <math>\,X_{i} = (\,X_{i1}, \dots , X_{id}) \in \mathcal{X} \subset \mathbb{R}^{d}</math>, form a ''d''-dimensional vector and the label of the <math>\,i</math>th training input is <math>\,Y_{i} \in \mathcal{Y} </math>, which can take a finite number of values. The classification rule used by a classifier has the form <math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math>. After the model is trained, each new data input whose feature values are <math>\,x</math> is given the label <math>\,\hat{Y}=h(x)</math>.
  
 
As an example, if we would like to classify some vegetables and fruits, then our training data might look something like the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides.
 
Line 50: Line 41:
  
 
As another example, suppose we wish to classify newly-given fruits into apples and oranges by considering three features of a fruit that comprise its color, its diameter, and its weight. After selecting a classifier and constructing a model using training data <math>\,\{(X_{color, 1}, X_{diameter, 1}, X_{weight, 1}, Y_{1}), \dots , (X_{color, n}, X_{diameter, n}, X_{weight, n}, Y_{n})\}</math>, we could then use the classifier's classification rule <math>\,h</math> to assign any newly-given fruit having known feature values <math>\,x = (\,x_{color}, x_{diameter} , x_{weight})</math> the label <math>\, \hat{Y}=h(x) \in \mathcal{Y}= \{apple,orange\}</math>.
 
 +
 +
=== Examples of Classification ===
 +
 +
• Email spam filtering (spam vs not spam).
 +
 +
• Detecting credit card fraud (fraudulent or legitimate).
 +
 +
• Face detection in images (face or background).
 +
 +
• Web page classification (sports vs politics vs entertainment etc).
 +
 +
• Steering an autonomous car across the US (turn left, right, or go straight).
 +
 +
• Medical diagnosis (classification of disease based on observed symptoms).
 +
 +
=== Independent and Identically Distributed (iid) Data Assumption ===
 +
 +
Suppose that we have training data X which contains N data points. The Independent and Identically Distributed (IID) assumption states that the data points are drawn independently from the same distribution. This assumption implies that the ordering of the data points does not matter, and it is used in many classification problems. For an example of data that is not IID, consider daily temperature: today's temperature is not independent of yesterday's temperature; rather, there is a strong correlation between the temperatures of the two days.
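The contrast between IID data and the temperature example can be illustrated with a small simulation. The following is only a sketch under stated assumptions: the "temperature" series is a hypothetical first-order autoregressive process (today = 0.9 × yesterday + noise), not real weather data, and the helper name <code>lag1_autocorrelation</code> is made up for this illustration. It compares the correlation between consecutive values in an IID sample and in the dependent series.

```python
import random


def lag1_autocorrelation(values):
    """Sample correlation between each value and the value that follows it."""
    n = len(values)
    mean = sum(values) / n
    numerator = sum(
        (values[i] - mean) * (values[i + 1] - mean) for i in range(n - 1)
    )
    denominator = sum((v - mean) ** 2 for v in values)
    return numerator / denominator


rng = random.Random(0)  # fixed seed so the run is reproducible

# IID sample: every draw comes independently from the same N(0, 1) distribution.
iid_sample = [rng.gauss(0.0, 1.0) for _ in range(5000)]

# Hypothetical "temperature" series (assumption): each value depends on the
# previous one, so consecutive points are not independent.
temperature = [0.0]
for _ in range(4999):
    temperature.append(0.9 * temperature[-1] + rng.gauss(0.0, 1.0))

iid_corr = lag1_autocorrelation(iid_sample)
temp_corr = lag1_autocorrelation(temperature)

print("lag-1 autocorrelation, IID sample:        ", round(iid_corr, 3))
print("lag-1 autocorrelation, temperature series:", round(temp_corr, 3))
```

For the IID sample the lag-1 autocorrelation should be close to zero, while for the dependent series it should be close to the coefficient 0.9 used to generate it. Shuffling the IID sample would not change its statistical properties, whereas shuffling the temperature series would destroy its day-to-day structure.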
  
 
=== Error rate ===
 
{{Cleanup|date=October 2nd 2010|reason=It is important to notice why do we use empirical error rate instead of true error rate and why do we define it. The main reason is that in experimental tasks we can't measure true error rate and we estimate it by empirical error rate which is unbiased estimation of true error rate. -- This is what all the data-driven applications are about, empirical error}}
 
The '''true error rate''' <math>\,L(h)</math> of a classifier having classification rule <math>\,h</math> is defined as the probability that <math>\,h</math> does not correctly classify any new data input, i.e., it is defined as <math>\,L(h)=P(h(X) \neq Y)</math>. Here, <math>\,X \in \mathcal{X}</math> and <math>\,Y \in \mathcal{Y}</math> are the known feature values and the true class of that input, respectively. 
 
  
 
The '''empirical error rate''' (or '''training error rate''') of a classifier having classification rule <math>\,h</math> is defined as the frequency at which <math>\,h</math> does not correctly classify the data inputs in the training set, i.e., it is defined as
 
Line 59: Line 66:
 
 
<math>\,X_{i} \in \mathcal{X}</math> and <math>\,Y_{i} \in \mathcal{Y}</math> are the known feature values and the true class of the <math>\,ith</math> training input, respectively.
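As a rough illustration of the empirical error rate, the sketch below counts the fraction of training inputs for which <math>\,h(X_{i}) \neq Y_{i}</math>. It assumes the usual form of the definition (the average of the misclassification indicator over the <math>\,n</math> training pairs); the fruit data, the 7.5 cm threshold, and the function names are hypothetical and serve only as an example.

```python
def empirical_error_rate(h, inputs, labels):
    """Fraction of training pairs (X_i, Y_i) for which h(X_i) != Y_i."""
    if len(inputs) != len(labels) or not inputs:
        raise ValueError("inputs and labels must be non-empty and equal in length")
    mistakes = sum(1 for x, y in zip(inputs, labels) if h(x) != y)
    return mistakes / len(inputs)


def h(diameter_cm):
    """Hypothetical classification rule: large fruits are labeled oranges."""
    return "orange" if diameter_cm > 7.5 else "apple"


# Hypothetical training set: fruit diameters in cm with their true labels.
train_x = [6.0, 7.0, 8.0, 9.0, 7.8]
train_y = ["apple", "apple", "orange", "orange", "apple"]

error = empirical_error_rate(h, train_x, train_y)
print("empirical error rate:", error)
```

Here the rule misclassifies only the 7.8 cm apple, so one of the five training inputs is wrong.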
  
=== Bayes Classifier ===
 
  
A Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem (from Bayesian statistics) with strong [http://en.wikipedia.org/wiki/Naive_Bayes_classifier (naive)] independence assumptions. A more descriptive term for the underlying probability model would be "independent feature model".
+
The '''true error rate''' <math>\,L(h)</math> of a classifier having classification rule <math>\,h</math> is defined as the probability that <math>\,h</math> misclassifies a new data input, i.e., it is defined as <math>\,L(h)=P(h(X) \neq Y)</math>. Here, <math>\,X \in \mathcal{X}</math> and <math>\,Y \in \mathcal{Y}</math> are the known feature values and the true class of that input, respectively.
  
In simple terms, a Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or upon the existence of the other features, a Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.
 
  
Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] setting. In many practical applications, parameter estimation for Bayes models uses the method of [http://en.wikipedia.org/wiki/Maximum_likelihood maximum likelihood]; in other words, one can work with the naive Bayes model without believing in [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian probability] or using any Bayesian methods.
+
In practice, the empirical error rate is used to estimate the true error rate, whose value cannot be known exactly because the parameters of the underlying process are unknown and can only be estimated from available data. As mentioned [http://www.liebertonline.com/doi/pdf/10.1089/106652703321825928 here], the empirical error rate is an unbiased estimator of the true error rate. Note that this holds for a fixed rule <math>\,h</math> evaluated on data that were not used to choose it; the error measured on the same data used for training tends to be optimistic.
  
In spite of their design and apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. In 2004, analysis of the Bayesian classification problem has shown that there are some theoretical reasons for the apparently unreasonable [http://en.wikipedia.org/wiki/Efficacy efficacy] of Bayes classifiers [1]. Still, a comprehensive comparison with other classification methods in 2006 showed that Bayes classification is outperformed by more current approaches, such as [http://en.wikipedia.org/wiki/Boosted_trees boosted trees] or [http://en.wikipedia.org/wiki/Random_forests random forests][2].
+
An Error Rate Comparison of Classification Methods [http://pdfserve.informaworld.com/311525_770885140_713826662.pdf]
  
An advantage of the naive Bayes classifier is that it requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because independent variables are assumed, only the variances of the variables for each class need to be determined and not the entire [http://en.wikipedia.org/wiki/Covariance_matrix covariance matrix].
+
=== Decision Theory ===
 +
We can identify three distinct approaches to solving decision problems, all of which have been used in practical applications. These are given, in decreasing order of complexity, by:
  
After training its model using training data, the '''Bayes classifier''' classifies any new data input in two steps. First, it uses the input's known feature values and the [http://en.wikipedia.org/wiki/Bayes_formula Bayes formula] to calculate the input's [http://en.wikipedia.org/wiki/Posterior_probability posterior probability] of belonging to each class. Then, it uses its classification rule to place the input into its most-probable class, which is the one associated with the input's largest posterior probability. 
+
a. First solve the inference problem of determining the class-conditional densities <math>\ p(x|C_k)</math> for each class <math>\ C_k</math> individually. Also separately infer the prior class probabilities <math>\ p(C_k)</math>. Then use Bayes’ theorem in the form
  
In mathematical terms, for a new data input having feature values <math>\,(X = x)\in \mathcal{X}</math>, the Bayes classifier labels the input as <math>(Y = y) \in \mathcal{Y}</math>, such that the input's posterior probability <math>\,P(Y = y|X = x)</math> is maximum over all of the members of <math>\mathcal{Y}</math>.
+
<math>\begin{align}p(C_k|x)=\frac{p(x|C_k)p(C_k)}{p(x)} \end{align}</math>
  
Suppose there are <math>\,k</math> classes and we are given a new data input having feature values <math>\,x</math>. The following derivation shows how the Bayes classifier finds the input's posterior probability <math>\,P(Y = y|X = x)</math> of belonging to each class in <math>\mathcal{Y}</math>.  
+
to find the posterior class probabilities <math>\ p(C_k|x)</math>. As usual, the denominator in Bayes’ theorem can be found in terms of the quantities appearing in the numerator, because
:<math>
 
\begin{align}
 
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\
 
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}
 
\end{align}
 
</math>
 
Here, <math>\,P(Y=y|X=x)</math> is known as the posterior probability as mentioned above, <math>\,P(Y=y)</math> is known as the prior probability, <math>\,P(X=x|Y=y)</math> is known as the likelihood, and <math>\,P(X=x)</math> is known as the evidence.
 
  
In the special case where there are two classes, i.e., <math>\, \mathcal{Y}=\{0, 1\}</math>, the Bayes classifier makes use of the function <math>\,r(x)=P\{Y=1|X=x\}</math> which is the prior probability of a new data input having feature values <math>\,x</math> belonging to the class <math>\,Y = 1</math>. Following the above derivation for the posterior probabilities of a new data input, the Bayes classifier calculates <math>\,r(x)</math> as follows:  
+
<math>\begin{align}p(x)=\sum_{k} p(x|C_k)p(C_k) \end{align}</math>
 +
 
 +
Equivalently, we can model the joint distribution <math>\ p(x, C_k)</math> directly and then normalize to obtain the posterior probabilities. Having found the posterior probabilities, we use decision theory to determine class membership for each new input <math>\ x</math>. Approaches that explicitly or implicitly model the distribution of inputs as well as outputs are known as "generative models", because by sampling from them it is possible to generate synthetic data points in the input space.
 +
 
 +
b. First solve the inference problem of determining the posterior class probabilities <math>\ p(C_k|x)</math>, and then subsequently use decision theory to assign each new <math>\ x</math> to one of the classes. Approaches that model the posterior probabilities directly
 +
are called "discriminative models".
 +
 
 +
c. Find a function <math>\ f(x)</math>, called a discriminant function, which maps each input <math>\ x</math> directly onto a class label. For instance, in the case of two-class problems, <math>\ f(.)</math> might be binary valued and such that <math>\ f = 0</math> represents class <math>\ C_1</math> and <math>\ f = 1</math> represents class <math>\ C_2</math>. In this case, probabilities play no role.
 +
 
 +
This material is adapted from Pattern Recognition and Machine Learning by Christopher M. Bishop (Chapter 1).
 +
 
 +
=== Bayes Classifier ===
 +
 
 +
A Bayes classifier is a simple probabilistic classifier based on applying Bayes' Theorem (from Bayesian statistics) with strong [http://en.wikipedia.org/wiki/Naive_Bayes_classifier (naive)] independence assumptions. A more descriptive term for the underlying probability model would be "independent feature model".
 +
 
 +
In simple terms, a Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or upon the existence of the other features, a Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.
 +
 
 +
Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] setting. In many practical applications, parameter estimation for Bayes models uses the method of [http://en.wikipedia.org/wiki/Maximum_likelihood maximum likelihood]; in other words, one can work with the naive Bayes model without believing in [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian probability] or using any Bayesian methods.
 +
 
 +
In spite of their design and apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. In 2004, analysis of the Bayesian classification problem has shown that there are some theoretical reasons for the apparently unreasonable [http://en.wikipedia.org/wiki/Efficacy efficacy] of Bayes classifiers [1]. Still, a comprehensive comparison with other classification methods in 2006 showed that Bayes classification is outperformed by more current approaches, such as [http://en.wikipedia.org/wiki/Boosted_trees boosted trees] or [http://en.wikipedia.org/wiki/Random_forests random forests][2].
 +
 
 +
An advantage of the naive Bayes classifier is that it requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because independent variables are assumed, only the variances of the variables for each class need to be determined and not the entire [http://en.wikipedia.org/wiki/Covariance_matrix covariance matrix].
 +
 
 +
After training its model using training data, the '''Bayes classifier''' classifies any new data input in two steps. First, it uses the input's known feature values and the [http://en.wikipedia.org/wiki/Bayes_formula Bayes formula] to calculate the input's [http://en.wikipedia.org/wiki/Posterior_probability posterior probability] of belonging to each class. Then, it uses its classification rule to place the input into the most-probable class, which is the one associated with the input's largest posterior probability. 
 +
 
 +
In mathematical terms, for a new data input having feature values <math>\,(X = x)\in \mathcal{X}</math>, the Bayes classifier labels the input as <math>(Y = y) \in \mathcal{Y}</math>, such that the input's posterior probability <math>\,P(Y = y|X = x)</math> is maximum over all of the members of <math>\mathcal{Y}</math>.
 +
 
 +
Suppose there are <math>\,k</math> classes and we are given a new data input having feature values <math>\,x</math>. The following derivation shows how the Bayes classifier finds the input's posterior probability <math>\,P(Y = y|X = x)</math> of belonging to each class <math> y \in \mathcal{Y} </math>.
 +
:<math>
 +
\begin{align}
 +
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\
 +
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}
 +
\end{align}
 +
</math>
 +
Here, <math>\,P(Y=y|X=x)</math> is known as the posterior probability as mentioned above, <math>\,P(Y=y)</math> is known as the prior probability, <math>\,P(X=x|Y=y)</math> is known as the likelihood, and <math>\,P(X=x)</math> is known as the evidence.
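The derivation above can be sketched in a few lines of code. This is a minimal illustration, assuming the priors <math>\,P(Y=y)</math> and the likelihoods <math>\,P(X=x|Y=y)</math> for one observed <math>\,x</math> are already known; the numbers and the function name <code>posterior</code> are hypothetical. Exact fractions are used so that the arithmetic can be checked by hand.

```python
from fractions import Fraction


def posterior(priors, likelihoods):
    """Return P(Y=y | X=x) for every class y via the Bayes formula.

    priors[y]      is P(Y=y)
    likelihoods[y] is P(X=x | Y=y) for the single observed input x
    """
    joint = {y: likelihoods[y] * priors[y] for y in priors}
    evidence = sum(joint.values())  # P(X=x), the sum over all classes
    if evidence == 0:
        raise ValueError("the observed x has zero probability under every class")
    return {y: joint[y] / evidence for y in joint}


# Hypothetical numbers (assumptions for illustration only).
priors = {"apple": Fraction(3, 5), "orange": Fraction(2, 5)}
likelihoods = {"apple": Fraction(1, 10), "orange": Fraction(3, 5)}

post = posterior(priors, likelihoods)
predicted = max(post, key=post.get)  # class with the largest posterior

print({y: str(p) for y, p in post.items()})
print("predicted class:", predicted)
```

With these numbers the evidence is 3/50 + 12/50 = 3/10, so the posteriors are 1/5 for apple and 4/5 for orange, and the input is assigned to the orange class even though apple has the larger prior.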
 +
 
 +
In the special case where there are two classes, i.e., <math>\, \mathcal{Y}=\{0, 1\}</math>, the Bayes classifier makes use of the function <math>\,r(x)=P\{Y=1|X=x\}</math> which is the posterior probability of a new data input having feature values <math>\,x</math> belonging to the class <math>\,Y = 1</math>. Following the above derivation for the posterior probabilities of a new data input, the Bayes classifier calculates <math>\,r(x)</math> as follows:  
 
:<math>
 
:<math>
 
\begin{align}
 
\begin{align}
Line 99: Line 134:
 
0 &\mathrm{otherwise}  \end{matrix}\right.</math>.  
 
0 &\mathrm{otherwise}  \end{matrix}\right.</math>.  
  
Here, <math>\,x</math> is the feature values of a new data input and <math>\hat r(x)</math> is the estimated value of the function <math>\,r(x)</math> given by the Bayes classifier's model after feeding <math>\,x</math> into the model. Still in this special case of two classes, the Bayes classifier's [http://en.wikipedia.org/wiki/Decision_boundary decision boundary] is defined as the set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>. The decision boundary <math>\,D(h)</math> essentially combines together the trained model and the decision function <math>\,h</math>, and it is used by the Bayes classifier to assign any new data input to a label of either <math>\,Y = 0</math> or <math>\,Y = 1</math> depending on which side of the decision boundary the input lies in. From this decision boundary, it is easy to see that, in the case where there are two classes, the Bayes classifier's classification rule can be re-expressed as
+
Here, <math>\,x</math> is the feature values of a new data input and <math>\hat r(x)</math> is the estimated value of the function <math>\,r(x)</math> given by the Bayes classifier's model after feeding <math>\,x</math> into the model. Still in this special case of two classes, the Bayes classifier's [http://en.wikipedia.org/wiki/Decision_boundary decision boundary] is defined as the set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>. The decision boundary <math>\,D(h)</math> essentially combines together the trained model and the decision function <math>\,h^*</math>, and it is used by the Bayes classifier to assign any new data input to a label of either <math>\,Y = 0</math> or <math>\,Y = 1</math> depending on which side of the decision boundary the input lies in. From this decision boundary, it is easy to see that, in the case where there are two classes, the Bayes classifier's classification rule can be re-expressed as
  
 
:<math>\, h^*(x)= \left\{\begin{matrix}  
 
:<math>\, h^*(x)= \left\{\begin{matrix}  
Line 106: Line 141:
  
 
'''Bayes Classification Rule Optimality Theorem'''  
 
'''Bayes Classification Rule Optimality Theorem'''  
The Bayes classifier is the optimal classifier in that it produces the least possible probability of misclassification for any given new data input, i.e., for any other classifier having classification rule <math>\,h</math>, it is always true that <math>\,L(h^*(x)) \le L(h(x))</math>. Here, <math>\,L</math> represents the true error rate, <math>\,h^*</math> is the Bayes classifier's classification rule, and <math>\,x</math> is any given data input's feature values.  
+
The Bayes classifier is the optimal classifier in that it results in the least possible true probability of misclassification for any given new data input, i.e., for any generic classifier having classification rule <math>\,h</math>, it is always true that <math>\,L(h^*(x)) \le L(h(x))</math>. Here, <math>\,L</math> represents the true error rate, <math>\,h^*</math> is the Bayes classifier's classification rule, and <math>\,x</math> is any given data input's feature values.  
  
Although the Bayes classifier is optimal in the theoretical sense, other classifiers may nevertheless outperform it in practice. The reason for this is that various components which make up the Bayes classifier's model, such as the likelihood and prior probabilities, must either be estimated using training data or be guessed with a certain degree of belief, as a result, their estimated values in the trained model may deviate quite a bit from their true population values and this ultimately can cause the posterior probabilities to deviate quite a bit from their true population values. Estimation of all these probability functions, as likelihood, prior probability, and evidence function is a very expensive task, computationally, which also makes some other classifiers more favorable than Bayes classifier.
+
Although the Bayes classifier is optimal in the theoretical sense, other classifiers may nevertheless outperform it in practice. The reason for this is that various components which make up the Bayes classifier's model, such as the likelihood and prior probabilities, must either be estimated using training data or be guessed with a certain degree of belief. As a result, the estimated values of the components in the trained model may deviate quite a bit from their true population values, and this can ultimately cause the calculated posterior probabilities of inputs to deviate quite a bit from their true values. Estimating all of these probability functions (the likelihood, the prior probability, and the evidence) is also computationally expensive, which can make other classifiers more attractive than the Bayes classifier.
  
A rather detailed proof of this theorem is available [http://www.ee.columbia.edu/~vittorio/BayesProof.pdf here].
+
A detailed proof of this theorem is available [http://www.ee.columbia.edu/~vittorio/BayesProof.pdf here].
  
 
'''Defining the classification rule:'''
 
'''Defining the classification rule:'''
  
In the special case of two classes, the Bayes classifier can use three main approaches to define its classification rule <math>\,h</math>:
+
In the special case of two classes, the Bayes classifier can use three main approaches to define its classification rule <math>\,h^*</math>:
  
:1) Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^*\in \mathcal{H}</math>  that minimizes some estimate of the true error rate <math>\,L(h)</math>.
+
:1) Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^*\in \mathcal{H}</math>  that minimizes some estimate of the true error rate <math>\,L(h^*)</math>.
  
 
:2) Regression: Find an estimate <math> \hat r </math> of the function <math> \,r(x) </math> and define 
 
:2) Regression: Find an estimate <math> \hat r </math> of the function <math> \,r(x) </math> and define 
:<math>\, h(x)= \left\{\begin{matrix}  
+
:<math>\, h^*(x)= \left\{\begin{matrix}  
 
1 &\text{if }  \hat r(x)>\frac{1}{2}  \\  
 
1 &\text{if }  \hat r(x)>\frac{1}{2}  \\  
 
0 &\mathrm{otherwise}  \end{matrix}\right.</math>.
 
0 &\mathrm{otherwise}  \end{matrix}\right.</math>.
  
 
:3) Density Estimation: Estimate <math>\,P(X=x|Y=0)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 0</math>, estimate <math>\,P(X=x|Y=1)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 1</math>, and estimate <math>\,P(Y = 1)</math> as <math>\,\frac{1}{n} \sum_{i=1}^{n} Y_{i}</math>. Then, calculate <math>\,\hat r(x) = \hat P(Y=1|X=x)</math> and define   
 
:3) Density Estimation: Estimate <math>\,P(X=x|Y=0)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 0</math>, estimate <math>\,P(X=x|Y=1)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 1</math>, and estimate <math>\,P(Y = 1)</math> as <math>\,\frac{1}{n} \sum_{i=1}^{n} Y_{i}</math>. Then, calculate <math>\,\hat r(x) = \hat P(Y=1|X=x)</math> and define   
:<math>\, h(x)= \left\{\begin{matrix}  
+
:<math>\, h^*(x)= \left\{\begin{matrix}  
 
1 &\text{if }  \hat r(x)>\frac{1}{2}  \\  
 
1 &\text{if }  \hat r(x)>\frac{1}{2}  \\  
 
0 &\mathrm{otherwise}  \end{matrix}\right.</math>.
 
0 &\mathrm{otherwise}  \end{matrix}\right.</math>.
Line 153: Line 188:
 
In the general case where there are at least two classes, the Bayes classifier uses the following theorem to assign any new data input having feature values <math>\,x</math> into one of the <math>\,k</math> classes.
 
In the general case where there are at least two classes, the Bayes classifier uses the following theorem to assign any new data input having feature values <math>\,x</math> into one of the <math>\,k</math> classes.
  
''Theorem''
+
'''Theorem'''
 
: Suppose that <math> \mathcal{Y}= \{1, \dots, k\}</math>, where <math>\,k \ge 2</math>. Then, the optimal classification rule is <math>\,h^*(x) = \arg\max_{i} P(Y=i|X=x)</math>, where <math>\,i \in \{1, \dots, k\}</math>. 
 
: Suppose that <math> \mathcal{Y}= \{1, \dots, k\}</math>, where <math>\,k \ge 2</math>. Then, the optimal classification rule is <math>\,h^*(x) = \arg\max_{i} P(Y=i|X=x)</math>, where <math>\,i \in \{1, \dots, k\}</math>. 
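The theorem above can be illustrated with a minimal sketch, assuming the posterior probabilities for one input are already available. The posterior values and the function name <code>bayes_classify</code> are hypothetical and chosen only for illustration.

```python
def bayes_classify(posteriors):
    """Return the class label i that maximises P(Y=i | X=x).

    `posteriors` maps each class label to its (assumed known) posterior.
    """
    return max(posteriors, key=posteriors.get)


# Hypothetical posteriors for a single input x over k = 3 classes.
posteriors = {1: 0.2, 2: 0.5, 3: 0.3}
label = bayes_classify(posteriors)
print(label)
```

In practice the posteriors are not known and must be estimated, which is where the approaches listed earlier come in.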
  
Line 161: Line 196:
 
:Whether or not the student had a strong math background (M).
 
:Whether or not the student had a strong math background (M).
 
:Whether or not the  student was a hard worker (H).
 
:Whether or not the  student was a hard worker (H).
:Whether or not the student passed or failed the course.
+
:Whether or not the student passed or failed the course. ''Note: these are the known y values in the training data.''
  
 
These known data are summarized in the following tables:
 
These known data are summarized in the following tables:
Line 172: Line 207:
  
 
<br />
 
<br />
<math>\, \hat r(x) = P(Y=1|X =(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=0)P(Y=0)+P(X=(0,1,0)|Y=1)P(Y=1)}=\frac{0.05*0.5}{0.05*0.5+0.2*0.5}=\frac{0.025}{0.075}=\frac{1}{3}<\frac{1}{2}.</math><br />
+
<math>\, \hat r(x) = P(Y=1|X =(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=0)P(Y=0)+P(X=(0,1,0)|Y=1)P(Y=1)}=\frac{0.05*0.5}{0.05*0.5+0.2*0.5}=\frac{0.025}{0.125}=\frac{1}{5}<\frac{1}{2}.</math><br />
  
 
The Bayes classifier assigns the new student into the class <math>\, h^*(x)=0 </math>. Therefore, we predict that the new student would fail the course.
 
The Bayes classifier assigns the new student into the class <math>\, h^*(x)=0 </math>. Therefore, we predict that the new student would fail the course.
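The calculation above can be reproduced with a short sketch. The likelihoods 0.05 and 0.2 and the prior 0.5 are the values used in the example; the function names are assumptions introduced here.

```python
def posterior_y1(lik_y1, lik_y0, prior_y1):
    """Return P(Y=1 | X=x) for two classes using Bayes' rule."""
    prior_y0 = 1.0 - prior_y1
    evidence = lik_y1 * prior_y1 + lik_y0 * prior_y0
    return lik_y1 * prior_y1 / evidence


def bayes_rule(r_hat):
    """Two-class Bayes classification rule: 1 if r_hat > 1/2, else 0."""
    return 1 if r_hat > 0.5 else 0


# P(X=(0,1,0) | Y=1) = 0.05, P(X=(0,1,0) | Y=0) = 0.2, P(Y=1) = 0.5
r_hat = posterior_y1(lik_y1=0.05, lik_y0=0.2, prior_y1=0.5)
label = bayes_rule(r_hat)
print(r_hat, label)  # r_hat is approximately 0.2, so the label is 0
```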
 +
 +
'''Naive Bayes Classifier:'''
 +
 +
The naive Bayes classifier is a special (simpler) case of the Bayes classifier. It uses an extra assumption: that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. This assumption allows for an easier likelihood function <math>\,f_y(x)</math> in the equation:
 +
:<math>
 +
\begin{align}
 +
P(Y=y|X=x) &=\frac{f_y(x)\pi_y}{\sum_{i \in \mathcal{Y}} f_i(x)\pi_i}
 +
\end{align}
 +
</math>
 +
The simpler form of the likelihood function used in the naive Bayes classifier is:
 +
:<math>
 +
\begin{align}
 +
f_y(x) = P(X=x|Y=y) = {\prod_{i=1}^{n} P(X_{i}=x_{i}|Y=y)}
 +
\end{align}
 +
</math>
 +
The Bayes classifier taught in class was not the naive Bayes classifier.
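As a rough sketch of the naive Bayes likelihood above, assume three binary features whose per-feature conditional probabilities are given. All numbers and function names below are hypothetical and are not taken from the course example.

```python
from math import prod


def naive_likelihood(x, p_feature_is_1):
    """P(X=x | Y=y) as a product of per-feature probabilities.

    p_feature_is_1[i] is the assumed value of P(X_i = 1 | Y = y).
    """
    return prod(p if xi == 1 else 1.0 - p for xi, p in zip(x, p_feature_is_1))


def naive_bayes_posteriors(x, cond_probs, priors):
    """Return P(Y=y | X=x) for every class y."""
    joint = {y: naive_likelihood(x, cond_probs[y]) * priors[y] for y in priors}
    evidence = sum(joint.values())
    return {y: joint[y] / evidence for y in joint}


# Hypothetical P(X_i = 1 | Y = y) for each class y and each of 3 features.
cond_probs = {0: [0.5, 0.5, 0.2], 1: [0.8, 0.5, 0.9]}
priors = {0: 0.5, 1: 0.5}

post = naive_bayes_posteriors((0, 1, 0), cond_probs, priors)
prediction = max(post, key=post.get)
print(post, prediction)
```

Because of the independence assumption, only one probability per feature and class has to be estimated, rather than one probability per combination of feature values.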
  
 
=== Bayesian vs. Frequentist ===
 
=== Bayesian vs. Frequentist ===
Line 180: Line 231:
 
The [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian] view of probability and the [http://en.wikipedia.org/wiki/Frequency_probability frequentist] view of probability are the two major schools of thought in the field of statistics regarding how to interpret the probability of an event.  
 
The [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian] view of probability and the [http://en.wikipedia.org/wiki/Frequency_probability frequentist] view of probability are the two major schools of thought in the field of statistics regarding how to interpret the probability of an event.  
  
The Bayesian view of probability states that, for any event E, event E has a [http://en.wikipedia.org/wiki/Prior_probability prior probability] that represents how believable event E would occur prior to knowing anything about any other event whose occurrence could have an impact on event E's occurrence. Theoretically, this prior probability is a ''belief'' that represents the baseline probability for event E's occurrence. In practice, however, event E's prior probability is unknown, and therefore it must either be guessed at or be estimated using a sample of available data. After obtaining a guessed or estimated value of event E's prior probability, the Bayesian view holds that the probability, that is, the believability of event E's occurrence can always be made more accurate should any new information regarding events that are relevant to event E become available. The Bayesian view also holds that the accuracy for the estimate of the probability of event E's occurrence is higher as long as there are more useful information available regarding events that are relevant to event E. The Bayesian view therefore holds that there is no ''intrinsic'' probability of occurrence associated with any event. If one adherers to the Bayesian view, one can then, for instance, predict tomorrow's weather as having a probability of, say, <math>\,50%</math> for rain. The Bayes classifier as described above is a good example of a classifier developed from the Bayesian view of probability. The earliest works that lay the framework for the Bayesian view of probability is accredited to [http://en.wikipedia.org/wiki/Thomas_Bayes Thomas Bayes] (1702–1761).
+
 
 +
The Bayesian view of probability states that, for any event E, event E has a [http://en.wikipedia.org/wiki/Prior_probability prior probability] that represents how believable event E's occurrence is prior to knowing anything about any other event whose occurrence could have an impact on event E's occurrence. Theoretically, this prior probability is a ''belief'' that represents the baseline probability for event E's occurrence. In practice, however, event E's prior probability is unknown, and therefore it must either be guessed at or be estimated using a sample of available data. After obtaining a guessed or estimated value of event E's prior probability, the Bayesian view holds that the probability, that is, the believability of event E's occurrence, can always be made more accurate should any new information regarding events that are relevant to event E become available. The Bayesian view also holds that the estimate of the probability of event E's occurrence becomes more accurate as more useful information regarding events that are relevant to event E becomes available. The Bayesian view therefore holds that there is no ''intrinsic'' probability of occurrence associated with any event. If one adheres to the Bayesian view, one can then, for instance, predict tomorrow's weather as having a probability of, say, <math>\,50\%</math> for rain. The Bayes classifier as described above is a good example of a classifier developed from the Bayesian view of probability. The earliest work that laid the framework for the Bayesian view of probability is credited to [http://en.wikipedia.org/wiki/Thomas_Bayes Thomas Bayes] (1702–1761).
 +
 
  
 
In contrast to the Bayesian view of probability, the frequentist view of probability holds that there is an ''intrinsic'' probability of occurrence associated with every event to which one can carry out many, if not an infinite number, of well-defined [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] [http://en.wikipedia.org/wiki/Random random] [http://en.wikipedia.org/wiki/Experiments trials]. In each trial for an event, the event either occurs or it does not occur. Suppose  
 
In contrast to the Bayesian view of probability, the frequentist view of probability holds that there is an ''intrinsic'' probability of occurrence associated with every event to which one can carry out many, if not an infinite number, of well-defined [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] [http://en.wikipedia.org/wiki/Random random] [http://en.wikipedia.org/wiki/Experiments trials]. In each trial for an event, the event either occurs or it does not occur. Suppose  
<math>n_x</math> denotes the number of times that an event occurs during its trials and <math>n_t</math> denotes the total number of trials carried out for the event. The frequentist view of probability holds that, in the ''long run'', where the number of trials for an event approaches infinity, one could theoretically approach the intrinsic value of the event's probability of occurrence to any arbitrary degree of accuracy, i.e., :<math>P(x) = \lim_{n_t\rightarrow \infty}\frac{n_x}{n_t}.</math>. In practice, however, one can only carry out a finite number of trials for an event and, as a result, the probability of the event's occurrence can only be approximated as <math>P(x) \approx \frac{n_x}{n_t}</math>.If one adherers to the frequentist view, one cannot, for instance, predict the probability that there would be rain tomorrow, and this is because one cannot possibly carry out trials on any event that is set in the future. The founder of the frequentist school of thought is arguably the famous Greek philosopher [http://en.wikipedia.org/wiki/Aristotle Aristotle]. In his work [http://en.wikipedia.org/wiki/Rhetoric_%28Aristotle%29 ''Rhetoric''], Aristotle gave the famous line "'''''the probable is that which for the most part happens'''''"<ref name="aristorhetor">''Rhetoric'' Bk 1 Ch 2; discussed in J. Franklin, ''The Science of Conjecture: Evidence and Probability Before Pascal'' (2001), The Johns Hopkins University Press. ISBN 0801865697 , p. 110.</ref>.
+
<math>n_x</math> denotes the number of times that an event occurs during its trials and <math>n_t</math> denotes the total number of trials carried out for the event. The frequentist view of probability holds that, in the ''long run'', where the number of trials for an event approaches infinity, one could theoretically approach the intrinsic value of the event's probability of occurrence to any arbitrary degree of accuracy, i.e., <math>P(x) = \lim_{n_t\rightarrow \infty}\frac{n_x}{n_t}</math>. In practice, however, one can only carry out a finite number of trials for an event and, as a result, the probability of the event's occurrence can only be approximated as <math>P(x) \approx \frac{n_x}{n_t}</math>. If one adheres to the frequentist view, one cannot, for instance, predict the probability that there would be rain tomorrow. This is because one cannot possibly carry out trials for any event that is set in the future. The founder of the frequentist school of thought is arguably the famous Greek philosopher [http://en.wikipedia.org/wiki/Aristotle Aristotle]. In his work [http://en.wikipedia.org/wiki/Rhetoric_%28Aristotle%29 ''Rhetoric''], Aristotle gave the famous line "'''''the probable is that which for the most part happens'''''".
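The long-run relative frequency can be illustrated with a small simulation. This is only a sketch: the "intrinsic" probability 0.3, the seed, and the trial counts are arbitrary assumptions, and a pseudo-random generator stands in for real independent trials.

```python
import random


def relative_frequency(p_true, n_trials, seed=0):
    """Estimate P(x) as n_x / n_t from n_trials simulated independent trials."""
    rng = random.Random(seed)
    n_x = sum(1 for _ in range(n_trials) if rng.random() < p_true)
    return n_x / n_trials


# The estimate n_x / n_t tends to settle near p_true as n_t grows.
estimates = {n: relative_frequency(0.3, n) for n in (100, 10_000, 100_000)}
for n_t, estimate in estimates.items():
    print(n_t, estimate)
```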
 +
 
  
 
More information regarding the Bayesian and the frequentist schools of thought is available [http://www.statisticalengineering.com/frequentists_and_bayesians.htm here]. Furthermore, an interesting and informative YouTube video that explains the Bayesian and frequentist views of probability is available [http://www.youtube.com/watch?v=hLKOKdAircA here].
 
More information regarding the Bayesian and the frequentist schools of thought is available [http://www.statisticalengineering.com/frequentists_and_bayesians.htm here]. Furthermore, an interesting and informative YouTube video that explains the Bayesian and frequentist views of probability is available [http://www.youtube.com/watch?v=hLKOKdAircA here].
 +
 +
There is useful information about machine learning, neural, and statistical classification at [http://www.amsta.leeds.ac.uk/~charles/statlog/] (''Machine Learning, Neural and Statistical Classification''): Chapter 2 describes classification, Chapter 3 covers classical statistical methods, and Chapter 4 covers modern statistical techniques.
 +
 +
=== Extension: Statistical Classification Framework ===
 +
 +
In statistical classification, each object is represented by a measurement vector of <math>\,d</math> features, and the goal of the classifier becomes finding compact and disjoint regions for the classes in a d-dimensional feature space. Such decision regions are defined by decision rules that are known or can be trained. The simplest configuration of a classifier consists of a decision rule and multiple membership functions, where each membership function represents a class. The following figures illustrate this general framework.
 +
 +
[[File:cs1.png]]
 +
 +
Simple Conceptual Classifier.
 +
 +
[[File:cs2.png]]
 +
 +
[http://www.orfeo-toolbox.org/SoftwareGuide/SoftwareGuidech17.html#x44-2480011 Statistical Classification Framework]
 +
 +
 +
The classification process can be described as follows:
 +
 +
A measurement vector is input to each membership function.
 +
Membership functions feed the membership scores to the decision rule.
 +
A decision rule compares the membership scores and returns a class label.
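The three steps above can be sketched as follows. The membership function used here (negative squared distance to a class centre) and the two class centres are hypothetical choices made only to keep the example small; any scoring function could take their place.

```python
def make_membership(center):
    """Build a hypothetical membership function for one class.

    The score is the negative squared Euclidean distance to the class
    centre, so a higher score means a stronger membership.
    """
    def score(x):
        return -sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return score


def decision_rule(scores):
    """Compare the membership scores and return the label with the highest one."""
    return max(scores, key=scores.get)


# One membership function per class.
memberships = {
    "a": make_membership((0.0, 0.0)),
    "b": make_membership((3.0, 3.0)),
}


def classify(x):
    # 1) feed the measurement vector to each membership function,
    # 2) collect the membership scores,
    # 3) let the decision rule pick a class label.
    scores = {label: f(x) for label, f in memberships.items()}
    return decision_rule(scores)


print(classify((0.5, 0.2)), classify((2.5, 3.1)))
```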
  
 
== '''Linear and Quadratic Discriminant Analysis'''  ==
 
== '''Linear and Quadratic Discriminant Analysis'''  ==
First, we shall limit ourselves to the case where there are two classes, i.e. <math>\, \mathcal{Y}=\{0, 1\}</math>. In the above discussion, we introduced the Bayes classifier's ''decision boundary'' <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>, which represents a separating [http://en.wikipedia.org/wiki/Hyperplane hyperplane] that determines the class of any new data input, depending on which side of the decision boundary it lies in. Now, we shall look at how to derive the Bayes classifier's decision boundary under certain assumptions of the data. [http://en.wikipedia.org/wiki/Linear_discriminant_analysis Linear discriminant analysis (LDA)] and [http://en.wikipedia.org/wiki/Quadratic_classifier#Quadratic_discriminant_analysis quadratic discriminant analysis (QDA)] are two of the most well-known ways for deriving the Bayes classifier's decision boundary, and we shall look at each of them in turn.
+
 
 +
===Introduction===
 +
'''Linear discriminant analysis''' ([http://en.wikipedia.org/wiki/Linear_discriminant_analysis LDA]) and the related '''Fisher's linear discriminant''' are methods used in statistics, pattern recognition and machine learning to find a linear combination of features which characterize or separate two or more classes of objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification.
 +
 
 +
LDA is also closely related to principal component analysis ([http://en.wikipedia.org/wiki/Principal_component_analysis PCA]) and [http://en.wikipedia.org/wiki/Factor_analysis factor analysis] in that both look for linear combinations of variables which best explain the data. LDA explicitly attempts to model the difference between the classes of data. PCA on the other hand does not take into account any difference in class, and factor analysis builds the feature combinations based on differences rather than similarities. Discriminant analysis is also different from factor analysis in that it is not an interdependence technique: a distinction between independent variables and dependent variables (also called criterion variables) must be made.
 +
 
 +
LDA works when the measurements made on independent variables for each observation are continuous quantities. When dealing with categorical independent variables, the equivalent technique is '''discriminant correspondence analysis'''.
 +
 
 +
=== Content ===
 +
First, we shall limit ourselves to the case where there are two classes, i.e. <math>\, \mathcal{Y}=\{0, 1\}</math>. In the above discussion, we introduced the Bayes classifier's ''decision boundary'' <math>\,D(h^*)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>, which represents a [http://en.wikipedia.org/wiki/Hyperplane hyperplane] that determines the class of any new data input depending on which side of the hyperplane the input lies in. Now, we shall look at how to derive the Bayes classifier's decision boundary under certain assumptions of the data. [http://en.wikipedia.org/wiki/Linear_discriminant_analysis Linear discriminant analysis (LDA)] and [http://en.wikipedia.org/wiki/Quadratic_classifier#Quadratic_discriminant_analysis quadratic discriminant analysis (QDA)] are two of the most well-known ways for deriving the Bayes classifier's decision boundary, and we shall look at each of them in turn.
  
 
Let us denote the likelihood <math>\ P(X=x|Y=y) </math> as <math>\ f_y(x) </math> and the prior probability <math>\ P(Y=y) </math> as <math>\ \pi_y </math>.
 
Let us denote the likelihood <math>\ P(X=x|Y=y) </math> as <math>\ f_y(x) </math> and the prior probability <math>\ P(Y=y) </math> as <math>\ \pi_y </math>.
  
First, we shall examine LDA. As explained above, the Bayes classifier is optimal. However, in practice, the prior and conditional densities are not known. Under LDA, one gets around this problem by making the assumptions that both of the two classes are have [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal (Gaussian) distribution]s and also the two classes have the same covariance matrix, <math>\, \Sigma</math>. Under the assumptions of LDA, we have: <math>\ P(X=x|Y=y) = f_y(x) = \frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) \right)</math>. Now, to derive the Bayes classifier's decision boundary using LDA, we equate <math>\, P(Y=1|X=x) </math> and <math>\, P(Y=0|X=x) </math> and proceed from there. The derivation of the decision boundary is as follows:
+
First, we shall examine LDA. As explained above, the Bayes classifier is optimal. However, in practice, the prior and conditional densities are not known. Under LDA, one gets around this problem by making the assumptions that both of the two classes have [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal (Gaussian) distributions] and the two classes have the same covariance matrix <math>\, \Sigma</math>. Under the assumptions of LDA, we have: <math>\ P(X=x|Y=y) = f_y(x) = \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_y)^\top \Sigma^{-1} (x - \mu_y) \right)</math>, where <math>\,\mu_y</math> is the mean of class <math>\,y</math> and <math>\,d</math> is the dimension of <math>\,x</math>. Now, to derive the Bayes classifier's decision boundary using LDA, we equate <math>\, P(Y=1|X=x) </math> to <math>\, P(Y=0|X=x) </math> and proceed from there. The derivation of <math>\,D(h^*)</math> is as follows:
  
 
:<math>\,Pr(Y=1|X=x)=Pr(Y=0|X=x)</math>
 
:<math>\,Pr(Y=1|X=x)=Pr(Y=0|X=x)</math>
Line 206: Line 290:
 
\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0)  \right)=0</math> (canceling out alike terms and factoring).
 
\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0)  \right)=0</math> (canceling out alike terms and factoring).
  
:<math>\,\Rightarrow  -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0)=0</math> (multiplying both sides by -2)
+
It is easy to see that, under LDA, the Bayes classifier's decision boundary <math>\,D(h^*)</math> has the form <math>\,a^\top x+b=0</math>, where <math>\,a</math> is a constant vector and <math>\,b</math> is a constant scalar, so it is linear in <math>\,x</math>. This is where the word ''linear'' in linear discriminant analysis comes from.
  
<math>\,  -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0)=0</math> is the Bayes classifier's decision boundary in the two-classes case. This decision boundary is linear in <math>\ x</math>, i.e., it is a hyperplane of the form <math>\,ax+b=0</math> where ''a'' and ''b'' are constants. Here, ''a'' <math>\, = - 2\Sigma^{-1}(\mu_1-\mu_0)</math> and ''b'' <math>\, = -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0</math>.
 
  
Not surprisingly, the Bayes's classifier's decision boundary being linear in <math>\ x</math> under the assumptions of LDA; this is where the word ''linear'' in linear discriminant analysis comes from.
+
LDA under the two-classes case can easily be generalized to the general case where there are <math>\,k \ge 2</math> classes. In the general case, suppose we wish to find the Bayes classifier's decision boundary between the two classes <math>\,m </math> and <math>\,n</math>, then all we need to do is follow a derivation very similar to the one shown above, except with the classes <math>\,1 </math> and <math>\,0</math> being replaced by the classes <math>\,m </math> and <math>\,n</math>. Following through with a similar derivation as the one shown above, one obtains the Bayes classifier's decision boundary <math>\,D(h^*)</math> between classes <math>\,m </math> and <math>\,n</math> to be <math>\,\log(\frac{\pi_m}{\pi_n})-\frac{1}{2}\left(  \mu_m^\top\Sigma^{-1}
 +
\mu_m-\mu_n^\top\Sigma^{-1}\mu_n - 2x^\top\Sigma^{-1}(\mu_m-\mu_n)  \right)=0</math>. In addition, for any two classes <math>\,m </math> and <math>\,n</math> for which we would like to find the Bayes classifier's decision boundary using LDA, if <math>\,m </math> and <math>\,n</math> both have the same number of data points (so that their estimated priors are equal), then, in this special case, the resulting decision boundary passes through the point exactly halfway between the centers (means) of <math>\,m </math> and <math>\,n</math>.
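The decision boundary between classes <math>\,m</math> and <math>\,n</math> can be checked numerically with a minimal sketch. The means, the shared covariance matrix, and the priors below are hypothetical, the example is restricted to two dimensions so that the matrix inverse can be written by hand, and the helper names are assumptions introduced here.

```python
from math import log


def inv2(S):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = S
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matvec(M, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in M]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def lda_boundary(mu_m, mu_n, sigma, pi_m, pi_n):
    """Return (w, c) so that the LDA boundary is w.x + c = 0.

    w = Sigma^{-1} (mu_m - mu_n)
    c = log(pi_m / pi_n) - 1/2 (mu_m' Sigma^{-1} mu_m - mu_n' Sigma^{-1} mu_n)
    A positive value of w.x + c favours class m.
    """
    s_inv = inv2(sigma)
    w = matvec(s_inv, [a - b for a, b in zip(mu_m, mu_n)])
    c = log(pi_m / pi_n) - 0.5 * (
        dot(mu_m, matvec(s_inv, mu_m)) - dot(mu_n, matvec(s_inv, mu_n))
    )
    return w, c


# Hypothetical class means, shared covariance matrix, and equal priors.
mu_m, mu_n = [2.0, 0.0], [0.0, 2.0]
sigma = [[2.0, 0.5], [0.5, 1.0]]
w, c = lda_boundary(mu_m, mu_n, sigma, pi_m=0.5, pi_n=0.5)

# With equal priors the midpoint of the two means lies on the boundary.
midpoint = [(a + b) / 2 for a, b in zip(mu_m, mu_n)]
g_mid = dot(w, midpoint) + c
print(w, c, g_mid)
```

With unequal priors the term <code>log(pi_m / pi_n)</code> is no longer zero, so the boundary shifts away from the midpoint toward the mean of the class with the smaller prior.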
  
  
LDA under the two-classes case can easily be generalized to the general case where there are <math>\,k \ge 2</math> classes. In the general case, suppose we wish to find the Bayes classifier's decision boundary between the two classes <math>\,m </math> and <math>\,n</math>, then all we need to do is follow a derivation very similar to the one shown above, except with the classes <math>\,1 </math> and <math>\,0</math> being replaced by the classes <math>\,m </math> and <math>\,n</math>. Following through with a similar derivation as the one shown above, one obtains the Bayes classifier's decision boundary between classes <math>\,m </math> and <math>\,n</math> to be <math>\,  -2\log(\frac{\pi_m}{\pi_n})+\mu_m^\top\Sigma^{-1}\mu_m-\mu_n^\top\Sigma^{-1}\mu_n - 2x^\top\Sigma^{-1}(\mu_m-\mu_n)=0</math>. For any two classes <math>\,m </math> and <math>\,n</math> for whom we would like to find the Bayes classifier's decision boundary using LDA, if <math>\,m </math> and <math>\,n</math> both have the same number of data, then, in this special case, the resulting decision boundary would lie exactly halfway between <math>\,m </math> and <math>\,n</math>.
+
The Bayes classifier's decision boundary for any two classes as derived using LDA looks something like the one that can be found in [http://www.outguess.org/detection.php this link]:
  
  
The Bayes classifier's decision boundary for any two classes as derived using LDA looks something like the one that can be found in [http://www.outguess.org/detection.php this link]:
+
Although the assumptions under LDA may not hold true for most real-world data, LDA nevertheless usually performs quite well in practice, where it often provides near-optimal classifications. For instance, the Z-Score credit risk model that was designed by Edward Altman in 1968 and [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000] is essentially a weighted LDA. This model has demonstrated an 85-90% success rate in predicting bankruptcy, and for this reason it is still in use today.
 +
 
  
 +
According to [http://www.lsv.uni-saarland.de/Vorlesung/Digital_Signal_Processing/Summer06/dsp06_chap9.pdf this link], some of the limitations of LDA include:
  
Although the assumption under LDA may not hold true for most real-world data, it nevertheless usually performs quite well in practice, where it often provides near-optimal classifications. For instance, the Z-Score credit risk model that was designed by Edward Altman in 1968 and [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000], is essentially a weighted LDA. This model has demonstrated a 85-90% success rate in predicting bankruptcy, and for this reason it is still in use today.
+
* LDA implicitly assumes that the data in each class has a Gaussian distribution.
 +
* LDA implicitly assumes that the mean rather than the variance is the discriminating factor.
 +
* LDA may over-fit the training data.
  
 +
The following link provides a comparison of discriminant analysis and artificial neural networks [http://www.jstor.org/stable/2584434?seq=4]
  
{{Cleanup|date=September 2010|reason=The second and third limitations should be checked for their validity}}
+
====Different Approaches to LDA====
{{Cleanup|date=September 2010|reason=Even though in LDA, we assume equality of covariance matrices of the two classes, it doesn't mean that we do not take into consideration the covariance matrix, and as the resulting decision boundary suggests, the covariance matrix affects the final decision. And about the over fitting problem, this is what every single classifier suffers from, and here is where the generalization capabilities come up and Vapnik-Chernovenkis define their dimension and go on}}
+
Data sets can be transformed and test vectors can be classified in the transformed space by two
 +
different approaches.
  
 +
*Class-dependent transformation: This type of approach involves maximizing the ratio of between
 +
class variance to within class variance. The main objective is to maximize this ratio so that adequate
 +
class separability is obtained. The class-specific type approach involves using two optimizing criteria
 +
for transforming the data sets independently.
  
{{Cleanup|date=October 2010|reason=The first limitation of LDA is something that is not specifically a limitation of LDA. This is a limitation of QDA as well unless we are using Kernel QDA and relaxing this assumption}}
+
*Class-independent transformation: This approach involves maximizing the ratio of overall variance
+
to within class variance. This approach uses only one optimizing criterion to transform the data sets
 +
and hence all data points irrespective of their class identity are transformed using this transform. In
 +
this type of LDA, each class is considered as a separate class against all other classes.
  
Some of the limitations of LDA include:
+
== Further reading  ==
 +
The following are some applications that use LDA and QDA:
  
* LDA implicitly assumes that each class has a Gaussian distribution.
+
1- Linear discriminant analysis for improved large vocabulary continuous speech recognition [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=225984 here]
* LDA implicitly assumes that the mean rather than the variance is the discriminating factor.
 
* LDA may over-fit the training data.
 
  
== '''Linear and Quadratic Discriminant Analysis cont'd - 2010.09.23''' ==
+
2- 2D-LDA: A statistical linear discriminant analysis for image matrix  [http://www.sciencedirect.com/science?_ob=MImg&_imagekey=B6V15-4DK6B5P-4-1&_cdi=5665&_user=1067412&_pii=S0167865504002272&_origin=search&_coverDate=04%2F01%2F2005&_sk=999739994&view=c&wchp=dGLzVlz-zSkzV&md5=60ea1cf7ff045f76421f5bde64bf855a&ie=/sdarticle.pdf here]
  
===Lecture Summary ===
+
3- Regularization studies of linear discriminant analysis in small sample size scenarios with application to face recognition  [http://www.sciencedirect.com/science?_ob=MImg&_imagekey=B6V15-4DTJVF4-2-9&_cdi=5665&_user=1067412&_pii=S0167865504002260&_origin=search&_coverDate=01%2F15%2F2005&_sk=999739997&view=c&wchp=dGLzVtb-zSkzk&md5=1bba55e357b1c79579987638dcbf6828&ie=/sdarticle.pdf here]
  
In the second lecture, Professor Ali Ghodsi recapitulates that by calculating the class posteriors <math>\Pr(Y=k|X=x)</math> we have optimal classification. He also shows that by assuming that the classes have a common covariance matrix <math>\Sigma_{k}=\Sigma \forall k </math>, the decision boundary between classes <math>k</math> and <math>l</math> is linear (LDA). However, if we do not assume the same covariance for the two classes, the decision boundary is a quadratic function (QDA). 
+
4- The sparse discriminant vectors are useful for supervised dimension reduction for high dimensional data.
 +
Naive application of classical Fisher’s LDA to high dimensional, low sample size settings suffers from the data piling problem.  In [http://www.iaeng.org/IJAM/issues_v39/issue_1/IJAM_39_1_06.pdf] the authors use a sparse LDA method, which selects important variables for discriminant analysis and thereby
 +
yields improved classification. Introducing sparsity in the discriminant vectors is very effective in eliminating data piling and the associated overfitting
 +
problem.
  
The following [http://www.mathworks.com/help/toolbox/stats/classify.html MATLAB examples] can be used to demonstrate LDA and QDA.
+
== '''Linear and Quadratic Discriminant Analysis cont'd - September 23, 2010''' ==
  
 
===LDA x QDA===
 
===LDA x QDA===
  
Linear discriminant analysis[http://en.wikipedia.org/wiki/Linear_discriminant_analysis] is a statistical method used to find the ''linear combination'' of features which best separate two or more classes of objects or events. It is widely applied in classifying diseases, positioning, product management, and marketing research.  
+
Linear discriminant analysis[http://en.wikipedia.org/wiki/Linear_discriminant_analysis] is a statistical method used to find the ''linear combination'' of features which best separates two or more classes of objects or events. It is widely applied in classifying diseases, positioning, product management, and marketing research. LDA assumes that the different classes have the same covariance matrix <math>\, \Sigma</math>.
 +
 
 +
Quadratic Discriminant Analysis[http://en.wikipedia.org/wiki/Quadratic_classifier], on the other hand, aims to find the ''quadratic combination'' of features. It is more general than linear discriminant analysis. Unlike LDA, QDA does not make the assumption that the different classes have the same covariance matrix <math>\, \Sigma</math>. Instead, QDA makes the assumption that each class <math>\, k</math> has its own covariance matrix <math>\, \Sigma_k</math>.
 +
 
 +
The derivation of the Bayes classifier's decision boundary <math>\,D(h^*)</math> under QDA is similar to that under LDA. Again, let us first consider the two-classes case where <math>\, \mathcal{Y}=\{0, 1\}</math>. This derivation is given as follows:
 +
 
 +
:<math>\,Pr(Y=1|X=x)=Pr(Y=0|X=x)</math>
 +
:<math>\,\Rightarrow \frac{Pr(X=x|Y=1)Pr(Y=1)}{Pr(X=x)}=\frac{Pr(X=x|Y=0)Pr(Y=0)}{Pr(X=x)}</math> (using Bayes' Theorem)
 +
:<math>\,\Rightarrow Pr(X=x|Y=1)Pr(Y=1)=Pr(X=x|Y=0)Pr(Y=0)</math> (canceling the denominators)
 +
:<math>\,\Rightarrow f_1(x)\pi_1=f_0(x)\pi_0</math>
 +
:<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma_1|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma_1^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{ (2\pi)^{d/2}|\Sigma_0|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma_0^{-1} (x - \mu_0) \right)\pi_0</math>
 +
:<math>\,\Rightarrow \frac{1}{|\Sigma_1|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma_1^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{|\Sigma_0|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma_0^{-1} (x - \mu_0) \right)\pi_0</math> (by cancellation)
 +
:<math>\,\Rightarrow -\frac{1}{2}\log(|\Sigma_1|)-\frac{1}{2} (x - \mu_1)^\top \Sigma_1^{-1} (x - \mu_1)+\log(\pi_1)=-\frac{1}{2}\log(|\Sigma_0|)-\frac{1}{2} (x - \mu_0)^\top \Sigma_0^{-1} (x - \mu_0)+\log(\pi_0)</math> (by taking the log of both sides)
 +
:<math>\,\Rightarrow  \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\log(\frac{|\Sigma_1|}{|\Sigma_0|})-\frac{1}{2}\left(  x^\top\Sigma_1^{-1}x + \mu_1^\top\Sigma_1^{-1}\mu_1 - 2x^\top\Sigma_1^{-1}\mu_1 - x^\top\Sigma_0^{-1}x - \mu_0^\top\Sigma_0^{-1}\mu_0 + 2x^\top\Sigma_0^{-1}\mu_0  \right)=0</math> (by expanding out)
 +
:<math>\,\Rightarrow  \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\log(\frac{|\Sigma_1|}{|\Sigma_0|})-\frac{1}{2}\left(  x^\top(\Sigma_1^{-1}-\Sigma_0^{-1})x + \mu_1^\top\Sigma_1^{-1}\mu_1 - \mu_0^\top\Sigma_0^{-1}\mu_0 - 2x^\top(\Sigma_1^{-1}\mu_1-\Sigma_0^{-1}\mu_0)  \right)=0</math>
 +
 
 +
It is easy to see that, under QDA, the decision boundary <math>\,D(h^*)</math> has the form <math>\,x^\top ax+b^\top x+c=0</math> (which reduces to <math>\,ax^2+bx+c=0</math> when <math>\,x</math> is one-dimensional) and it is quadratic in <math>\,x</math>. This is where the word ''quadratic'' in quadratic discriminant analysis comes from.
  
Quadratic Discriminant Analysis[http://en.wikipedia.org/wiki/Quadratic_classifier], on the other hand, aims to find the ''quadratic combination'' of features. It is more general than Linear discriminant analysis. Unlike LDA however, in QDA there is no assumption that the covariance of each of the classes is identical.
+
As is the case with LDA, QDA under the two-classes case can easily be generalized to the general case where there are <math>\,k \ge 2</math> classes. In the general case, suppose we wish to find the Bayes classifier's decision boundary between the two classes <math>\,m </math> and <math>\,n</math>, then all we need to do is follow a derivation very similar to the one shown above, except with the classes <math>\,1 </math> and <math>\,0</math> being replaced by the classes <math>\,m </math> and <math>\,n</math>. Following through with a similar derivation as the one shown above, one obtains the Bayes classifier's decision boundary <math>\,D(h^*)</math> between classes <math>\,m </math> and <math>\,n</math> to be <math>\,\log(\frac{\pi_m}{\pi_n})-\frac{1}{2}\log(\frac{|\Sigma_m|}{|\Sigma_n|})-\frac{1}{2}\left(  x^\top(\Sigma_m^{-1}-\Sigma_n^{-1})x + \mu_m^\top\Sigma_m^{-1}\mu_m - \mu_n^\top\Sigma_n^{-1}\mu_n - 2x^\top(\Sigma_m^{-1}\mu_m-\Sigma_n^{-1}\mu_n)  \right)=0</math>.
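
As a rough illustration of the two-class boundary derived above, the following MATLAB sketch reads the quadratic, linear and constant coefficients off the last line of the derivation and evaluates the resulting function at a point. All numerical values (<code>mu1</code>, <code>Sigma1</code>, <code>pi1</code>, and so on) and the names <code>A</code>, <code>b</code>, <code>c</code>, <code>g</code> are made up for illustration only; they are not from the lecture.

```matlab
% Hypothetical parameters for two classes in d = 2 dimensions (invented values).
mu1 = [1; 2];   Sigma1 = [2 0.5; 0.5 1];   pi1 = 0.4;
mu0 = [0; 0];   Sigma0 = [1 0; 0 1];       pi0 = 0.6;

% Coefficients of the boundary x'*A*x + b'*x + c = 0, taken term by term
% from the last line of the derivation.
A = -0.5 * (inv(Sigma1) - inv(Sigma0));
b = Sigma1 \ mu1 - Sigma0 \ mu0;
c = log(pi1 / pi0) - 0.5 * log(det(Sigma1) / det(Sigma0)) ...
    - 0.5 * (mu1' * (Sigma1 \ mu1) - mu0' * (Sigma0 \ mu0));

% g(x) > 0 favours class 1, g(x) < 0 favours class 0, g(x) = 0 is the boundary.
g = @(x) x' * A * x + b' * x + c;

x = [1; 1];                 % point to classify
if g(x) > 0
    label = 1;
else
    label = 0;
end
fprintf('g(x) = %.4f, assigned class %d\n', g(x), label);
```

If the two covariance matrices are set equal, <code>A</code> becomes the zero matrix and the boundary is linear, which is the LDA case.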
  
 
===Summarizing LDA and QDA===
 
===Summarizing LDA and QDA===
Line 258: Line 372:
  
 
Suppose that <math>\,Y \in \{1,\dots,K\}</math>, if <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, the Bayes Classifier rule is
 
Suppose that <math>\,Y \in \{1,\dots,K\}</math>, if <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, the Bayes Classifier rule is
:<math>\,h(x) = \arg\max_{k} \delta_k(x)</math>  
+
:<math>\,h^*(x) = \arg\max_{k} \delta_k(x)</math>  
where   
+
where,  
:::<math> \,\delta_k  = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + log (\pi_k) </math>   (quadratic)
+
* In the case of LDA, which assumes that a common covariance matrix is shared by all classes, <math> \,\delta_k(x) = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + \log (\pi_k) </math>, and the Bayes classifier's decision boundary <math>\,D(h^*)</math> is linear in <math>\,x</math>.
 
 
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>.  
 
  
If the covariance of the Gaussians are the same, this becomes
+
* In the case of QDA, which assumes that each class has its own covariance matrix, <math> \,\delta_k(x)  = - \frac{1}{2}\log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + \log (\pi_k) </math>, and the Bayes classifier's decision boundary <math>\,D(h^*)</math> is quadratic in <math>\,x</math>.
  
:::<math> \,\delta_k  = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + log (\pi_k) </math>  (linear)
 
  
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math>returns the set of k for which <math>\,\delta_k(x)</math> attains its largest value.
+
'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the set of k for which <math>\,\delta_k(x)</math> attains its largest value.
  
 
[http://www.stat.cmu.edu/~larry/=stat707/notes10.pdf See Theorem 46.6 Page 133]
 
[http://www.stat.cmu.edu/~larry/=stat707/notes10.pdf See Theorem 46.6 Page 133]
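
The linear form of <math>\,\delta_k</math> can be recovered from the quadratic form by a short expansion. As a worked step (not taken from the lecture itself), set <math>\,\Sigma_k=\Sigma</math> for every class <math>\,k</math>:

:<math>\begin{align}
\delta_k(x) &= -\frac{1}{2}\log(|\Sigma|) - \frac{1}{2}(x-\mu_k)^\top\Sigma^{-1}(x-\mu_k) + \log(\pi_k)\\
&= \underbrace{-\frac{1}{2}\log(|\Sigma|) - \frac{1}{2}x^\top\Sigma^{-1}x}_{\text{same for every } k} + x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + \log(\pi_k)
\end{align}</math>

The terms under the brace do not depend on <math>\,k</math>, so they do not change which class attains the maximum. Dropping them leaves the linear form of <math>\,\delta_k</math>.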
  
 
===In practice===
 
===In practice===
We need to estimate the prior, so in order to do this, we use the sample estimates of <math>\,\pi,\mu_k,\Sigma_k</math> in place of the true values, i.e.
+
Because the true parameters are unknown in practice, we use the Maximum Likelihood estimates from the sample for <math>\,\pi_k,\mu_k,\Sigma_k</math> in place of their true values, i.e.
 
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]]  
 
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]]  
  
Line 282: Line 393:
 
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math>
 
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math>
  
 +
Common covariance, denoted <math>\Sigma</math>, is defined as the weighted average of the covariance for each class.
  
Common covariance is defined by the average sample covariance.
+
In the case where we need a common covariance matrix, we get the estimate using the following equation:
  
In the case where we have a common covariance matrix, we get the ML estimate to be
+
<math>\,\Sigma=\frac{\sum_{r=1}^{k}(n_r\Sigma_r)}{\sum_{l=1}^{k}(n_l)} </math>
  
<math>\,\Sigma=\frac{\sum_{r=1}^{k}(n_r\Sigma_r)}{\sum_{l=1}^{k}(n_l)} </math>
+
where <math>\,n_r</math> is the number of data points in class r, <math>\,\Sigma_r</math> is the covariance of class r, <math>\,n=\sum_{l=1}^{k}n_l</math> is the total number of data points (the denominator above), and
 +
<math>\,k</math> is the number of classes.
  
This is a Maximum Likelihood estimate.
+
See the details about the [http://en.wikipedia.org/wiki/Estimation_of_covariance_matrices estimation of covariance matrices].
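
The estimates above can be sketched in a few lines of MATLAB. This is only an illustrative sketch: the toy data <code>X</code>, the labels <code>y</code>, and the variable names are invented here, and the prior estimate <math>\,n_k/n</math> and the class means are the usual Maximum Likelihood choices, assumed rather than quoted from the lecture.

```matlab
% Toy data (invented): rows are observations, y holds class labels 1..K.
X = [1.0 2.0; 1.5 1.8; 2.0 2.2; 6.0 5.5; 6.5 6.0; 7.0 5.0; 6.2 5.8];
y = [1; 1; 1; 2; 2; 2; 2];

K = max(y);
[n, d] = size(X);
prior  = zeros(K, 1);
mu     = zeros(K, d);
SigmaK = zeros(d, d, K);
SigmaPooled = zeros(d, d);

for k = 1:K
    Xk = X(y == k, :);                   % observations in class k
    nk = size(Xk, 1);
    prior(k) = nk / n;                   % estimate of pi_k
    mu(k, :) = mean(Xk, 1);              % estimate of mu_k
    Xc = Xk - repmat(mu(k, :), nk, 1);   % centre the class
    SigmaK(:, :, k) = (Xc' * Xc) / nk;   % ML estimate: divides by n_k, not n_k - 1
    SigmaPooled = SigmaPooled + nk * SigmaK(:, :, k);
end
SigmaPooled = SigmaPooled / n;           % weighted average of the class covariances
```

Note that MATLAB's built-in <code>cov</code> divides by <math>\,n_k-1</math> by default, so it gives a slightly different estimate than the formula above.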
  
===Computation===
+
===Computation For QDA And LDA===
  
 +
First, let us consider QDA, and examine each of the following two cases.
  
 
'''Case 1: (Example) <math>\, \Sigma_k = I </math>
 
'''Case 1: (Example) <math>\, \Sigma_k = I </math>
Line 298: Line 412:
 
[[File:case1.jpg|300px|thumb|right]]  
 
[[File:case1.jpg|300px|thumb|right]]  
  
This means that the data is distributed symmetrically around the center <math>\mu</math>, i.e. the isocontours are all circles.
+
<math>\, \Sigma_k = I </math> for every class <math>\,k</math> implies that our data is spherical. This means that the data of each class <math>\,k</math> is distributed symmetrically around the center <math>\,\mu_k</math>, i.e. the isocontours are all circles.
  
 
We have:
 
We have:
Line 304: Line 418:
 
<math> \,\delta_k  = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math>
 
<math> \,\delta_k  = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math>
  
We see that the first term in the above equation, <math>\,\frac{1}{2}log(|I|)</math>, is zero since <math>\ |I| </math> is the determine and <math>\ |I|=1 </math>. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the [http://www.improvedoutcomes.com/docs/WebSiteDocs/Clustering/Clustering_Parameters/Euclidean_and_Euclidean_Squared_Distance_Metrics.htm squared Euclidean distance] between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can find the distance between a point and each center and adjust it with the log of the prior, <math>\,log(\pi_k)</math>. The class that has the minimum distance will maximise <math>\,\delta_k</math>. According to the theorem, we can then classify the point to a specific class <math>\,k</math>.  In addition, <math>\, \Sigma_k = I </math> implies that our data is spherical.
+
We see that the first term in the above equation, <math>\,-\frac{1}{2}\log(|I|)</math>, is zero since <math>\ |I|=1 </math>. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the [http://www.improvedoutcomes.com/docs/WebSiteDocs/Clustering/Clustering_Parameters/Euclidean_and_Euclidean_Squared_Distance_Metrics.htm squared Euclidean distance] between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can find the squared distance between a point and each center and adjust it with the log of the prior, <math>\,\log(\pi_k)</math>. The class <math>\,k</math> that minimizes <math>\,\frac{1}{2}(x-\mu_k)^\top(x-\mu_k)-\log(\pi_k)</math> maximizes <math>\,\delta_k</math>, and according to the theorem we then classify the point to that class. When the priors are equal, this is simply the class with the nearest center.
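
A minimal MATLAB sketch of Case 1 follows, assuming identity covariance for every class. The centers, priors and test point are hypothetical values chosen for illustration.

```matlab
% Hypothetical inputs (invented values), identity covariance assumed for every class.
mu    = [0 0; 3 3];        % class centers, one per row
prior = [0.7; 0.3];        % class priors
x     = [1.6 1.6];         % point to classify

K = size(mu, 1);
delta = zeros(K, 1);
for k = 1:K
    diff_k   = x - mu(k, :);
    % delta_k = -0.5 * squared Euclidean distance + log prior  (log|I| = 0)
    delta(k) = -0.5 * (diff_k * diff_k') + log(prior(k));
end
[~, label] = max(delta);   % class with the largest delta_k
```

With these particular numbers the point is closer to the second center, yet the larger prior of the first class gives it the larger <math>\,\delta_k</math>, so <code>label</code> is 1. This shows why the distance must be adjusted by the log of the prior.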
  
  
Line 311: Line 425:
 
We can decompose this as:
 
We can decompose this as:
  
<math> \, \Sigma_k = USV^\top = USU^\top </math> (In general when <math>\,X=USV^\top</math>, <math>\,U</math> is the eigenvectors of <math>\,XX^T</math> and <math>\,V</math> is the eigenvectors of <math>\,X^\top X</math>.  
+
<math> \, \Sigma_k = U_kS_kV_k^\top = U_kS_kU_k^\top </math> (In general when <math>\,X_k=U_kS_kV_k^\top</math>, the columns of <math>\,U_k</math> are the eigenvectors of <math>\,X_kX_k^\top</math> and the columns of <math>\,V_k</math> are the eigenvectors of <math>\,X_k^\top X_k</math>. 
So if <math>\, X</math>  is symmetric. we will have <math>\, U=V</math>. Here <math>\, \Sigma </math> is symmetric ,because it is the covariance matrix of <math> X </math>)
+
So if <math>\, X_k</math> is symmetric and positive semi-definite, we will have <math>\, U_k=V_k</math>. Here <math>\, \Sigma_k </math> is symmetric and positive semi-definite, because it is the covariance matrix of <math> X_k </math>) and the inverse of <math>\,\Sigma_k</math> is
 
 
and the inverse of <math>\,\Sigma_k</math> is
 
  
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)
+
<math> \, \Sigma_k^{-1} = (U_kS_kU_k^\top)^{-1} = (U_k^\top)^{-1}S_k^{-1}U_k^{-1} = U_kS_k^{-1}U_k^\top </math> (since <math>\,U_k</math> is orthogonal, i.e. <math>\,U_k^{-1}=U_k^\top</math>)
  
 
So from the formula for <math>\,\delta_k</math>, the second term is
 
So from the formula for <math>\,\delta_k</math>, the second term is
  
 
:<math>\begin{align}
 
:<math>\begin{align}
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^T(x-\mu_k)\\
+
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top U_kS_k^{-1}U_k^T(x-\mu_k)\\
& = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\
+
& = (U_k^\top x-U_k^\top\mu_k)^\top S_k^{-1}(U_k^\top x-U_k^\top \mu_k)\\
& = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\
+
& = (U_k^\top x-U_k^\top\mu_k)^\top S_k^{-\frac{1}{2}}S_k^{-\frac{1}{2}}(U_k^\top x-U_k^\top\mu_k) \\
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\
+
& = (S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top\mu_k)^\top I(S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top \mu_k) \\
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\
+
& = (S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top\mu_k)^\top(S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top \mu_k) \\
 
\end{align}
 
\end{align}
 
</math>
 
</math>
  
where we have the Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.
+
where we have the squared Euclidean distance between <math> \, S_k^{-\frac{1}{2}}U_k^\top x </math> and <math>\, S_k^{-\frac{1}{2}}U_k^\top\mu_k</math>.
 +
 
 +
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S_k^{-\frac{1}{2}}U_k^\top x </math>.
  
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.
+
A similar transformation of all the centers can be done from <math>\,\mu_k</math> to <math>\,\mu_k^*</math> where <math> \, \mu_k^* \leftarrow S_k^{-\frac{1}{2}}U_k^\top \mu_k </math>.
  
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.
+
It is now possible to do classification with <math>\,x^*</math> and <math>\,\mu_k^*</math>, treating them as in Case 1 above. This strategy is correct because, by transforming <math>\, x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S_k^{-\frac{1}{2}}U_k^\top x </math>, the covariance of the transformed variable becomes <math>I</math>.
  
Note that when we have multiple classes, they must all have the same transformation, else, ahead of time we would have to assume a data point belongs to one class or the other. All classes therefore need to have the same shape for classification to be applicable using this method.  So this method works for LDA.
+
Note that for QDA with multiple classes, we also need to compute <math>\, \log{|\Sigma_k|}</math> for each class, and then combine it with the squared distance and <math>\,\log(\pi_k)</math> to obtain <math> \,\delta_k </math>.
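
The claim that the transformed variable has identity covariance can be checked directly. As a worked step (added here for clarity), for <math>\,x</math> drawn from class <math>\,k</math> with covariance <math>\,\Sigma_k = U_kS_kU_k^\top</math>, and using <math>\,\operatorname{Cov}(Ax) = A\operatorname{Cov}(x)A^\top</math>:

:<math>\begin{align}
\operatorname{Cov}(x^*) &= S_k^{-\frac{1}{2}}U_k^\top \,\Sigma_k\, U_kS_k^{-\frac{1}{2}}\\
&= S_k^{-\frac{1}{2}}U_k^\top (U_kS_kU_k^\top) U_kS_k^{-\frac{1}{2}}\\
&= S_k^{-\frac{1}{2}} S_k S_k^{-\frac{1}{2}} = I
\end{align}</math>

The same decomposition also gives the log-determinant cheaply. Writing <math>\,s_{k,i}</math> for the <math>\,i</math>-th diagonal entry of <math>\,S_k</math> (a symbol introduced here only for this step):

:<math>\log(|\Sigma_k|) = \log(|U_kS_kU_k^\top|) = \log(|S_k|) = \sum_{i=1}^{d}\log(s_{k,i})</math>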
 +
 
 +
Note that if a single transformation is to be applied to all data points once, then all classes must share that transformation, in other words, have the same covariance <math>\,\Sigma_k</math>; otherwise, ahead of time we would have to assume a data point belongs to one class or the other. All classes therefore need to have the same shape for classification to be applicable using a single transformation.  So the single-transformation method works for LDA.
  
 
If the classes have different shapes, in other words, have different covariance <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?
 
If the classes have different shapes, in other words, have different covariance <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?
  
The answer is NO. Consider that you have two classes with different shapes, then consider transforming them to the same shape. Given a data point, justify which class this point belongs to. The question is, which transformation can you use? For example, if you use the transformation of class A, then you have assumed that this data point belongs to class A.
+
The answer is Yes. Consider two classes with different shapes and a data point that we wish to classify. We apply the transformation of each class to the data point in turn, compute the corresponding <math>\,\delta_1</math> and <math>\,\delta_2</math>, and then assign the data point to the class with the larger of <math> \,\delta_1 </math> and <math> \,\delta_2 </math>.
  
[http://portal.acm.org/citation.cfm?id=1340851 Kernel QDA]
+
In summary, to apply QDA on a data set <math>\,X</math>, in the general case where <math>\, \Sigma_k \ne I </math> for each class <math>\,k</math>, one can proceed as follows:
In real life, QDA is always better fit the data then LDA because QDA relaxes does not have the assumption made by LDA that the covariance matrix for each class is identical. However, QDA still assumes that the class conditional distribution is Gaussian which is not the case in real-life practice. Another method-kernel QDA does not have the Gaussian distribution assumption and it works better.
 
  
===The Number of Parameters in LDA and QDA===
+
:: Step 1: For each class <math>\,k</math>, apply singular value decomposition to the estimated covariance matrix <math>\,\hat{\Sigma_k}</math> of the data <math>\,X_k</math> in that class to obtain <math>\,S_k</math> and <math>\,U_k</math>.
  
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.
+
:: Step 2: For each class <math>\,k</math>, transform each data point <math>\,x</math> to <math>\,x_k^* = S_k^{-\frac{1}{2}}U_k^\top x</math>, and transform the class center <math>\,\mu_k</math> to <math>\,\mu_k^* = S_k^{-\frac{1}{2}}U_k^\top \mu_k</math>.
  
LDA: Since we just need to compare the differences between one given class and remaining <math>\,K-1</math> classes, totally, there are <math>\,K-1</math> differences. For each of them, <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.
+
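:: Step 3: For each data point <math>\,x \in X</math>, find the squared Euclidean distance between the transformed data point <math>\,x_k^*</math> and the transformed center <math>\,\mu_k^*</math> of each class <math>\,k</math>, compute <math>\,\delta_k(x) = -\frac{1}{2}\log(|\Sigma_k|) - \frac{1}{2}(x_k^*-\mu_k^*)^\top(x_k^*-\mu_k^*) + \log(\pi_k)</math>, and assign <math>\,x</math> to the class <math>\,k</math> for which <math>\,\delta_k(x)</math> is the largest. If the <math>\,\log(|\Sigma_k|)</math> and <math>\,\log(\pi_k)</math> terms are the same for all classes, this is the class for which the squared Euclidean distance between <math>\,x_k^*</math> and <math>\,\mu_k^*</math> is the least.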
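
The QDA procedure summarized in the steps above can be sketched in MATLAB as follows. This is a sketch under stated assumptions rather than the lecture's own code: the class parameters are invented, the decomposition is applied to each class covariance matrix, and the <math>\,\log(|\Sigma_k|)</math> and <math>\,\log(\pi_k)</math> terms noted earlier are included so that the result equals <math>\,\delta_k</math>.

```matlab
% Hypothetical estimates (invented values) for K = 2 classes in d = 2 dimensions.
mu    = {[0; 0], [3; 3]};
Sigma = {[1 0; 0 1], [4 1; 1 2]};
prior = [0.5, 0.5];
x     = [2; 1];                           % point to classify

K = numel(mu);
delta = zeros(K, 1);
for k = 1:K
    [U, S, ~] = svd(Sigma{k});            % Sigma_k = U*S*U' (symmetric, positive definite)
    W  = diag(1 ./ sqrt(diag(S))) * U';   % W = S^(-1/2) * U'
    xs = W * x;                           % transformed point for class k
    ms = W * mu{k};                       % transformed center for class k
    dist2  = sum((xs - ms) .^ 2);         % squared Euclidean distance after the transformation
    logdet = sum(log(diag(S)));           % log|Sigma_k| from the singular values
    delta(k) = -0.5 * logdet - 0.5 * dist2 + log(prior(k));
end
[~, label] = max(delta);                  % class with the largest delta_k
```

Each class uses its own transformation <code>W</code>, so no class membership has to be assumed in advance: the point is transformed once per class and the resulting <math>\,\delta_k</math> values are compared.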
  
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.
 
  
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]
+
Now, let us consider LDA.  
 +
Here, one can derive a classification scheme that is quite similar to that shown above. The main difference is the assumption of a common covariance matrix across the classes, so we perform the singular value decomposition once, as opposed to once per class.
  
== Trick: Using LDA to do QDA ==
+
To apply LDA on a data set <math>\,X</math>, one can proceed as follows:
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.
+
 
 +
:: Step 1: Apply singular value decomposition to the estimated common covariance matrix <math>\,\Sigma</math> of the data set <math>\,X</math> to obtain <math>\,S</math> and <math>\,U</math>.
 +
 
 +
:: Step 2: For each <math>\,x \in X</math>, transform <math>\,x</math> to <math>\,x^* = S^{-\frac{1}{2}}U^\top x</math>, and transform each class center <math>\,\mu_k</math> to <math>\,\mu_k^* = S^{-\frac{1}{2}}U^\top \mu_k</math>.
 +
 
 +
:: Step 3: For each data point <math>\,x \in X</math>, find the squared Euclidean distance between the transformed data point <math>\,x^*</math> and the transformed center <math>\,\mu_k^*</math> of each class <math>\,k</math>, and assign <math>\,x</math> to the class for which <math>\,-\frac{1}{2}(x^*-\mu_k^*)^\top(x^*-\mu_k^*) + \log(\pi_k)</math> is the largest. When the priors are equal, this is the class whose transformed center is nearest to <math>\,x^*</math>.
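
A corresponding MATLAB sketch for LDA is given below. It assumes a common covariance matrix, applies one transformation to all points and centers, and adjusts the squared distance by the log prior. All values are hypothetical, and the final lines are an added sanity check against the linear form of <math>\,\delta_k</math> given in the summary.

```matlab
% Hypothetical inputs (invented values): K = 3 classes, d = 2.
mu    = [0 0; 3 3; 0 4];          % class centers, one per row
Sigma = [2 0.5; 0.5 1];           % common covariance matrix
prior = [0.3; 0.3; 0.4];          % class priors
X     = [1 1; 2.5 3; 0.2 3.5];    % points to classify, one per row

% Steps 1 and 2: one decomposition and one transformation for all classes.
[U, S, ~] = svd(Sigma);
W   = diag(1 ./ sqrt(diag(S))) * U';   % W = S^(-1/2) * U'
Xs  = X  * W';                         % row i is (W * x_i)'
Mus = mu * W';                         % row k is (W * mu_k)'

% Step 3: squared distance to each transformed center, adjusted by the log prior.
n = size(X, 1);
K = size(mu, 1);
delta = zeros(n, K);
for k = 1:K
    D = Xs - repmat(Mus(k, :), n, 1);
    delta(:, k) = -0.5 * sum(D .^ 2, 2) + log(prior(k));
end
[~, labels] = max(delta, [], 2);

% Sanity check: the linear form x'*inv(Sigma)*mu_k - 0.5*mu_k'*inv(Sigma)*mu_k + log(pi_k)
% differs from delta only by a term that is constant across classes.
quadTerm = 0.5 * sum((mu / Sigma) .* mu, 2)';   % 1-by-K
deltaLin = X * (Sigma \ mu') - repmat(quadTerm, n, 1) + repmat(log(prior)', n, 1);
[~, labelsLin] = max(deltaLin, [], 2);
disp(isequal(labels, labelsLin));
```

The last line compares the labels from the two formulations; they should agree because the two scores differ only by <math>\,-\frac{1}{2}x^\top\Sigma^{-1}x</math>, which is the same for every class.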
 +
 
 +
 
 +
[http://portal.acm.org/citation.cfm?id=1340851 Kernel QDA]
 +
In actual data scenarios, QDA will often provide a better classifier for the data than LDA because QDA does not assume that the covariance matrix for each class is identical, as LDA does. However, QDA still assumes that the class conditional distribution is Gaussian, which is not always the case in real-life scenarios. The link provided at the beginning of this paragraph describes a kernel-based QDA method which does not have the Gaussian distribution assumption.
 +
 
 +
===The Number of Parameters in LDA and QDA===
 +
 
 +
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.
 +
 
 +
LDA: We only need to compare one given class with each of the remaining <math>\,K-1</math> classes, so there are <math>\,K-1</math> differences in total. For each of them, <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.
 +
 
 +
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.
 +
 
 +
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]
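
The two counts can be tabulated with a short MATLAB sketch. The choice of <code>K = 3</code> and the range of <code>d</code> are arbitrary values picked for illustration.

```matlab
K = 3;                 % hypothetical number of classes
d = 1:10;              % data dimensions to tabulate

ldaParams = (K - 1) .* (d + 1);                   % (K-1)(d+1)
qdaParams = (K - 1) .* (d .* (d + 3) ./ 2 + 1);   % (K-1)(d(d+3)/2 + 1)

% Columns: d, number of LDA parameters, number of QDA parameters.
disp([d' ldaParams' qdaParams']);
```

The LDA count grows linearly in <math>\,d</math> while the QDA count grows quadratically, which is the behaviour shown in the plot above.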
 +
 
 +
===More information on Regularized Discriminant Analysis (RDA)===
 +
Discriminant analysis (DA) is widely used in classification problems. Besides LDA and QDA, there is also an intermediate method between the two, a regularized version of discriminant analysis (RDA) proposed by Friedman [1989], which has been shown to be more flexible in dealing with various class distributions. RDA applies regularization techniques using two regularization parameters, which are selected to jointly maximize the classification performance. The optimal pair of parameters is commonly estimated via cross-validation from a set of candidate pairs. More detail about this method can be found in the book by Hastie et al. [2001]. However, the computation takes a long time for high dimensional data, especially when the candidate set is large, which limits the application of RDA to low dimensional data. In 2006, Ye Jieping and Wang Tie developed a novel algorithm for RDA for high dimensional data. It can estimate the optimal regularization parameters from a large set of parameter candidates efficiently. Experiments on a variety of datasets confirm the claimed theoretical estimate of the efficiency, and also show that, for a properly chosen pair of regularization parameters, RDA performs favourably in classification, in comparison with other existing classification methods. For more details, see Ye, Jieping; Wang, Tie
 +
Regularized discriminant analysis for high dimensional, low sample size data Conference on Knowledge Discovery in Data: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining; 20-23 Aug. 2006
 +
 
 +
===Further Reading for Regularized Discriminant Analysis (RDA)===
 +
 
 +
1. Regularized Discriminant Analysis and Reduced-Rank LDA
 +
[http://www.stat.psu.edu/~jiali/course/stat597e/notes2/lda2.pdf]
 +
 
 +
2. Regularized discriminant analysis for the small sample size in face recognition
 +
[http://www.google.ca/url?sa=t&source=web&cd=2&sqi=2&ved=0CCQQFjAB&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.84.6960%26rep%3Drep1%26type%3Dpdf&rct=j&q=Regularized%20Discriminant%20Analysis&ei=IPr2TJ_2MKWV4gaP5eH-Bg&usg=AFQjCNHB3fk6eVe5fSjlQCMfK44kU1-lug&sig2=5EJv_AV3W_ngSVFIa1nfRg&cad=rja.pdf]
 +
 
 +
3. Regularized Discriminant Analysis and Its Application in Microarrays
 +
[http://www-stat.stanford.edu/~hastie/Papers/RDA-6.pdf]
 +
 
 +
== Trick: Using LDA to do QDA - September 28, 2010==
 +
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.
  
 
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that just contain our original data projected to that dimension. We then do LDA on our new higher-dimensional data. The answer provided by LDA can then be collapsed onto a lower dimension, giving us a quadratic answer.
 
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that just contain our original data projected to that dimension. We then do LDA on our new higher-dimensional data. The answer provided by LDA can then be collapsed onto a lower dimension, giving us a quadratic answer.
Line 369: Line 521:
 
Suppose we can estimate some vector <math>\underline{w}^T</math> such that
 
Suppose we can estimate some vector <math>\underline{w}^T</math> such that
  
<math>y = \underline{w}^Tx</math>
+
<math>y = \underline{w}^T\underline{x}</math>
  
where <math>\underline{w}</math> is a d-dimensional column vector, and <math style="vertical-align:0%;">x\ \epsilon\ \mathbb{R}^d</math> (vector in d dimensions).
+
where <math>\underline{w}</math> is a d-dimensional column vector, and <math style="vertical-align:0%;">\underline{x}\ \in\ \mathbb{R}^d</math> (vector in d dimensions).
  
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math> that we cannot estimate.
+
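We also have a non-linear function <math>g(x) = y = \underline{x}^Tv\underline{x} + \underline{w}^T\underline{x}</math>, where <math>v</math> is taken to be a diagonal <math>d \times d</math> matrix with diagonal entries <math>v_1,\dots,v_d</math>, that we cannot estimate directly with a linear method.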
  
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:
+
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,\underline{x}^*</math> such that:
  
 
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math>
 
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math>
Line 381: Line 533:
 
and
 
and
  
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math>
+
<math>\underline{x}^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math>
  
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.
+
We can then estimate a new function, <math>g^*(\underline{x},\underline{x}^2) = y^* = \underline{w}^{*T}\underline{x}^*</math>.
  
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix with the original matrix squared, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a <math>\,sin(x)</math> dimension. Pay attention, We don't do QDA with LDA. if we try QDA directly on this problem the resulting decision boundary will be different. Here we try to find a nonlinear boundary for a better possible boundary but it is different with general QDA method. We can call it nonlinear LDA.
+
Note that we can do this for any <math>\, x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix with the original matrix squared, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a <math>\,\sin(x)</math> dimension. Note, however, that this is not the same as doing QDA. If we apply QDA directly to this problem, the resulting decision boundary will be different. Here we look for a nonlinear boundary that separates the classes better, but the method differs from the general QDA method. We can call it nonlinear LDA.
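
The augmentation itself takes one line in MATLAB. The sketch below uses invented data and weights, and it stores one observation per row, which is an assumption made here; the name <code>X_star</code> follows the example in the next section.

```matlab
X = [1 2; 3 4; 5 6];            % hypothetical n-by-d data, one observation per row
X_star = [X, X .^ 2];           % append the squared features: n-by-2d

% A rule that is linear in X_star is quadratic in the original features.
w_star = [0.5; -1; 0.25; 2];    % hypothetical weights [w_1; w_2; v_1; v_2]
y_star = X_star * w_star;       % one value of w_star' * x_star per observation
```

Other functions of the original features, such as <code>X .^ 3</code> or <code>sin(X)</code>, can be appended in the same way.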
  
 
=== By Example ===
 
=== By Example ===
Line 425: Line 577:
 
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.
 
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.
  
===Working Example - Diabetes Data Set===

Let's take a look at a specific data set. This is a [http://archive.ics.uci.edu/ml/datasets/Diabetes diabetes data set] from the UC Irvine Machine Learning Repository. It is a fairly small data set by today's standards. The original data had eight variables. Here, the first two principal components were obtained from these eight variables, and we use these two components instead of the original eight dimensions in this example.

The diabetes data set contains two types of samples: healthy individuals, and individuals with a higher risk of diabetes. Here are the estimated prior probabilities for the two types, first for the healthy individuals and second for the individuals at risk:

[[File:eq1.png]]

The first type has an estimated prior probability of 0.651. This means that among the data set (250 to 300 data points), about 65% of the points belong to class one and the other 35% belong to class two. Next, we computed the mean vector for each of the two classes separately:

[[File:eq2.png]]

Then we computed [[File:eq3.jpg]] using the formulas discussed earlier.

Once we have done all of this, we compute the linear discriminant function and obtain the classification rule:

[[File:eq4.jpg]]

In this example, given an <math>\, x</math>, we plug its value into the above linear function. If the result is greater than or equal to zero, we assign it to class one. Otherwise, we assign it to class two.

Below is a scatter plot of the two dominant principal components. Both classes are shown: the first, without diabetes, with red stars (class 1), and the second, with diabetes, with blue circles (class 2). The solid line represents the classification boundary obtained by LDA. It appears that the two classes are not that well separated. The dashed or dotted line is the boundary obtained by linear regression of an indicator matrix. In this case, the two linear boundaries are very close.

[[File:eq5.jpg]]

It is always good practice to visualize the scheme to check for any obvious mistakes.

* Within training data classification error rate: 28.26%.
* Sensitivity: 45.90%.
* Specificity: 85.60%.

Below is the contour plot for the density of the diabetes data (the marginal density for <math>\, x</math> is a mixture of two Gaussians, one per class). It looks like a single Gaussian distribution. The reason for this is that the two classes are so close together that they merge into a single mode.

[[File:eq6.jpg]]
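
The steps of this working example can be sketched in Matlab as follows. This is only a minimal illustration under stated assumptions: the data are synthetic stand-ins rather than the diabetes data, all variable names are our own, the pooled covariance estimate is the standard one for two-class LDA, and class 2 (at risk) is taken as the "positive" class when computing sensitivity and specificity.

 >> X1 = randn(130,2); X2 = randn(70,2) + 1.5;               % synthetic stand-ins for class 1 and class 2
 >> n1 = size(X1,1); n2 = size(X2,1);
 >> p1 = n1/(n1+n2); p2 = n2/(n1+n2);                        % estimated prior probabilities
 >> mu1 = mean(X1)'; mu2 = mean(X2)';                        % class mean vectors
 >> Sigma = ((n1-1)*cov(X1) + (n2-1)*cov(X2)) / (n1+n2-2);   % pooled covariance estimate
 >> a = Sigma \ (mu1 - mu2);                                 % coefficients of the linear discriminant
 >> b = -0.5*(mu1 + mu2)'*a + log(p1/p2);                    % constant term
 >> X = [X1; X2]; y = [ones(n1,1); 2*ones(n2,1)];
 >> yhat = 2 - (X*a + b >= 0);                               % class 1 if the linear function is >= 0, else class 2
 >> err = mean(yhat ~= y)                                    % within-training-data error rate
 >> sens = mean(yhat(y==2) == 2)                             % fraction of class 2 points assigned to class 2
 >> spec = mean(yhat(y==1) == 1)                             % fraction of class 1 points assigned to class 1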
 +
 
 +
=== LDA and QDA in Matlab ===

We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.

In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below applies LDA and QDA to the same data set, reproducing that example with slight modifications, and each step is explained.

 >> load 2_3;
 >> [U, sample] = princomp(X');
 >> sample = sample(:,1:2);

:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.
  
 
   
 
   
 >> plot (sample(1:200,1), sample(1:200,2), '.');
 >> hold on;
 >> plot (sample(201:400,1), sample(201:400,2), 'r.');
Line 455: Line 636:
 
 >> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');

:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the label that the algorithm assigns to each data point, and <code>coeff</code>, which contains information about the line that the algorithm created to separate the data into the two classes.

:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.
Line 463: Line 644:
 
     369

:This compares the values in <code>class</code> to the values in <code>group</code>. The answer of 369 tells us that the algorithm correctly classified 369 of the 400 data points. This gives us an ''empirical error rate'' of 0.0775.
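
:As a quick check of the arithmetic, the empirical error rate is the fraction of training points that were misclassified:

::<math>\frac{400 - 369}{400} = \frac{31}{400} = 0.0775</math>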
  
 
:We can see the line produced by LDA using <code>coeff</code>.
 
Line 485: Line 666:
 
 >> l = coeff(1,2).linear;
 >> q = coeff(1,2).quadratic;
 >> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x*y+%g*y^2', k, l(1), l(2), q(1,1), q(1,2)+q(2,1), q(2,2));
 >> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);
  
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that QDA classifies only 2 more data points correctly than LDA; we can see a blue point and a red point that lie on the correct side of the curve produced by QDA but not on the correct side of the line produced by LDA.]]
  
 
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.
 
  
 
'''Recall: An analysis of the function of <code>princomp</code> in Matlab.'''
<br />In assignment 1, we learned how to perform Principal Component Analysis using the SVD (Singular Value Decomposition) method. In fact, Matlab offers a built-in function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html&http://www.google.cn/search?hl=zh-CN&q=mathwork+princomp&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= <code>princomp</code>] which performs PCA. The Matlab help file on <code>princomp</code> gives the details of this function. Here we analyze Matlab's <code>princomp()</code> code and point out where it differs from the SVD method we used in our first assignment. The following is Matlab's code for <code>princomp</code>, followed by explanations of some key steps.
  
 
     function [pc, score, latent, tsquare] = princomp(x);
 
Line 528: Line 709:
 
     tsquare = sum(tmp.*tmp)';
 
  
We should compare the following aspects of the above code with the SVD method:

First, rows of <math>\,X</math> correspond to observations and columns to variables. When using <code>princomp</code> on the 2_3 data in assignment 1, note that we take the transpose of <math>\,X</math>.
Line 538: Line 719:
 
Third, when <math>\,X=UdV'</math>, <code>princomp</code> uses <math>\,V</math> as the coefficients for the principal components, rather than <math>\,U</math>.

The following example performs PCA using both <code>princomp</code> and SVD and obtains the same result.

:SVD method
   >> load 2_3
Line 551: Line 732:
 
Then we can see that y=score, v=U.

'''Useful resources:'''

LDA and QDA in Matlab[http://www.mathworks.com/products/statistics/demos.html?file=/products/demos/shipping/stats/classdemo.html],[http://www.mathworks.com/matlabcentral/fileexchange/189],[http://seed.ucsd.edu/~cse190/media07/MatlabClassificationDemo.pdf]
  
Line 577: Line 758:
 
[http://www.uni-leipzig.de/~strimmer/lab/courses/ss06/seminar/slides/daniela-2x4.pdf LDA & QDA]
 
  
 +
Using discriminant analysis for multi-class classification: an experimental investigation [http://www.springerlink.com/content/6851416084227k8p/fulltext.pdf]

===Reference articles on solving a small sample size problem when LDA is applied===

(Based on Li-Fen Chen, Hong-Yuan Mark Liao, Ming-Tat Ko, Ja-Chen Lin, Gwo-Jong Yu, A new LDA-based face recognition system which can solve the small sample size problem, Pattern Recognition 33 (2000) 1713-1726.)

A small sample size means that the number of samples is smaller than the dimension of each sample. In this case, the within-class covariance matrix we discussed in class can be singular, so its inverse cannot be computed for further analysis. Researchers have tried to address this problem with several different techniques:<br />
1. Goudail et al. proposed a technique which calculated 25 local autocorrelation coefficients from each sample image to achieve dimensionality reduction. (Referenced by F. Goudail, E. Lange, T. Iwamoto, K. Kyuma, N. Otsu, Face recognition system using local autocorrelations and multiscale integration, IEEE Trans. Pattern Anal. Mach. Intell. 18 (10) (1996) 1024-1028.)<br />
2. Swets and Weng applied the PCA approach to reduce the dimensionality of the images. (Referenced by D. Swets, J. Weng, Using discriminant eigen features for image retrieval, IEEE Trans. Pattern Anal. Mach. Intell. 18 (8) (1996) 831-836.)<br />
3. Fukunaga proposed a more efficient algorithm and calculated eigenvalues and eigenvectors from an m*m matrix, where n is the dimensionality of the samples and m is the rank of the within-class scatter matrix Sw. (Referenced by K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, New York, 1990.)<br />
4. Tian et al. used a positive pseudoinverse matrix instead of calculating the inverse of the matrix Sw. (Referenced by Q. Tian, M. Barbero, Z.H. Gu, S.H. Lee, Image classification by the Foley-Sammon transform, Opt. Eng. 25 (7) (1986) 834-840.)<br />
5. Hong and Yang added a singular value perturbation to Sw to make Sw a nonsingular matrix. (Referenced by Zi-Quan Hong, Jing-Yu Yang, Optimal discriminant plane for a small number of samples and design method of classifier on the plane, Pattern Recognition 24 (4) (1991) 317-324.)<br />
6. Cheng et al. proposed another method based on the principle of rank decomposition of matrices. The above three methods are all based on the conventional Fisher's criterion function. (Referenced by Y.Q. Cheng, Y.M. Zhuang, J.Y. Yang, Optimal fisher discriminant analysis using the rank decomposition, Pattern Recognition 25 (1) (1992) 101-111.)<br />
7. Liu et al. modified the conventional Fisher's criterion function and conducted a number of studies based on the new criterion function. They used the total scatter matrix as the divisor of the original Fisher's function instead of merely using the within-class scatter matrix. (Referenced by K. Liu, Y. Cheng, J. Yang, A generalized optimal set of discriminant vectors, Pattern Recognition 25 (7) (1992) 731-739.)
 +
 +
==Principal Component Analysis - September 30, 2010==

===Brief introduction on dimension reduction method===

Dimension reduction is the process of reducing the number of variables in the data. [http://en.wikipedia.org/wiki/Principal_component_analysis Principal components analysis] (PCA) and factor analysis are two primary classical methods for dimension reduction. PCA creates new variables as linear combinations of the variables in the data, and the number of new variables retained depends on what proportion of the variance they explain. In contrast, factor analysis tries to express the old variables as linear combinations of new variables (factors), so the number of factors must be determined first by analyzing the features of the old variables. In general, the idea of both PCA and factor analysis is to use as few combined variables as possible to capture as much information as possible.
 
 
===Rough definition===
 
  
Line 586: Line 785:
  
 
<br />
Principal component analysis (PCA) is a dimensionality-reduction method invented by [http://en.wikipedia.org/wiki/Karl_Pearson Karl Pearson] in 1901 [http://stat.smmu.edu.cn/history/pearson1901.pdf]. Depending on where this methodology is applied, other common names of PCA include the [http://en.wikipedia.org/wiki/Karhunen%E2%80%93Lo%C3%A8ve_theorem Karhunen–Loève transform (KLT)], the [http://en.wikipedia.org/wiki/Harold_Hotelling Hotelling transform], and the proper orthogonal decomposition (POD). PCA is the simplest [http://en.wikipedia.org/wiki/Eigenvector eigenvector]-based [http://en.wikipedia.org/wiki/Multivariate_analysis multivariate analysis]. It reduces the dimensionality of the data by revealing the internal structure of the data in a way that best explains the variance in the data. To this end, PCA projects the data onto a user-defined number of the most important directions of variation (dimensions or '''principal components''') so as to produce a lower-dimensional representation of the original data. The resulting lower-dimensional representation is usually much easier to visualize, and it exhibits the most informative aspects (dimensions) of our data while capturing as much of the variation in the data as possible.

Furthermore, if one considers the lower-dimensional representation produced by PCA as a least-squares fit of our original data, then it can be shown that this representation is the one that minimizes the reconstruction error of our data. It should be noted, however, that one usually does not have control over which dimensions PCA deems to be the most informative for a given set of data, and thus one usually does not know in advance which dimensions PCA will select as the most informative ones when creating the lower-dimensional representation.

Suppose <math>\,X</math> is our data matrix containing <math>\,d</math>-dimensional data, with one data point per column. The idea behind PCA is to apply [http://en.wikipedia.org/wiki/Singular_value_decomposition singular value decomposition] to <math>\,X</math> in order to replace the <math>\,d</math> rows (dimensions) of <math>\,X</math> by a smaller number of derived dimensions that capture as much of the [http://en.wikipedia.org/wiki/Variance variance] in <math>\,X</math> as possible. First, through the application of singular value decomposition to <math>\,X</math>, PCA obtains all of our data's directions of variation. These directions are ordered from left to right, with the leftmost directions capturing the most variation in our data and the rightmost directions capturing the least. Then, PCA uses a subset of these directions to map our data from its original space to a lower-dimensional space.

By applying singular value decomposition to <math>\,X</math>, <math>\,X</math> is decomposed as <math>\,X = U\Sigma V^T \,</math>. The <math>\,d</math> columns of <math>\,U</math> are the [http://en.wikipedia.org/wiki/Eigenvector eigenvectors] of <math>\,XX^T \,</math>. The columns of <math>\,V</math> are the eigenvectors of <math>\,X^TX \,</math>. The <math>\,d</math> diagonal values of <math>\,\Sigma</math> are the square roots of the [http://en.wikipedia.org/wiki/Eigenvalue eigenvalues] of <math>\,XX^T \,</math> (also of <math>\,X^TX \,</math>), and they correspond to the columns of <math>\,U</math> (also of <math>\,V</math>).

We are interested in <math>\,U</math>, whose <math>\,d</math> columns are the <math>\,d</math> directions of variation of our data. Ordered from left to right, the <math>\,i</math>th column of <math>\,U</math> is the <math>\,i</math>th most informative direction of variation of our data. That is, the <math>\,i</math>th column of <math>\,U</math> is the <math>\,i</math>th most effective column in terms of capturing the total variance exhibited by our data. PCA reduces the dimensionality of <math>\,X</math> by projecting <math>\,X</math> onto a subset of the columns of <math>\,U</math>. In practice, when we apply PCA to <math>\,X</math> to reduce its dimensionality from <math>\,d</math> to <math>\,k</math>, where <math>k < d\,</math>, we proceed as follows:

:: Step 1: Center <math>\,X</math> so that it has zero mean.

:: Step 2: Apply singular value decomposition to <math>\,X</math> to obtain <math>\,U</math>.

:: Step 3: Suppose we denote the resulting <math>\,k</math>-dimensional representation of <math>\,X</math> by <math>\,Y</math>. Then, <math>\,Y</math> is obtained as <math>\,Y = U_k^TX</math>. Here, <math>\,U_k</math> consists of the first (leftmost) <math>\,k</math> columns of <math>\,U</math> that correspond to the <math>\,k</math> largest diagonal elements of <math>\,\Sigma</math>.
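
The three steps above can be sketched in Matlab roughly as follows. This is a minimal illustration on synthetic data, not the code used in class; the variable names (<code>Xc</code>, <code>Y</code>, <code>X_hat</code>) and the choice of <code>k = 2</code> are our own assumptions. The data matrix is taken to be <math>d \times n</math> with one data point per column, as in the formula <math>\,Y = U_k^TX</math>.

 >> X = randn(5, 200);                                        % synthetic data: d = 5 dimensions, n = 200 points
 >> Xc = X - repmat(mean(X,2), 1, size(X,2));                 % Step 1: subtract the mean of each dimension
 >> [U, S, V] = svd(Xc, 'econ');                              % Step 2: singular values in S are in decreasing order
 >> k = 2;
 >> Y = U(:,1:k)' * Xc;                                       % Step 3: k-by-n lower-dimensional representation
 >> X_hat = U(:,1:k) * Y + repmat(mean(X,2), 1, size(X,2));   % approximate reconstruction in the original space

The last line maps the reduced data back to <math>\,d</math> dimensions; the difference between <code>X</code> and <code>X_hat</code> is the reconstruction error mentioned above.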
 +
 
 +
 
 +
PCA takes a sample of <math>\, d</math>-dimensional vectors and produces an orthogonal (zero covariance) set of <math>\, d</math> principal components. The first principal component is the direction of greatest variance in the sample. The second principal component is the direction of second greatest variance (orthogonal to the first component), and so on.

Then we can preserve most of the variance in the sample in a lower dimension by choosing the first <math>\, k</math> principal components and approximating the data in <math>\, k</math>-dimensional space, which is easier to analyze and plot.

===Principal Components of handwritten digits===
Suppose that we have a set of 130 images (28 by 23 pixels) of handwritten threes.

We can represent each image as a vector of length 644 (<math>644 = 23 \times 28</math>). Then we can represent the entire data set as a 644 by 130 matrix, shown below. Each column represents one image (644 rows = 644 pixels).

[[File:matrix_decomp_PCA.png]]

Using PCA, we can approximate the data as the product of two smaller matrices, which we will call <math>V \in M_{644,2}</math> and <math>W \in M_{2,130}</math>. If we expand the matrix product then each image is approximated by a linear combination of the columns of V: <math> \hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 </math>, where <math>\lambda = [\lambda_1, \lambda_2]^T</math> is a column of W.

[[File:linear_comb_PCA.png]]

To demonstrate this process, we can compare the images of 2s and 3s. We will apply PCA to the data, and compare the images of the labeled data. This is an example of classification.
  
 
Don't worry about the constant term for now. The point is that we can represent an image using just 2 coefficients instead of 644. Also notice that the coefficients correspond to features of the handwritten digits. The picture below shows the first two principal components for the set of handwritten threes.

[[Image:23plotPCA.jpg]]

The first coefficient represents the width of the entire digit, and the second coefficient represents the slant of each handwritten digit.
  
 
===Derivation of the first Principal Component===
We want to find the direction of maximum variation. Let <math>\textbf{w}</math> be an arbitrary direction, <math>\textbf{x}</math> a data point, and <math>\displaystyle u</math> the length of the projection of <math>\textbf{x}</math> in the direction <math>\textbf{w}</math>.
<br /><br />
 
<math>\begin{align}
 
<math>\begin{align}
Line 626: Line 842:
 
</math>
 
</math>
 
<br /><br />
The direction <math>\textbf{w}</math> is the same as <math>c\textbf{w}</math> for any nonzero scalar <math>c</math>, so without loss of generality we assume that<br>
<br />
<math>
\begin{align}
|\textbf{w}| &= \sqrt{\textbf{w}^T\textbf{w}} = 1 \\
u &= \textbf{w}^T \textbf{x}.
\end{align}
</math>
<br /><br />
Let <math>x_1, \ldots, x_D</math> be random variables. Our goal is then to maximize the variance of <math>u</math>,
<br /><br />
<math>
\textrm{var}(u) = \textrm{var}(\textbf{w}^T \textbf{x}) = \textbf{w}^T \Sigma \textbf{w}.
</math>
<br /><br />
For a finite data set we replace the covariance matrix <math>\Sigma</math> by the sample covariance matrix <math>s</math>, so that
<br /><br />
<math>\textrm{var}(u) = \textbf{w}^T s\textbf{w} </math>
<br /><br />
is the variance of <math>\displaystyle u </math> for a given weight vector <math>\textbf{w}</math>. The first principal component is the vector <math>\textbf{w}</math> that maximizes this variance,
<br /><br />
 
<math>
 
<math>
Line 651: Line 867:
 
</math>
 
</math>
 
<br /><br />
where [http://en.wikipedia.org/wiki/Arg_max arg max] denotes the value of <math>\textbf{w}</math> that maximizes the function. Our goal is to find the weight vector <math>\textbf{w}</math> that maximizes this variability, subject to a constraint. Without a constraint the objective has no maximum, because <math>\textbf{w}^T s \textbf{w}</math> is a convex function of <math>\textbf{w}</math> that grows without bound as the length of <math>\textbf{w}</math> increases. Since we are only interested in the direction of the variability, we restrict <math>\textbf{w}</math> to unit length, and the problem becomes
<br /><br />
<math>
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right)
</math>
<br /><br />
s.t. <math>\textbf{w}^T \textbf{w} = 1.</math>
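
To make the objective concrete, the following sketch checks numerically that the variance of the projected data equals <math>\textbf{w}^T s \textbf{w}</math> for a unit-length <math>\textbf{w}</math>. The data are synthetic and stored with one observation per row (the layout that Matlab's <code>cov</code> expects), and all variable names are our own assumptions.

 >> X = randn(500,3) * diag([3 1 0.5]);    % synthetic data with different spread along each axis
 >> s = cov(X);                            % sample covariance matrix
 >> w = randn(3,1);
 >> w = w / norm(w);                       % enforce the constraint w'*w = 1
 >> u = X * w;                             % projection of every data point onto w
 >> [var(u), w'*s*w]                       % the two entries agree up to rounding error

Repeating the last four lines with different random directions shows how the value of the objective changes with <math>\textbf{w}</math>.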
 
<br /><br />
Notice,<br />
<br />
<math>
\textbf{w}^T s \textbf{w} \leq \| \textbf{w}^T s \textbf{w} \| \leq \| s \| \| \textbf{w} \|^2 =  \| s \|.
</math>
<br /><br />
Line 669: Line 885:
 
====Lagrange Multiplier====

Before we can proceed, we must review Lagrange multipliers.

[[Image:LagrangeMultipliers2D.svg.png|right|thumb|200px|"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]]]
Line 688: Line 904:
  
 
====Example====
 
====Example====
Suppose we wish to maximise the function <math>\displaystyle f(x,y)=x-y</math> subject to the constraint <math>\displaystyle x^{2}+y^{2}=1</math>. We can apply the Lagrange multiplier method on this example; the lagrangian is:
+
Suppose we wish to maximize the function <math>\displaystyle f(x,y)=x-y</math> subject to the constraint <math>\displaystyle x^{2}+y^{2}=1</math>. We can apply the Lagrange multiplier method to this example; the Lagrangian is:
  
 
<math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math>
 
<math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math>
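As a quick numerical sanity check of this example (a sketch only, not part of the lecture), we can parametrize the constraint circle by an angle and compare the best value found with the stationary point of the Lagrangian. Setting the partial derivatives of <math>\displaystyle L</math> to zero gives <math>\displaystyle x = 1/(2\lambda)</math>, <math>\displaystyle y = -1/(2\lambda)</math>, and the constraint then gives <math>\displaystyle \lambda = \pm 1/\sqrt{2}</math>; the variable names below are illustrative.

```matlab
% Parametrize the constraint x^2 + y^2 = 1 by the angle theta
theta = linspace(0, 2*pi, 100001);
x = cos(theta);
y = sin(theta);
f = x - y;                         % objective evaluated along the circle
[fmax, idx] = max(f);
fprintf('grid search:       x = %.4f, y = %.4f, f = %.4f\n', x(idx), y(idx), fmax);

% Stationary point of the Lagrangian for lambda = 1/sqrt(2)
lambda = 1/sqrt(2);
xs = 1/(2*lambda);
ys = -1/(2*lambda);
fprintf('Lagrange solution: x = %.4f, y = %.4f, f = %.4f\n', xs, ys, xs - ys);
```

Both lines should report the same point, with the maximum value equal to <math>\displaystyle \sqrt{2}</math>; the other root, <math>\displaystyle \lambda = -1/\sqrt{2}</math>, gives the minimum.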
Line 711: Line 927:
 
If <math> \textbf{w} </math> is a unit vector, i.e. <math> \textbf{w}^T \textbf{w} = 1 </math>, then the second part of the equation is 0. 
 
If <math> \textbf{w} </math> is a unit vector, i.e. <math> \textbf{w}^T \textbf{w} = 1 </math>, then the second part of the equation is 0. 
  
If <math> \textbf{w}^T  \textbf{w} </math> is not a unit vector the the second part of the equation increases. Thus decreasing overall <math>\displaystyle L(\textbf{w},\lambda)</math>. Maximization happens when <math> \textbf{w}^T \textbf{w} =1 </math>
+
If <math> \textbf{w} </math> is not a unit vector, then the second part of the equation increases, thus decreasing the overall <math>\displaystyle L(\textbf{w},\lambda)</math>. Maximization happens when <math> \textbf{w}^T \textbf{w} =1 </math>.
  
  
Line 724: Line 940:
 
<math>\displaystyle S\textbf{w}^* = \lambda^*\textbf{w}^* </math>
 
<math>\displaystyle S\textbf{w}^* = \lambda^*\textbf{w}^* </math>
 
<br><br />
 
<br><br />
{{Cleanup|date=October 2010|reason=It is good discussion, what will happen if we don't have distinct eigenvalues and eigenvectors? What does this situation mean? }}
+
 
 
From the eigenvalue equation <math>\, \textbf{w}^*</math> is an eigenvector of '''S''' and <math>\, \lambda^*</math> is the corresponding eigenvalue of '''S'''. If we substitute <math>\displaystyle\textbf{w}^*</math> in <math>\displaystyle \textbf{w}^T S\textbf{w}</math> we obtain, <br /><br />
 
From the eigenvalue equation <math>\, \textbf{w}^*</math> is an eigenvector of '''S''' and <math>\, \lambda^*</math> is the corresponding eigenvalue of '''S'''. If we substitute <math>\displaystyle\textbf{w}^*</math> in <math>\displaystyle \textbf{w}^T S\textbf{w}</math> we obtain, <br /><br />
 
<math>\displaystyle\textbf{w}^{*T} S\textbf{w}^* = \textbf{w}^{*T} \lambda^* \textbf{w}^* = \lambda^* \textbf{w}^{*T} \textbf{w}^* = \lambda^* </math>
 
<math>\displaystyle\textbf{w}^{*T} S\textbf{w}^* = \textbf{w}^{*T} \lambda^* \textbf{w}^* = \lambda^* \textbf{w}^{*T} \textbf{w}^* = \lambda^* </math>
 
<br><br />
 
<br><br />
In order to maximize the objective function we choose the eigenvector corresponding to the largest eigenvalue. We choose the first PC, '''u<sub>1</sub>''' to have the maximum variance<br /> (i.e. capturing as much variability in in <math>\displaystyle x_1, x_2,...,x_D </math> as possible.) Subsequent principal components will take up successively smaller parts of the total variability.
+
In order to maximize the objective function we choose the eigenvector corresponding to the largest eigenvalue. We choose the first PC, '''u<sub>1</sub>''' to have the maximum variance<br /> (i.e. capturing as much variability in <math>\displaystyle x_1, x_2,...,x_D </math> as possible.) Subsequent principal components will take up successively smaller parts of the total variability.
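The claim that the top eigenvector maximizes <math>\displaystyle \textbf{w}^T S\textbf{w}</math> over unit vectors can be checked numerically. The following is a minimal sketch on synthetic data (the data, the seed, and all variable names are assumptions made for illustration): it compares the value attained by the leading eigenvector with the values attained by many random unit vectors.

```matlab
rng(0);                                   % fix the seed so the run is repeatable
n = 500;
X = [3 0; 1 0.5] * randn(2, n);           % synthetic 2-by-n data, one column per point
Xc = X - repmat(mean(X, 2), 1, n);        % subtract the mean of each coordinate
S = (Xc * Xc') / (n - 1);                 % sample covariance matrix (2-by-2)

[V, L] = eig(S);                          % columns of V are eigenvectors of S
[lambda, order] = sort(diag(L), 'descend');
u1 = V(:, order(1));                      % eigenvector of the largest eigenvalue

% Evaluate w'*S*w for many random unit vectors w
W = randn(2, 1000);
W = W ./ repmat(sqrt(sum(W.^2, 1)), 2, 1);   % normalize each column to length 1
vals = sum(W .* (S * W), 1);                 % w'*S*w for each column w

fprintf('lambda_1 = %.4f, u1''*S*u1 = %.4f, best random w: %.4f\n', ...
        lambda(1), u1' * S * u1, max(vals));
```

The first two numbers should agree, and no random unit vector should exceed them.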
 
 
  
 
D dimensional data will have D eigenvectors
 
D dimensional data will have D eigenvectors
Line 739: Line 954:
 
<math>Var(u_1) \geq Var(u_2) \geq ... \geq Var(u_D)</math>
 
<math>Var(u_1) \geq Var(u_2) \geq ... \geq Var(u_D)</math>
  
 +
If two eigenvalues happen to be equal, then the data has the same amount of variation in each of the two directions that they correspond to. If only one of these two directions is to be kept for dimensionality reduction, then either will do. Note that if ALL of the eigenvalues are the same, then the data has the same amount of variation in every direction (the covariance is spherical), so no direction is preferred over any other.
  
 
Note that the Principal Components decompose the total variance in the data:
 
Note that the Principal Components decompose the total variance in the data:
 
<br /><br />
 
<br /><br />
<math>\displaystyle \sum_{i = 1}^D Var(u_i) = \sum_{i = 1}^D \lambda_i = Tr(S) = \sum_{i = 1}^D Var(x_i)</math>
+
<math>\displaystyle \sum_{i = 1}^D Var(u_i) = \sum_{i = 1}^D \lambda_i = Tr(S) = \sum_{i = 1}^D Var(x_i)</math>
 
<br /><br />
 
<br /><br />
 
i.e. the sum of variations in all directions is the variation in the whole data
 
i.e. the sum of variations in all directions is the variation in the whole data
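This decomposition is easy to confirm numerically. The sketch below (synthetic data and variable names are assumptions) checks that the eigenvalues of the sample covariance sum to its trace, which in turn equals the sum of the variances of the D coordinates.

```matlab
rng(1);                               % fix the seed so the run is repeatable
D = 4;
n = 1000;
X = randn(D, D) * randn(D, n);        % synthetic D-by-n data with correlated coordinates
S = cov(X');                          % cov expects one observation per row

total_eig   = sum(eig(S));            % sum of the eigenvalues of S
total_trace = trace(S);               % Tr(S)
total_var   = sum(var(X, 0, 2));      % sum of the per-coordinate variances

fprintf('%.6f  %.6f  %.6f\n', total_eig, total_trace, total_var);
```

The three printed numbers should agree up to floating-point rounding.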
Line 751: Line 967:
  
 
The Matlab code is as follows:
 
The Matlab code is as follows:
 
{{Cleanup|date=October 2010|reason=I think as mentioned in the class this code does not perform PCA, since mean of the source vector has not been subtracted. This code mus be altered so as to implement real PCA}}
 
 
 
    
 
    
 
   load noisy
 
   load noisy
Line 761: Line 974:
 
   colormap gray
 
   colormap gray
 
   imagesc(reshape(X(:,1),20,28)')
 
   imagesc(reshape(X(:,1),20,28)')
   [u s v] = svd(X);
+
  m_X=mean(X,2);
 +
  mm=repmat(m_X,1,size(X,2)); % replicate the mean across all columns of X
 +
  XX=X-mm;
 +
   [u s v] = svd(XX);
 
   xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)'; % use ten principal components
 
   xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)'; % use ten principal components
 +
  xHat=xHat+mm;
 
   figure
 
   figure
 
   imagesc(reshape(xHat(:,1000),20,28)') % here '1000' can be changed to different values, e.g. 105, 500, etc.
 
   imagesc(reshape(xHat(:,1000),20,28)') % here '1000' can be changed to different values, e.g. 105, 500, etc.
Line 779: Line 996:
  
  
As you can clearly see, more features can be distinguished from the picture of the de-noised face compared to the picture of the noisy face. If we took more principal components, at first the image would improve since the intrinsic dimensionality is probably more than 10. But if you include all the components you get the noisy image, so not all of the principal components improve the image. In general, it is difficult to choose the optimal number of components.
+
As you can clearly see, more features can be distinguished from the picture of the de-noised face compared to the picture of the noisy face. This is because almost all of the noise in the noisy image is captured by the principal components (directions of variation) that capture the least amount of variation in the image, and these principal components were discarded when we used the few principal components that capture most of the image's variation to generate the image's lower-dimensional representation. If we took more principal components, at first the image would improve since the intrinsic dimensionality is probably more than 10. But if you include all the components you get the noisy image, so not all of the principal components improve the image. In general, it is difficult to choose the optimal number of components.
  
 
====Application of PCA - Feature Extraction ====
 
====Application of PCA - Feature Extraction ====
 +
Depending on the field of application, PCA is also called the discrete Karhunen–Loève transform (KLT), the Hotelling transform, or proper orthogonal decomposition (POD).
 
One of the applications of PCA is to group similar data (e.g. images).  There are generally two methods to do this.  We can classify the data (i.e. give each data a label and compare different types of data) or cluster (i.e. do not label the data and compare output for classes).
 
One of the applications of PCA is to group similar data (e.g. images).  There are generally two methods to do this.  We can classify the data (i.e. give each data a label and compare different types of data) or cluster (i.e. do not label the data and compare output for classes).
  
Line 788: Line 1,006:
 
====General PCA Algorithm====
 
====General PCA Algorithm====
  
The PCA Algorithm is summarized in the following slide (taken from the Lecture Slides).
+
The PCA Algorithm is summarized as follows (taken from the Lecture Slides).
  
 
====Algorithm ====
 
====Algorithm ====
'''Recover basis:''' Calculate <math> XX^T =\Sigma_{i=1}^{T}= x_i x_{i}^{T} </math> and let <math> U=</math> eigenvectors of <math> X X^T </math> corresponding to the top <math> d </math> eigenvalues.
+
'''Recover basis:''' Calculate <math> XX^T =\Sigma_{i=1}^{n} x_i x_{i}^{T} </math> and let <math> U=</math> eigenvectors of <math> X X^T </math> corresponding to the top <math> d </math> eigenvalues.
  
'''Encoding training data:''' Let <math>Y=U^TX </math> where <math>Y</math> is a <math>d \times t</math> matrix of encoding of the original data.
+
'''Encoding training data:''' Let <math>Y=U^TX </math> where <math>Y</math> is a <math>d \times n</math> matrix of encoding of the original data.
  
 
'''Reconstructing training data:''' <math>\hat{X}= UY=UU^TX </math>.
 
'''Reconstructing training data:''' <math>\hat{X}= UY=UU^TX </math>.
Line 804: Line 1,022:
 
Other Notes:
 
Other Notes:
 
::#The mean of the data (X) must be 0.  This means we may have to preprocess the data by subtracting off the mean (see details in [http://en.wikipedia.org/wiki/Principle_component_analysis PCA in Wikipedia]).
 
::#The mean of the data (X) must be 0.  This means we may have to preprocess the data by subtracting off the mean (see details in [http://en.wikipedia.org/wiki/Principle_component_analysis PCA in Wikipedia]).
::#Encoding the data means that we are projecting the data onto a lower dimensional subspace by taking the inner product. Encoding: <math>X_{D\times n} \longrightarrow Y_{d\times n}</math> using mapping <math>\, U^T X_{d \times n} </math>.  
+
::#Encoding the data means that we are projecting the data onto a lower dimensional subspace by taking the inner product. Encoding: <math>X_{D\times n} \longrightarrow Y_{d\times n}</math> using mapping <math>\, U^T X_{D \times n} </math>.  
::#When we reconstruct the training set, we are only using the top d dimensions.This will eliminate the dimensions that have lower variance (e.g. noise). Reconstructing: <math> \hat{X}_{D\times n}\longleftarrow Y_{d \times n}</math> using mapping <math>\, UY_{D \times n} </math>.  
+
::#When we reconstruct the training set, we are only using the top d dimensions. This will eliminate the dimensions that have lower variance (e.g. noise). Reconstructing: <math> \hat{X}_{D\times n}\longleftarrow Y_{d \times n}</math> using mapping <math>\, U_dY_{d \times n} </math>, where <math>\,U_d</math> contains the first (leftmost) <math>\,d</math> columns of <math>\,U</math>.
 
::#We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.
 
::#We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.
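The steps above can be sketched in Matlab as follows. This is an illustrative sketch on synthetic data rather than the code used in class; the dimensions, the seed, and the variable names are assumptions.

```matlab
rng(2);                                   % fix the seed so the run is repeatable
D = 5;  n = 200;  d = 2;
X = randn(D, D) * randn(D, n);            % synthetic D-by-n data, columns are points

mu = mean(X, 2);
Xc = X - repmat(mu, 1, n);                % make the data zero-mean (first note above)

[V, L] = eig(Xc * Xc');                   % recover basis from X*X'
[~, order] = sort(diag(L), 'descend');
U = V(:, order(1:d));                     % D-by-d matrix of the top d eigenvectors

Y = U' * Xc;                              % encoding: d-by-n
Xhat = U * Y + repmat(mu, 1, n);          % reconstruction: D-by-n, mean added back

lost = norm(X - Xhat, 'fro')^2 / norm(Xc, 'fro')^2;
fprintf('fraction of total variation lost with d = %d: %.4f\n', d, lost);
```

The printed fraction equals the sum of the discarded eigenvalues divided by the sum of all eigenvalues, so it shrinks to zero as d approaches D.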
  
==Fisher's (Linear) Discriminant Analysis (FDA) - Two Class Problem==
 
  
 +
==== Feature Extraction Uses and Discussion ====
 +
 +
PCA, as well as other feature extraction methods not within the scope of the course [http://en.wikipedia.org/wiki/Feature_extraction] are used as a first step to classification in enhancing generalization capability: one of the classification aspects that will be discussed later in the course is model complexity. As a classification model becomes more complex over its training set, classification error over test data tends to increase. By performing feature extraction prior to attempting classification, we restrict model inputs to only the most important variables, thus decreasing complexity and potentially improving test results.
  
===Lecture Summary===
+
Feature ''selection'' methods, which are used to select subsets of relevant features for building robust learning models, differ from extraction methods, in which features are transformed. Feature selection has the added benefit of improving model interpretability.
This lecture introduces Fisher's linear discriminant analysis (FDA), which is a supervised dimensionality reduction method. FDA does not assume any distribution of the data, and it works by reducing the dimensionality of the data by projecting the data on a line. That is, given d-dimensional data, FDA projects it to a one-dimensional representation by <math>z = \underline{w}^T \underline{x} </math> where <math>x \in \mathbb{R}^{d}</math> and <math> \underline{w} =  \begin{bmatrix}w_1 \\ \vdots \\w_d \end{bmatrix} _{d \times 1}</math><br />
 
FDA derives a set of feature vectors by which high-dimensional data can be projected onto a low-dimensional feature space in the sense of maximizing class separability. Furthermore, the lecture clarifies a set of FDA basic concepts like Fisher’s ratio, ratio of between-class scatter matrix to within-class scatter matrix. It also discusses the goals specified by Fisher for his analysis then proceeding by mathematical formulation of these goals.
 
  
===Sir Ronald A. Fisher===
 
Fisher's Discriminant Analysis (FDA), also known as Fisher's Linear Discriminant Analysis (LDA) in some sources, is a classical feature extraction technique. It was originally described in 1936 by Sir [http://en.wikipedia.org/wiki/Ronald_A._Fisher Ronald Aylmer Fisher], an English statistician and eugenicist who has been described as one of the founders of modern statistical science. His original paper describing FDA can be found [http://digital.library.adelaide.edu.au/dspace/handle/2440/15227 here]; a Wikipedia article summarizing the algorithm can be found [http://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher.27s_linear_discriminant  here].
 
In this paper Fisher used for the first time the term DISCRIMINANT FUNCTION. The term DISCRIMINANT ANALYSIS was introduced later by Fisher himself in a subsequent paper which can be found [http://digital.library.adelaide.edu.au/coll/special//fisher/155.pdf here].
 
  
=== Contrasting FDA with PCA ===
 
Similar to PCA, the goal of FDA is to project the data in a lower dimension.  The difference is that we are not interested in maximizing variances.  Rather our goal is to find a direction that is useful for classifying the data (i.e. in this case, we are looking for direction representative of a particular characteristic e.g. glasses vs. no-glasses). 
 
Roughly speaking, suppose we have 2-dimensional data. Our goal is to project the data of each class onto a point and then make those two points as far apart as possible or, more mathematically, to project the data onto a line so that the two classes fall on opposite sides of a point on the line. If we can do this, then even a simple classifier can be used for classification. FDA has been proposed to do this task for our data.
 
The number of dimensions that we want to reduce the data to, depends on the number of classes:
 
<br>
 
For a 2 class problem, we want to reduce the data to one dimension (a line), <math>\displaystyle Z \in \mathbb{R}^{1}</math> 
 
<br>
 
Generally, for a k class problem, we want k-1 dimensions, <math>\displaystyle Z \in \mathbb{R}^{k-1}</math>
 
  
As we will see from our objective function, we want to maximize the separation of the classes and to minimize the within variance of each class. That is, our ideal situation is that the individual classes are as far away from each other as possible, but the data within each class is close together (i.e. collapse to a single point).
+
=== Independent Component Analysis ===
 +
As we have already seen, the Principal Component Analysis (PCA) performed by the Karhunen-Loève transform produces features <math>\ y(i),\ i = 0, 1, \ldots, N - 1</math>, that are mutually uncorrelated. The solution obtained by the KL transform is optimal when dimensionality reduction is the goal and one wishes to minimize the approximation mean square error. However, for certain applications, the obtained solution falls short of expectations. In contrast, the more recently developed Independent Component Analysis (ICA) theory tries to achieve much more than simple decorrelation of the data. The ICA task is cast as follows: Given the set of input samples <math>\ x</math>, determine an <math>\ N \times N</math> invertible matrix <math>\ W</math> such that the entries <math>\ y(i),\ i = 0, 1, \ldots, N - 1</math>, of the transformed vector
 +
 
 +
<math>\ y = W x</math>
  
The following diagram summarizes this goal.
+
are mutually independent. The goal of statistical independence is a stronger condition than the uncorrelatedness required by PCA. The two conditions are equivalent only for Gaussian random variables. Searching for independent rather than uncorrelated features lets us exploit much more of the information hidden in the higher-order statistics of the data.
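
As a small illustration of why independence is the stronger condition (this example is not taken from the lecture), let <math>\ x</math> be uniformly distributed on <math>\ [-1, 1]</math> and let <math>\ y = x^2</math>. Then

<math>\ Cov(x, y) = E[x^3] - E[x]E[x^2] = 0 - 0 \cdot \frac{1}{3} = 0, </math>

so <math>\ x</math> and <math>\ y</math> are uncorrelated, yet <math>\ y</math> is completely determined by <math>\ x</math>, so the two are not independent. A method that only decorrelates the data cannot detect this kind of dependence.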
  
[[File:FDA.JPG]]
+
This material is taken from ''Pattern Recognition'' by Sergios Theodoridis and Konstantinos Koutroumbas (Chapter 6). For further details on ICA and its variants, refer to this book.
  
In fact, the two examples above may represent the same data projected on two different lines.
+
=== References ===
 +
1. Probabilistic Principal Component Analysis
 +
[http://onlinelibrary.wiley.com/doi/10.1111/1467-9868.00196/abstract]
  
[[File:FDAtwo.PNG]]
+
2. Nonlinear Component Analysis as a Kernel Eigenvalue Problem
 +
[http://www.mitpressjournals.org/doi/abs/10.1162/089976698300017467]
  
=== Distance Metric Learning VS FDA ===
+
3. Kernel principal component analysis
In many fundamental machine learning problems, the Euclidean distances between data points do not represent the desired topology that we are trying to capture. Kernel methods address this problem by mapping the points into new spaces where Euclidean distances may be more useful. An alternative approach is to construct a Mahalanobis distance (quadratic Gaussian metric) over the input space and use it in place of Euclidean distances. This approach can be equivalently interpreted as a linear transformation of the original inputs,followed by Euclidean distance in the projected space. This approach has attracted a lot of recent interest.
+
[http://www.springerlink.com/content/w0t1756772h41872/]
  
{{Cleanup|date=October2010|reason=Anyone please add an example to make the comparison clearer}}
+
4. Principal Component Analysis
 +
[http://onlinelibrary.wiley.com/doi/10.1002/0470013192.bsa501/full] and [http://support.sas.com/publishing/pubcat/chaps/55129.pdf]
  
Some of the proposed algorithms are iterative and computationally expensive. In the paper,"[http://www.aaai.org/Papers/AAAI/2008/AAAI08-095.pdf Distance Metric Learning VS FDA] " written by our instructor, they propose a closed-form solution to one algorithm that previously required expensive semidefinite optimization. They provide a new problem setup in which the algorithm performs better or as well as some standard methods, but without the computational complexity. Furthermore, they show a strong relationship between these methods and the Fisher Discriminant Analysis (FDA). They also extend the approach by kernelizing it, allowing for non-linear transformations of the metric.
+
=== Further Readings ===
 +
1. I. T. Jolliffe "Principal component analysis" Available [http://books.google.ca/books?id=_olByCrhjwIC&printsec=frontcover&dq=principal+component+analysis&hl=en&ei=TooCTaesN42YnweR843lDQ&sa=X&oi=book_result&ct=result&resnum=1&ved=0CC4Q6AEwAA#v=onepage&q&f=false here].
  
===FDA Goals===
+
2. James V. Stone "Independent component analysis: a tutorial introduction" Available [http://books.google.ca/books?id=P0rROE-WFCwC&pg=PA129&dq=principal+component+analysis&hl=en&ei=TooCTaesN42YnweR843lDQ&sa=X&oi=book_result&ct=result&resnum=7&ved=0CEYQ6AEwBg#v=onepage&q=principal%20component%20analysis&f=false here].
{{Cleanup|date=October 2010|reason=It would be nice to see how did Fisher arrived at these two goals}}
 
Fisher has defined two goals by which the quality of discrimination is maximized.
 
The goals of FDA are reducing the dimensionality of data in order to have labeled separable data points in a 1D subspace orthogonal to the data (selected feature). We can consider two kinds of problems:
 
  
1. Two-class problem
+
3. Aapo Hyvärinen, Juha Karhunen, Erkki Oja "Independent component analysis" Available [http://books.google.ca/books?id=96D0ypDwAkkC&printsec=frontcover&dq=independent+component+analysis&hl=en&ei=F4wCTZqjJY2RnAew6pnlDQ&sa=X&oi=book_result&ct=result&resnum=1&ved=0CCoQ6AEwAA#v=onepage&q&f=false here].
  
2. Multi-class problem (addressed next lecture)
+
== Fisher's (Linear) Discriminant Analysis (FDA) - Two Class Problem  - October 5, 2010 ==
  
=== Two-class problem  ===
+
===Sir Ronald A. Fisher===
In the two-class problem, we have the pre-knowledge that data points belong to two classes. Intuitively speaking points of each class form a cloud around the mean of the class, with each class having possibly different size. To be able to separate the two classes we must determine the class whose mean is closest to a given point while also accounting for the different size of each class, which is represented by the covariance of each class.
+
Fisher's Discriminant Analysis (FDA), also known as Fisher's Linear Discriminant Analysis ([http://en.wikipedia.org/wiki/Linear_discriminant_analysis LDA]) in some sources, is a classical [http://en.wikipedia.org/wiki/Feature_extraction feature extraction] technique. It was originally described in 1936 by Sir [http://en.wikipedia.org/wiki/Ronald_A._Fisher Ronald Aylmer Fisher], an English statistician and eugenicist who has been described as one of the founders of modern statistical science. His original paper describing FDA can be found [http://digital.library.adelaide.edu.au/dspace/handle/2440/15227 here]; a Wikipedia article summarizing the algorithm can be found [http://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher.27s_linear_discriminant  here].  
 +
In this paper Fisher used for the first time the term DISCRIMINANT FUNCTION. The term DISCRIMINANT ANALYSIS was introduced later by Fisher himself in a subsequent paper which can be found [http://digital.library.adelaide.edu.au/coll/special//fisher/155.pdf here].
  
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{1}</math>,
+
===Introduction===
represent the mean and covariance of the 1st class, and  
+
'''Linear discriminant analysis''' ([http://en.wikipedia.org/wiki/Linear_discriminant_analysis LDA]) and the related '''Fisher's linear discriminant''' are methods used in statistics, pattern recognition and machine learning to find a linear combination of features which characterize or separate two or more classes of objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification.
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{2}</math> represent the mean and covariance of the 2nd class. We have to find a transformation which satisfies the following goals:
 
  
1.''To make the means of these two classes as far apart as possible''
+
LDA is also closely related to principal component analysis ([http://en.wikipedia.org/wiki/Principal_component_analysis PCA]) and [http://en.wikipedia.org/wiki/Factor_analysis factor analysis] in that both look for linear combinations of variables which best explain the data. LDA explicitly attempts to model the difference between the classes of data. PCA on the other hand does not take into account any difference in class, and factor analysis builds the feature combinations based on differences rather than similarities. Discriminant analysis is also different from factor analysis in that it is not an interdependence technique: a distinction between independent variables and dependent variables (also called criterion variables) must be made.
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected to a single line; the line we seek is the one with the direction that achieves maximum separation of classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math>, then the means of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the squared Euclidean distance between the projected means, <math>(\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})</math>. The steps of this maximization are given below. 
 
  
2.''We want to collapse all data points of each class to a single point, i.e., minimize the covariance within classes''
+
LDA works when the measurements made on independent variables for each observation are continuous quantities. When dealing with categorical independent variables, the equivalent technique is '''discriminant correspondence analysis'''.
: Notice that the variances of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>.  The second goal is to minimize the sum of these two variances.
 
  
As is demonstrated below, both of these goals can be accomplished simultaneously.
+
=== Contrasting FDA with PCA ===
<br/>
+
As in PCA, the goal of FDA is to project the data in a lower dimension. You might ask, why was FDA invented when PCA already existed? There is a simple explanation for this that can be found [http://www.ics.uci.edu/~welling/classnotes/papers_class/Fisher-LDA.pdf here]. PCA is an unsupervised method, so it does not take into account the labels in the data. Suppose we have two clusters that have very different or even opposite labels from each other but are nevertheless positioned so that they are nearly parallel to each other and also very near to each other. In this case, most of the total variation of the data is along the direction of these two clusters. If we use PCA in cases like this, then both clusters would be projected onto the direction of greatest variation of the data and would look like a single cluster after projection. PCA would therefore mix up these two clusters that, in fact, have very different labels. What we need to do instead, in cases like this, is to project the data onto a direction that is orthogonal to the direction of greatest variation of the data. This is the direction of least variation of the data. On the 1-dimensional space resulting from such a projection, we would then be able to effectively classify the data, because these two clusters would be perfectly or nearly perfectly separated from each other according to their labels. This is exactly the idea behind FDA.
<br/>
 
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br />  <math>\ \{ \underline x_1 \underline x_2 \cdot \cdot \cdot \underline x_n \} </math>
 
  
 +
The main difference between FDA and PCA is that, in FDA, in contrast to PCA, we are not interested in retaining as much of the variance of our original data as possible.  Rather, in FDA, our goal is to find a direction that is useful for classifying the data (i.e. in this case, we are looking for a direction that is most representative of a particular characteristic e.g. glasses vs. no-glasses). 
 +
Suppose we have 2-dimensional data, then FDA would attempt to project the data of each class onto a point in such a way that the resulting two points would be as far apart from each other as possible. Intuitively, this basic idea behind FDA is the optimal way for separating each pair of classes along a certain direction.
  
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math>; each <math>\ z_i </math> is a scalar.
+
Please note that dimension reduction in PCA is different from subspace clustering; see the details about subspace clustering [http://en.wikipedia.org/wiki/Clustering_high-dimensional_data].
 +
{{Cleanup|date=October 2010|reason= Just a thought: how relevant is "Dimensionality reduction techniques" to the concept of "subspace clustering"? As in subspace clustering, the goal is to find a set of features (relevant features, the concept is referred to as local feature relevance in the literature) in the high dimensional space, where potential subspaces accommodating different classes of data points can be defined. This means; the data points are dense when they are considered in a subset of dimensions (features).}}
 +
{{Cleanup|date=October 2010|reason=If I'm not mistaken, classification techniques like FDA use labeled training data whereas clustering techniques use unlabeled training data instead. Any other input regarding this would be much appreciated. Thanks}}
 +
{{Cleanup|date=October 2010|reason=An extension of clustering is subspace clustering, in which different subspaces are searched to find the relevant and appropriate dimensions. Points in high-dimensional data sets are roughly equidistant from each other, so feature selection methods are used to remove the irrelevant dimensions. These techniques do not keep the relative distances, so PCA is not useful for these applications. It should be noted that subspace clustering algorithms localize their search, unlike feature selection algorithms. For more information click here [http://portal.acm.org/citation.cfm?id=1007731]}}
  
====1. Maximum Separation====
+
The number of dimensions that we want to reduce the data to depends on the number of classes:
<math>\displaystyle \min (w^T\sum_1w) </math>
+
<br>
 +
For a 2-class problem, we want to reduce the data to one dimension (a line), <math>\displaystyle Z \in \mathbb{R}^{1}</math> 
 +
<br>
 +
Generally, for a k-class problem, we want to reduce the data to k-1 dimensions, <math>\displaystyle Z \in \mathbb{R}^{k-1}</math>
  
<math>\displaystyle \min (w^T\sum_2w) </math>
+
As we will see from our objective function, we want to maximize the separation of the classes and to minimize the within-variance of each class.  That is, our ideal situation is that the individual classes are as far away from each other as possible, and at the same time the data within each class are as close to each other as possible (collapsed to a single point in the most extreme case).
   
 
and this problem reduces to <math>\displaystyle \min (w^T(\sum_1 + \sum_2)w)</math>
 
<br> (where <math>\displaystyle \sum_1</math> and <math>\displaystyle \sum_2</math> are the covariance matrices for the 1st and 2nd class of data respectively) 
 
  
Let <math>\displaystyle \ s_w=\sum_1 + \sum_2</math> be the within classes covariance.
+
The following diagram summarizes this goal.
Then, this problem can be rewritten as <math>\displaystyle \min (w^Ts_ww)</math>
 
  
====2. Maximize the distance between the means of the projected data====
+
[[File:FDA.JPG]]
<br />
 
The optimization problem we want to solve is,
 
<br /><br />
 
<math>\displaystyle \max (w^T \mu_1 - w^T \mu_2)^2, </math>
 
<br /><br />
 
<math>\begin{align} (w^T \mu_1 - w^T \mu_2)^2 &= (w^T \mu_1 - w^T \mu_2)^T(w^T \mu_1 - w^T \mu_2)\\
 
&= (\mu_1^Tw - \mu_2^Tw)(w^T \mu_1 - w^T \mu_2)\\
 
&= (\mu_1^T - \mu_2^T)ww^T(\mu_1 - \mu_2) \end{align}</math>
 
<br /><br />
 
which is a scalar. Therefore,
 
<br /><br />
 
<math>\displaystyle = tr[(\mu_1^T - \mu_2^T)ww^T(\mu_1 - \mu_2)] </math>
 
  
<math>\displaystyle = tr[w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw] </math>
+
In fact, the two examples above may represent the same data projected on two different lines.
<br /><br />
 
(using the cyclic property of the trace, <math>\displaystyle tr[ABC] = tr[CAB] = tr[BCA] </math>)
 
<br /><br />
 
<math>\displaystyle = w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw </math>
 
<br /><br />
 
Thus our original problem can equivalently be written as,
 
<br /><br />
 
<math>\displaystyle \max (w^T \mu_1 - w^T \mu_2)^2 = \displaystyle \max (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) </math>
 
<br /><br />
 
For a two class problem the between class variance is,
 
<br /><br />
 
<math>\displaystyle \ s_B=(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T</math>
 
<br /><br />
 
Then this problem can be rewritten as,
 
<br /><br />
 
<math>\displaystyle \max (w^Ts_Bw)</math>
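The identity derived above can be checked numerically. The following is a minimal sketch in plain Python, not part of the lecture; the class means are the ones used in the Matlab example on this page, while the direction <code>w</code> is an arbitrary value chosen here purely for illustration.

```python
# Class means taken from the Matlab example on this page.
mu1 = [1.0, 1.0]
mu2 = [5.0, 3.0]
# Hypothetical projection direction, chosen only for illustration.
w = [0.3, -0.7]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Left-hand side: squared distance between the projected means.
lhs = (dot(w, mu1) - dot(w, mu2)) ** 2

# Right-hand side: w^T s_B w with s_B = (mu1 - mu2)(mu1 - mu2)^T.
d = [a - b for a, b in zip(mu1, mu2)]
s_B = [[di * dj for dj in d] for di in d]
s_B_w = [dot(row, w) for row in s_B]
rhs = dot(w, s_B_w)

print(lhs, rhs)  # the two values agree up to floating-point rounding
```

Any other choice of <code>w</code> should give the same agreement, since the identity holds for every vector.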
 
  
{{Cleanup|date=October 2010|reason=This section needs more explanations in using Lagrange multiplier and the way that we reach to the result through calculations  }}
+
[[File:FDAtwo.PNG]]
  
===Objective Function===
+
=== Distance Metric Learning VS FDA ===
We want an objective function which satisfies both of the goals outlined above (at the same time).<br /><br />
+
In many fundamental machine learning problems, the Euclidean distances between data points do not represent the desired topology that we are trying to capture. Kernel methods address this problem by mapping the points into new spaces where Euclidean distances may be more useful. An alternative approach is to construct a Mahalanobis distance (quadratic Gaussian metric) over the input space and use it in place of Euclidean distances. This approach can be equivalently interpreted as a linear transformation of the original inputs, followed by Euclidean distance in the projected space. This approach has attracted a lot of recent interest.
# <math>\displaystyle \min (w^T(\sum_1 + \sum_2)w)</math> or <math>\displaystyle \min (w^Ts_ww)</math>
 
# <math>\displaystyle \max (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) </math> or <math>\displaystyle \max (w^Ts_Bw)</math>
 
<br /><br />
 
We take the ratio of the two -- we wish to maximize<br />
 
<br />
 
<math>\displaystyle \frac{(w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw)} {(w^T(\sum_1 + \sum_2)w)} </math>
 
  
or equivalently,<br /><br />
+
Some of the proposed algorithms are iterative and computationally expensive. In the paper "[http://www.aaai.org/Papers/AAAI/2008/AAAI08-095.pdf Distance Metric Learning VS FDA]", co-written by our instructor, the authors propose a closed-form solution to one algorithm that previously required expensive semidefinite optimization. They provide a new problem setup in which the algorithm performs better than or as well as some standard methods, but without the computational complexity. Furthermore, they show a strong relationship between these methods and Fisher Discriminant Analysis (FDA). They also extend the approach by kernelizing it, allowing for non-linear transformations of the metric.
  
<math>\displaystyle \max \frac{(w^Ts_Bw)}{(w^Ts_ww)}</math>
+
'''Example'''
  
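When <math>w</math> is a matrix rather than a vector (as in the multi-class case), <math>w^Ts_Bw</math> and <math>w^Ts_ww</math> are matrices, and maximizing a ratio of matrices does not make sense. In that case we take the trace of each term: <math>\displaystyle \max \frac{Tr(w^Ts_Bw)}{Tr(w^Ts_ww)}</math>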
+
In the paper "[http://www.aaai.org/Papers/AAAI/2008/AAAI08-095.pdf Distance Metric Learning VS FDA]", the classification error rates for three of the six UCI datasets are shown below. Each learned metric is projected onto a low-dimensional
 +
subspace, shown along the x-axis.
 +
:[[File:Example.png]],[[File:Example3.png]]
  
 +
===FDA Goals===
  
One may argue that we could instead subtract one term from the other. That approach is also valid, but it can be shown that it requires an additional scaling factor, so using the ratio is more convenient.
+
An intuitive description of FDA can be given by visualizing two clouds of data, as shown above. Ideally, we would like to collapse all of the data points in each cloud onto one point on some projected line, then make those two points as far apart as possible. In doing so, we make it very easy to tell which class a data point belongs to. In practice, it is not possible to collapse all of the points in a cloud to one point, but we attempt to make all of the points in a cloud close to each other while simultaneously far from the points in the other cloud.
 +
==== Example in R ====
 +
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]
  
 +
>> library(MASS)  # provides mvrnorm() and lda()
 +
>> X = matrix(nrow=400,ncol=2)
 +
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))
 +
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))
 +
>> Y = c(rep("red",200),rep("blue",200))
 +
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>.  Create <code>Y</code>, an index indicating which class they belong to.
  
This is a well-known problem called the "generalized eigenvalue problem".  We can solve it using Lagrange multipliers. Since w is a direction vector, we do not care about its magnitude.  Therefore we solve a problem similar to that in PCA,
+
  >> s <- svd(scale(X,scale=FALSE),nu=1,nv=1)  # center X so that the SVD gives the PCA direction
<br /><br />
+
: Calculate the singular value decomposition of X.  The most significant direction is in <code>s$v[,1]</code>, and is displayed as a black line.
<math>\displaystyle \max (w^Ts_Bw)</math> <br />
 
subject to <math>\displaystyle (w^Ts_ww=1)</math> (in the general optimization form, 1 is replaced with a constant b)
 
<br /><br />
 
  
where <math>s_B</math> is the covariance matrix between classes and <math>s_w</math> is the covariance matrix within classes.
+
>> s2 <- lda(X,grouping=Y)
 +
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction.  This can be found in <code>s2$scaling</code>.
  
We solve the following Lagrange Multiplier problem,
+
Now that we've calculated the PCA and FLDA decompositions, we create a plot to demonstrate the differences between the two algorithms.  FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.
<br /><br />
+
>> plot(X,col=Y,main="PCA vs. FDA example")
<math>\displaystyle L(w,\lambda) = w^Ts_Bw - \lambda (w^Ts_ww -1) </math><br /><br />
+
: Plot the set of points, according to colours given in Y.
 +
>> slope = s$v[2]/s$v[1]
 +
>> intercept = mean(X[,2])-slope*mean(X[,1])
 +
>> abline(a=intercept,b=slope)
 +
: Plot the main PCA direction, drawn through the mean of the dataset.  Only the direction is significant.
 +
>> slope2 = s2$scaling[2]/s2$scaling[1]
 +
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])
 +
>> abline(a=intercept2,b=slope2,col="red")
 +
: Plot the FLDA direction, again through the mean.
 +
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)
 +
: Labeling the lines directly on the graph makes it easier to interpret.
  
So, we have a Partial solution to:
 
<math>\displaystyle (w^Ts_Bw) - \lambda \cdot [(w^Ts_ww)-1] </math>
 
  
- The optimal solution for w is the eigenvector of  
+
FDA projects the data into lower dimensional space, where the distances between the projected means are maximum and the within-class variances are minimum. There are two categories of classification problems:
<math>\displaystyle s_w^{-1}s_B </math>
 
corresponding to the largest eigenvalue;
 
  
{{Cleanup|date=October2010|reason=is it not that the K class problem is the multi class problem? If so, the solution would be totally different}}
+
1. Two-class problem
 
 
{{Cleanup|date=October2010|reason=In this part of the lecture FDA for 2 classes is described; the description for k classes can be found in the following pages, where it is referred to as FDA for multi-class problems. The equations here are correct for the two-class case. In the multi-class case W is no longer a vector (it is a matrix), therefore instead of max (W<sup>T</sup> S<sub>B</sub> W / W<sup>T</sup> S<sub>W</sub> W) it should be written as max (Tr (W<sup>T</sup> S<sub>B</sub> W)/ Tr (W<sup>T</sup> S<sub>W</sub> W)).}}
 
  
 +
2. Multi-class problem (addressed next lecture)
  
 +
=== Two-class problem  ===
 +
In the two-class problem, we have prior knowledge that the data points belong to two classes. Conceptually, points of each class form a cloud around the class mean, and each class has a distinct size. To divide points among the two classes, we must determine the class whose mean is closest to each point, and we must also account for the different size of each class given by the covariance of each class.
  
 +
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{1}</math>
 +
represent the mean and covariance of the 1st class, and
 +
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{2}</math> represent the mean and covariance of the 2nd class. We have to find a transformation which satisfies the following goals:
 +
 +
1.''To make the means of these two classes as far apart as possible''
 +
:In other words, the goal is to maximize the distance after projection between class 1 and class 2.  This can be done by maximizing the distance between the means of the classes after projection.  When projecting the data points to a one-dimensional space, all points will be projected to a single line; the line we seek is the one with the direction that achieves maximum separation of classes upon projection.  If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math>, then the means of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively.  The goal now becomes to maximize the squared Euclidean distance between the projected means, <math>(\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})</math>. The steps of this maximization are given below.
  
- For a k class problem, we will take the eigenvectors corresponding to the (k-1) highest eigenvalues.   
+
2.''We want to collapse all data points of each class to a single point, i.e., minimize the covariance within each class''
 +
: Notice that the variances of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>.  The second goal is to minimize the sum of these two variances (the sum of the two covariance matrices is itself a valid covariance matrix, satisfying the symmetry and positive semi-definite criteria).
  
- In the case of the two-class problem, the optimal solution for w can be simplified, such that:
+
{{Cleanup|date=October 2010|reason=In 2. above, I wonder if the computation would be much more complex if we instead find a weighted sum of the covariances of the two classes where the weights are the sizes of the two classes?}}
<math>\displaystyle w \propto s_w^{-1}(\mu_2 - \mu_1) </math>
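The closed-form direction above can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the lecture's own code: the means and the shared covariance are those of the Matlab example on this page, and the helper names <code>solve_2x2</code> and <code>fisher_ratio</code> are hypothetical names introduced here.

```python
def solve_2x2(A, b):
    """Solve A x = b for a 2x2 matrix A using Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    if det == 0:
        raise ValueError("s_w is singular, so it has no inverse")
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]


def fisher_ratio(w, d, s_w):
    """Return (w^T d)^2 / (w^T s_w w), where d is the difference of the class means."""
    numerator = (w[0] * d[0] + w[1] * d[1]) ** 2
    s_w_w = [s_w[0][0] * w[0] + s_w[0][1] * w[1],
             s_w[1][0] * w[0] + s_w[1][1] * w[1]]
    return numerator / (w[0] * s_w_w[0] + w[1] * s_w_w[1])


# Values from the Matlab example: both classes share the same covariance.
mu1, mu2 = [1.0, 1.0], [5.0, 3.0]
sigma = [[1.0, 1.5], [1.5, 3.0]]
s_w = [[2 * v for v in row] for row in sigma]   # s_w = Sigma_1 + Sigma_2
d = [mu2[0] - mu1[0], mu2[1] - mu1[1]]          # mu_2 - mu_1

w = solve_2x2(s_w, d)                           # w = s_w^{-1} (mu_2 - mu_1)
print("FDA direction:", w)
print("ratio along w:", fisher_ratio(w, d, s_w))
print("ratio along mu_2 - mu_1:", fisher_ratio(d, d, s_w))
```

The last two lines compare the objective along the FDA direction with the objective along the line joining the two means. The FDA direction gives the larger value, which illustrates why the within-class covariance has to be taken into account and not only the means.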
 
  
===FDA Using Matlab===
+
{{Cleanup|date=December 2010|reason= If using the weighted sum of two covariances, you will need to use the shared mean of the two classes, and the weighted sum will be the shared covariance. Doing this will result in collapsing the two classes into one point, which contradicts the purpose of using FDA}}
Note: ''The following example was not actually mentioned in this lecture''
 
  
We see now an application of the theory that we just introduced. Using Matlab, we find the principal components and the projection by Fisher Discriminant Analysis of two Bivariate normal distributions to show the difference between the two methods.
+
As is demonstrated below, both of these goals can be accomplished simultaneously.
      %First of all, we generate the two data sets:
+
<br/>
      % First data set X1
+
<br/>
      X1 = mvnrnd([1,1],[1 1.5; 1.5 3], 300);
+
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br />  <math>\ \{ \underline{x}_1, \underline{x}_2, \cdots, \underline{x}_n \} </math>
      %In this case:
 
      mu_1=[1;1];
 
      Sigma_1=[1 1.5; 1.5 3];
 
      %where mu and sigma are the mean and covariance matrix.
 
      % Second data set X2
 
      X2 = mvnrnd([5,3],[1 1.5; 1.5 3], 300);
 
      %Here mu_2=[5;3] and Sigma_2=[1 1.5; 1.5 3]
 
      %The plot of the two distributions is:
 
      plot(X1(:,1),X1(:,2),'.b'); hold on;
 
      plot(X2(:,1),X2(:,2),'ob')
 
     
 
[[File:Mvrnd.jpg]]
 
  
      %We compute the principal components:
 
      % Combine data sets to map both into the same subspace
 
      X=[X1;X2];
 
      X=X';
 
      % We used built-in PCA function in Matlab
 
      [coefs, scores]=princomp(X');   % princomp expects one observation per row, so pass the 600x2 matrix
 
 
 
      plot([0 coefs(1,1)], [0 coefs(2,1)],'b')
 
      plot([0 coefs(1,1)]*10, [0 coefs(2,1)]*10,'r')
 
      sw=2*[1 1.5;1.5 3]  % sw=Sigma1+Sigma2=2*Sigma1
 
      w=sw\[4; 2]      % calculate s_w^{-1}(mu2 - mu1)
 
      plot ([0 w(1)], [0 w(2)],'g')
 
  
[[File:Pca_full_1.jpg]]
+
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math> where <math>\ z_i </math> is a scalar
     
 
      %We now make the projection:
 
      Xf=w'*X
 
      figure
 
      plot(Xf(1:300),1,'ob') %In this case, since it's a one dimension data, the plot is "Data Vs Indexes"
 
      hold on
 
      plot(Xf(301:600),1,'or')
 
     
 
  
[[File:Fisher_no_overlap.jpg]]
+
====1. Minimizing within-class variance====
 +
<math>\displaystyle \min_w (\underline{w}^T\sum_1\underline{w}) </math>
  
      %We see in the above picture that there is no overlap
+
<math>\displaystyle \min_w (\underline{w}^T\sum_2\underline{w}) </math>
      Xp=coefs(:,1)'*X
+
   
      figure
+
and this problem reduces to <math>\displaystyle \min_w (\underline{w}^T(\sum_1 + \sum_2)\underline{w})</math>
      plot(Xp(1:300),1,'ob')   % a marker is needed to display discrete points
+
<br> (where <math>\,\sum_1</math> and <math>\,\sum_2 </math> are the covariance matrices of the 1st and 2nd classes of data, respectively)  
      hold on
 
      plot(Xp(301:600),2,'or')  
 
 
 
 
 
[[File:Pca_overlap.jpg]]
 
  
      %In this case there is an overlap, since we project onto the first principal component [Xp=coefs(:,1)'*X]
+
Let <math>\displaystyle \ s_w=\sum_1 + \sum_2</math> be the within-classes covariance.
 +
Then, this problem can be rewritten as <math>\displaystyle \min_w (\underline{w}^Ts_w\underline{w})</math>.
  
===Some of FDA applications===
+
====2. Maximize the distance between the means of the projected data====
FDA has applications in many domains; some of them are stated below:
+
<br /><br />
 +
<math>\displaystyle \max_w ||\underline{w}^T \mu_1 - \underline{w}^T \mu_2||^2, </math>
 +
<br /><br />
 +
<math>\begin{align} ||\underline{w}^T \mu_1 - \underline{w}^T \mu_2||^2 &= (\underline{w}^T \mu_1 - \underline{w}^T \mu_2)^T(\underline{w}^T \mu_1 - \underline{w}^T \mu_2)\\
 +
&= (\mu_1^T\underline{w} - \mu_2^T\underline{w})(\underline{w}^T \mu_1 - \underline{w}^T \mu_2)\\
 +
&= (\mu_1 - \mu_2)^T \underline{w}  \underline{w}^T (\mu_1 - \mu_2) \\
  
* SPEECH/MUSIC/NOISE CLASSIFICATION IN HEARING AIDS
+
&= ((\mu_1 - \mu_2)^T \underline{w})^{T}  (\underline{w}^T (\mu_1 - \mu_2))^{T} \\
FDA can be used to enhance listening comprehension when the user goes from a sound
+
&= \underline{w}^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T \underline{w} \end{align}</math><br />
environment to another different one. For more information review this paper by Alexandre et al.[http://www.eurasip.org/Proceedings/Eusipco/Eusipco2008/papers/1569101740.pdf here]
 
  
* Application to Face Recognition
+
Note that the last two steps use the fact that each factor is a scalar and therefore equals its own transpose, which allows the order of the factors to be rearranged.
FDA can be used in face recognition in different situations. Using FDA, Kong et al. propose a framework for face recognition with a small number of training samples [http://person.hst.aau.dk/pimuller/2D_FDA_Face_CVPR05fish.pdf here].
 
  
* Palmprint Recognition
+
Let <math>\displaystyle s_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T</math> be the between-class covariance. Then the goal is to solve <math>\displaystyle \max_w (\underline{w}^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T\underline{w}) </math>, or equivalently <math>\displaystyle \max_w (\underline{w}^Ts_B\underline{w})</math>.
FDA is used in biometrics, to implement an automated palmprint recognition system. See An Automated Palmprint Recognition System by Tee et al. [http://www.sciencedirect.com/science?_ob=MImg&_imagekey=B6V09-4FJ5XPN-1-1&_cdi=5641&_user=1067412&_pii=S0262885605000089&_origin=search&_coverDate=05%2F01%2F2005&_sk=999769994&view=c&wchp=dGLbVzz-zSkWb&md5=a064b67c9bdaaba7e06d800b6c9b209b&ie=/sdarticle.pdf here].
 
  
{{Cleanup|date=October 2010|reason=I think briefing about the other applications would be easier than browsing through all of these applications}}
+
===The Objective Function for FDA===
Other applications can be found in references 4, 5, 6, 7 and 8, and more in [http://www.sciencedirect.com/science?_ob=ArticleListURL&_method=list&_ArticleListID=1489148820&_sort=r&_st=13&view=c&_acct=C000051246&_version=1&_urlVersion=0&_userid=1067412&md5=f210273546a659c90ae0962fce7b8b4e&searchtype=a here].
+
We want an objective function which satisfies both of the goals outlined above (at the same time).<br /><br />
 +
# <math>\displaystyle \min_w (\underline{w}^T(\sum_1 + \sum_2)\underline{w})</math> or <math>\displaystyle \min_w (\underline{w}^Ts_w\underline{w})</math>
 +
# <math>\displaystyle \max_w (\underline{w}^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T\underline{w}) </math> or <math>\displaystyle \max_w (\underline{w}^Ts_B\underline{w})</math>
 +
<br /><br />
 +
So, we construct our objective function as the ratio of the two quantities given above, which we maximize:<br />
 +
<br />
 +
<math>\displaystyle \max_w \frac{(\underline{w}^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T\underline{w})} {(\underline{w}^T(\sum_1 + \sum_2)\underline{w})} </math>
  
=== '''References'''===
+
or equivalently,<br /><br />
1. Kong, H.; Wang, L.; Teoh, E.K.; Wang, J.-G.; Venkateswarlu, R.; , "A framework of 2D Fisher discriminant analysis: application to face recognition with small number of training samples," Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on , vol.2, no., pp. 1083- 1088 vol. 2, 20-25 June 2005
 
doi: 10.1109/CVPR.2005.30
 
[http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1467563&isnumber=31473 1]
 
  
2. Enrique Alexandre, Roberto Gil-Pita, Lucas Cuadra, Lorena Álvarez, Manuel Rosa-Zurera, "SPEECH/MUSIC/NOISE CLASSIFICATION IN HEARING AIDS USING A TWO-LAYER CLASSIFICATION SYSTEM WITH MSE LINEAR DISCRIMINANTS", 16th European Signal Processing Conference (EUSIPCO 2008), Lausanne, Switzerland, August 25-29, 2008, copyright by EURASIP, [http://www.eurasip.org/Proceedings/Eusipco/Eusipco2008/welcome.html 2]
+
<math>\displaystyle \max_w \frac{(\underline{w}^Ts_B\underline{w})}{(\underline{w}^Ts_w\underline{w})}</math> <br />
 +
One may argue that we could instead subtract one term from the other. That approach is also valid, but it can be shown that it requires an additional scaling factor, so using the ratio is more convenient.
  
3. Connie, Tee; Jin, Andrew Teoh Beng; Ong, Michael Goh Kah; Ling, David Ngo Chek; "An automated palmprint recognition system", Journal of Image and Vision Computing, 2005. [http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6V09-4FJ5XPN-1&_user=1067412&_coverDate=05/01/2005&_rdoc=1&_fmt=high&_orig=search&_origin=search&_sort=d&_docanchor=&view=c&_searchStrId=1489147048&_rerunOrigin=google&_acct=C000051246&_version=1&_urlVersion=0&_userid=1067412&md5=a781a68c29fbf127473ae9baa5885fe7&searchtype=a 3]
+
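The ratio does not change when <math>\underline{w}</math> is rescaled, so only the direction of <math>\underline{w}</math> matters, and without a constraint the convex numerator <math>\underline{w}^Ts_B\underline{w}</math> on its own has no maximum. To get around this problem, we fix the scale of <math>\underline{w}</math> by adding the constraint <math>\underline{w}^Ts_w\underline{w} = 1</math> and maximize <math>\underline{w}^Ts_B\underline{w}</math> subject to it. To solve this optimization problem we form the Lagrangian: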
  
4. met, Francesca; Boqué, Ricard; Ferré, Joan; "Application of non-negative matrix factorization combined with Fisher's linear discriminant analysis for classification of olive oil excitation-emission fluorescence spectra", Journal of Chemometrics and Intelligent Laboratory Systems, 2006.
+
<br /><br />
[http://www.sciencedirect.com/science/article/B6TFP-4HR769Y-1/2/b5244d459265abb3a1bf5238132c737e 4]
+
<math>\displaystyle L(\underline{w},\lambda) = \underline{w}^Ts_B\underline{w} - \lambda (\underline{w}^Ts_w\underline{w} -1)</math><br /><br />
  
5. Chiang, Leo H.;Kotanchek, Mark E.;Kordon, Arthur K.; "Fault diagnosis based on Fisher discriminant analysis and support vector machines"
+
<br />
Journal of Computers & Chemical Engineering, 2004
+
Then, we set the partial derivative of L with respect to <math>\underline{w}</math> equal to zero:
[http://www.sciencedirect.com/science/article/B6TFT-4B4XPRS-1/2/bca7462236924d29ea23ec633a6eb236 5]
+
<math>\displaystyle \frac{\partial L}{\partial \underline{w}}=2s_B \underline{w} - 2\lambda s_w \underline{w} = 0  </math> <br />
  
6. Yang, Jian ;Frangi, Alejandro F.; Yang, Jing-yu; "A new kernel Fisher discriminant algorithm with application to face recognition", 2004
+
<math>s_B \underline{w} = \lambda s_w \underline{w}</math><br />
[http://www.sciencedirect.com/science/article/B6V10-4997WS1-1/2/78f2d27c7d531a3f5faba2f6f4d12b45 6]
+
<math>s_w^{-1}s_B \underline{w}= \lambda\underline{w}</math><br /><br />
 +
This is in the form of a generalized eigenvalue problem. Therefore, <math> \underline{w}</math> is the eigenvector of <math>s_w^{-1}s_B </math> corresponding to the largest eigenvalue.<br />
  
7. Cawley, Gavin C.; Talbot, Nicola L. C.; "Efficient leave-one-out cross-validation of kernel fisher discriminant classifiers", Journal of  Pattern Recognition , 2003 [http://www.sciencedirect.com/science/article/B6V14-492718R-1/2/bd6e5d0495023a1db92ab7169cc96dde 7]
+
This solution can be further simplified as follows:<br />
  
8. Kodipaka, S.; Vemuri, B.C.; Rangarajan, A.; Leonard, C.M.; Schmallfuss, I.; Eisenschenk, S.; "Kernel Fisher discriminant for shape-based classification in epilepsy", Journal Medical Image Analysis, 2007. [http://www.sciencedirect.com/science/article/B6W6Y-4MH8BS0-1/2/055fb314828d785a5c3ca3a6bf3c24e9 8]
+
<math>s_w^{-1}(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T\underline{w} = \lambda\underline{w} </math><br />
  
==Fisher's (Linear) Discriminant Analysis (FDA) - Multi-Class Problem==
+
Since <math>(\mu_1 - \mu_2)^T\underline{w}</math> is a scalar, <math>\underline{w} \propto s_w^{-1}(\mu_1 - \mu_2)</math>. <br /><br />
 +
This gives the direction of <math>\underline{w}</math> without doing eigenvalue decomposition in the case of 2-class problem.
  
 +
Note: In order for <math>{s_w}</math> to have an inverse, it must have full rank. A necessary condition for this is that the number of data points is at least the dimensionality of <math>\underline{x_{i}}</math>.
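A short supplementary derivation, not given in the lecture, shows why the largest eigenvalue is the relevant one. Multiplying the eigenvalue equation <math>s_B \underline{w} = \lambda s_w \underline{w}</math> on the left by <math>\underline{w}^T</math> and using the constraint <math>\underline{w}^Ts_w\underline{w} = 1</math> gives

<math>\begin{align} \underline{w}^Ts_B\underline{w} &= \lambda\, \underline{w}^Ts_w\underline{w} \\
&= \lambda \end{align}</math>

so at every stationary point the objective equals the corresponding eigenvalue, and it is largest for the eigenvector with the largest eigenvalue. In the two-class case <math>s_B</math> has rank one, so <math>s_w^{-1}s_B</math> has a single non-zero eigenvalue. Substituting <math>\underline{w} \propto s_w^{-1}(\mu_1 - \mu_2)</math> into the eigenvalue equation gives

<math>\displaystyle \lambda = (\mu_1 - \mu_2)^T s_w^{-1} (\mu_1 - \mu_2)</math>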
  
====Lecture Summary====
+
===FDA Using Matlab===
 +
Note: ''The following example was not actually mentioned in this lecture''
  
This lecture describes a generalization of Fisher's discriminant analysis to more than 2 classes. For the multi-class, or <math>k</math>-class problem, we are trying to find a projection from a <math>d</math>-dimensional space to a <math> (k-1)</math>-dimensional space.  
+
We see now an application of the theory that we just introduced. Using Matlab, we find the principal components and the projection by Fisher Discriminant Analysis of two Bivariate normal distributions to show the difference between the two methods.  
Recall that for the <math>2</math>-class problem, the objective function was <math>\displaystyle \max \frac{(w^Ts_Bw)}{(w^Ts_ww)}</math> .
+
      %First of all, we generate the two data sets:
In the <math>k</math>-class problem, <math>\mathbf{W}</math> is a <math>d\times (k-1)</math> transformation matrix, <math>\mathbf{W} =[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>, and the objective function becomes <math>\displaystyle \max \frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]}</math>
+
      % First data set X1
 
+
      X1 = mvnrnd([1,1],[1 1.5; 1.5 3], 300);
As in the <math>2</math>-class case, this is also a generalized eigenvalue problem, and the solution can be computed as the first <math>(k-1)</math> eigenvectors of <math>\mathbf{S}_{W}^{-1}\mathbf{S}_{B},</math>
+
      %In this case:
i.e. <math>\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =\lambda_{i}\mathbf{w}_{i}</math>.
+
      mu_1=[1;1];
 +
      Sigma_1=[1 1.5; 1.5 3];
 +
      %where mu and sigma are the mean and covariance matrix.
 +
      % Second data set X2
 +
      X2 = mvnrnd([5,3],[1 1.5; 1.5 3], 300);
 +
      %Here mu_2=[5;3] and Sigma_2=[1 1.5; 1.5 3]
 +
      %The plot of the two distributions is:
 +
      plot(X1(:,1),X1(:,2),'.b'); hold on;
 +
      plot(X2(:,1),X2(:,2),'ob')
 +
     
 +
[[File:Mvrnd.jpg]]
  
====Obtaining Covariance Matrices====
+
      %We compute the principal components:
 +
      % Combine data sets to map both into the same subspace
 +
      X=[X1;X2];
 +
      X=X';
 +
      % We used built-in PCA function in Matlab
 +
      [coefs, scores]=princomp(X');   % princomp expects one observation per row, so pass the 600x2 matrix
 +
 
 +
      plot([0 coefs(1,1)], [0 coefs(2,1)],'b')
 +
      plot([0 coefs(1,1)]*10, [0 coefs(2,1)]*10,'r')
 +
      sw=2*[1 1.5;1.5 3]  % sw=Sigma1+Sigma2=2*Sigma1
 +
      w=sw\[4; 2]      % calculate s_w^{-1}(mu2 - mu1)
 +
      plot ([0 w(1)], [0 w(2)],'g')
  
 +
[[File:Pca_full_1.jpg]]
 +
     
 +
      %We now make the projection:
 +
      Xf=w'*X
 +
      figure
 +
      plot(Xf(1:300),1,'ob') %In this case, since it's a one dimension data, the plot is "Data Vs Indexes"
 +
      hold on
 +
      plot(Xf(301:600),1,'or')
 +
     
  
The within-class covariance matrix <math>\mathbf{S}_{W}</math> is easy to obtain:
+
[[File:Fisher_no_overlap.jpg]]
:<math>
 
\begin{align}
 
\mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{W,i}
 
\end{align}
 
</math>
 
 
 
where <math>\mathbf{S}_{W,i} = \frac{1}{n_{i}}\sum_{j:
 
y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} -
 
\mathbf{\mu}_{i})^{T}</math> and <math>\mathbf{\mu}_{i} = \frac{\sum_{j:
 
y_{j}=i}\mathbf{x}_{j}}{n_{i}}</math>.
 
  
However, the between-class covariance matrix
+
      %We see in the above picture that there is very little overlap
<math>\mathbf{S}_{B}</math> is not easy to compute directly. To bypass this problem we use the following method. We know that the total variance is constant, and so we decompose the variance into two parts: within-class and between-class (similar to ANOVA). We have:
+
      Xp=coefs(:,1)'*X
 +
      figure
 +
      plot(Xp(1:300),1,'ob')   % a marker is needed to display discrete points
 +
      hold on
 +
      plot(Xp(301:600),2,'or')
 +
 
 +
 
 +
[[File:Pca_overlap.jpg]]
  
:<math>
+
      %In this case there is an overlap, since we project onto the first principal component [Xp=coefs(:,1)'*X]
\begin{align}
 
\mathbf{S}_{T} = \mathbf{S}_{B} + \mathbf{S}_{W}
 
\end{align}
 
</math>
 
  
where the total variance is given by
+
===Some of FDA applications===
 +
There are many applications for FDA in many domains; a few examples are stated below:
 +
 
 +
* Speech/Music/Noise Classification in Hearing Aids
 +
FDA can be used to enhance listening comprehension when the user goes from one sound environment to a different one. In practice, many people who require hearing aids do not wear them, due in part to the nuisance of having to adjust the settings each time the user changes noise environments (for example, from a quiet walk in the park to a crowded cafe). If the hearing aid could distinguish between the types of sound environment and automatically adjust its own settings, many more people might be willing to wear and use hearing aids. The paper referenced below examines the difference between using a classifier based on one level and three classes ("speech", "noisy" or "music" environments) and a classifier based on two levels with two classes each ("speech" versus "non-speech" and then, for the "non-speech" group, "noisy" versus "music"), and also includes a discussion of the feasibility of implementing these classifiers in hearing aids. For more information review this paper by Alexandre et al. [http://www.eurasip.org/Proceedings/Eusipco/Eusipco2008/papers/1569101740.pdf here].
  
:<math>
+
* Application to Face Recognition
\begin{align}
+
FDA can be used in face recognition for different situations. Instead of using the one-dimensional LDA where the data is transformed into long column vectors with less-than-full-rank covariance matrices for the within-class and between-class covariance matrices, several other approaches of using FDA are suggested here including a two-dimensional approach where the data is stored as a matrix rather than a column vector. In this case, the covariance matrices are full-rank. Details can be found in the paper by Kong et al. [http://person.hst.aau.dk/pimuller/2D_FDA_Face_CVPR05fish.pdf here].
\mathbf{S}_{T} =
 
\frac{1}{n}
 
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}
 
\end{align}
 
</math>
 
  
We can now get <math>\mathbf{S}_{B}</math> from the relationship:
+
* Palmprint Recognition
 +
FDA is used in biometrics to implement an automated palmprint recognition system. Tee et al. [http://www.sciencedirect.com/science?_ob=MImg&_imagekey=B6V09-4FJ5XPN-1-1&_cdi=5641&_user=1067412&_pii=S0262885605000089&_origin=search&_coverDate=05%2F01%2F2005&_sk=999769994&view=c&wchp=dGLbVzz-zSkWb&md5=a064b67c9bdaaba7e06d800b6c9b209b&ie=/sdarticle.pdf here] proposed an automated palmprint recognition system in which FDA is used to match images in a compressed subspace that best discriminates among classes. It differs from PCA in that it deals directly with class separation, while PCA treats the images in their entirety without considering the underlying class structure.
  
:<math>
+
* Other Applications
\begin{align}
 
\mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W}
 
\end{align}
 
</math>
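The decomposition above can be illustrated numerically. The sketch below is not from the lecture and rests on two assumptions that should be read carefully: it uses unnormalized scatter matrices (plain sums, without the <math>\frac{1}{n}</math> and <math>\frac{1}{n_{i}}</math> factors that appear in the definitions on this page), and it compares <math>\mathbf{S}_{T} - \mathbf{S}_{W}</math> with the commonly used explicit form <math>\sum_{i} n_{i}(\mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{\mu}_{i} - \mathbf{\mu})^{T}</math>. The toy data and helper names are hypothetical.

```python
def outer(u, v):
    """Outer product u v^T as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]


def add(A, B):
    """Element-wise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def mean(points):
    """Component-wise mean of a list of points."""
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]


def scatter(points, center):
    """Sum of (x - center)(x - center)^T over the points (no 1/n factor)."""
    dim = len(center)
    S = [[0.0] * dim for _ in range(dim)]
    for p in points:
        diff = [pj - cj for pj, cj in zip(p, center)]
        S = add(S, outer(diff, diff))
    return S


# Hypothetical toy data: three classes in two dimensions.
classes = {
    1: [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
    2: [[4.0, 4.0], [5.0, 4.0]],
    3: [[0.0, 6.0], [1.0, 7.0], [2.0, 6.0], [1.0, 5.0]],
}
all_points = [p for pts in classes.values() for p in pts]
mu = mean(all_points)                       # total mean

S_T = scatter(all_points, mu)               # total scatter
S_W = [[0.0, 0.0], [0.0, 0.0]]              # within-class scatter
S_B = [[0.0, 0.0], [0.0, 0.0]]              # between-class scatter (explicit form)
for pts in classes.values():
    mu_i = mean(pts)
    S_W = add(S_W, scatter(pts, mu_i))
    diff = [a - b for a, b in zip(mu_i, mu)]
    S_B = add(S_B, [[len(pts) * v for v in row] for row in outer(diff, diff)])

S_T_minus_S_W = [[t - s for t, s in zip(rt, rs)] for rt, rs in zip(S_T, S_W)]
print(S_T_minus_S_W)
print(S_B)  # agrees with the line above up to floating-point rounding
```

With these unnormalized definitions the two printed matrices coincide. With the normalized definitions used in the text, <math>\mathbf{S}_{B}</math> is simply whatever remains after subtracting <math>\mathbf{S}_{W}</math> from <math>\mathbf{S}_{T}</math>, and it need not match this explicit form.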
 
  
{{Cleanup|date=October 2010|reason=Please check the derivation of decomposition of variance for errors. The total variance is missing a factor of 1/n, does this affect the formula for <math>\mathbf{S}_{B}</math>? }}
+
Other applications include authenticating different olive oil types [4], classifying multiple fault classes [5], face recognition [6], and analyzing shape deformations to localize epilepsy [8].
{{Cleanup|date=October 2010|reason=You will still get the right results even without adding 1/n. However, the classes will be as if mirrored }}
 
  
 +
=== '''References'''===
 +
1. Kong, H.; Wang, L.; Teoh, E.K.; Wang, J.-G.; Venkateswarlu, R.; , "A framework of 2D Fisher discriminant analysis: application to face recognition with small number of training samples," Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on , vol.2, no., pp. 1083- 1088 vol. 2, 20-25 June 2005
 +
doi: 10.1109/CVPR.2005.30
 +
[http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1467563&isnumber=31473 1]
  
Actually, there is another way to obtain <math>\mathbf{S}_{B}</math>. Denote a
+
2. Enrique Alexandre, Roberto Gil-Pita, Lucas Cuadra, Lorena Álvarez, Manuel Rosa-Zurera, "SPEECH/MUSIC/NOISE CLASSIFICATION IN HEARING AIDS USING A TWO-LAYER CLASSIFICATION SYSTEM WITH MSE LINEAR DISCRIMINANTS", 16th European Signal Processing Conference (EUSIPCO 2008), Lausanne, Switzerland, August 25-29, 2008, copyright by EURASIP, [http://www.eurasip.org/Proceedings/Eusipco/Eusipco2008/welcome.html 2]
total mean vector <math>\mathbf{\mu}</math> by
 
  
:<math>
+
3. Connie, Tee; Jin, Andrew Teoh Beng; Ong, Michael Goh Kah; Ling, David Ngo Chek; "An automated palmprint recognition system", Journal of Image and Vision Computing, 2005. [http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6V09-4FJ5XPN-1&_user=1067412&_coverDate=05/01/2005&_rdoc=1&_fmt=high&_orig=search&_origin=search&_sort=d&_docanchor=&view=c&_searchStrId=1489147048&_rerunOrigin=google&_acct=C000051246&_version=1&_urlVersion=0&_userid=1067412&md5=a781a68c29fbf127473ae9baa5885fe7&searchtype=a 3]
\begin{align}
 
\mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} =
 
\frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j}
 
\end{align}
 
</math>
 
  
Thus the total covariance matrix <math>\mathbf{S}_{T}</math> is
+
4. met, Francesca; Boqué, Ricard; Ferré, Joan; "Application of non-negative matrix factorization combined with Fisher's linear discriminant analysis for classification of olive oil excitation-emission fluorescence spectra", Journal of Chemometrics and Intelligent Laboratory Systems, 2006.
 +
[http://www.sciencedirect.com/science/article/B6TFP-4HR769Y-1/2/b5244d459265abb3a1bf5238132c737e 4]
 +
 
 +
5. Chiang, Leo H.;Kotanchek, Mark E.;Kordon, Arthur K.; "Fault diagnosis based on Fisher discriminant analysis and support vector machines"
 +
Journal of Computers & Chemical Engineering, 2004
 +
[http://www.sciencedirect.com/science/article/B6TFT-4B4XPRS-1/2/bca7462236924d29ea23ec633a6eb236 5]
 +
 
 +
6. Yang, Jian ;Frangi, Alejandro F.; Yang, Jing-yu; "A new kernel Fisher discriminant algorithm with application to face recognition", 2004
 +
[http://www.sciencedirect.com/science/article/B6V10-4997WS1-1/2/78f2d27c7d531a3f5faba2f6f4d12b45 6]
 +
 
 +
7. Cawley, Gavin C.; Talbot, Nicola L. C.; "Efficient leave-one-out cross-validation of kernel fisher discriminant classifiers", Journal of  Pattern Recognition , 2003 [http://www.sciencedirect.com/science/article/B6V14-492718R-1/2/bd6e5d0495023a1db92ab7169cc96dde 7]
 +
 
 +
8. Kodipaka, S.; Vemuri, B.C.; Rangarajan, A.; Leonard, C.M.; Schmallfuss, I.; Eisenschenk, S.; "Kernel Fisher discriminant for shape-based classification in epilepsy", Journal of Medical Image Analysis, 2007. [http://www.sciencedirect.com/science/article/B6W6Y-4MH8BS0-1/2/055fb314828d785a5c3ca3a6bf3c24e9 8]
 +
 
 +
9. Fisher LDA and Kernel Fisher LDA [http://www.ics.uci.edu/~welling/classnotes/papers_class/Fisher-LDA.pdf]
 +
 
 +
==Fisher's (Linear) Discriminant Analysis (FDA) - Multi-Class Problem  - October 7, 2010==
 +
 
 +
===Obtaining Covariance Matrices===
 +
 
 +
 
 +
The within-class covariance matrix <math>\mathbf{S}_{W}</math> is easy to obtain:
 
:<math>
 
:<math>
 
\begin{align}
 
\begin{align}
\mathbf{S}_{T} =
+
\mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{i}
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}
 
 
\end{align}
 
\end{align}
 
</math>
 
</math>
  
Thus we obtain
+
where <math>\mathbf{S}_{i} = \frac{1}{n_{i}}\sum_{j:
 +
y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} -
 +
\mathbf{\mu}_{i})^{T}</math> and <math>\mathbf{\mu}_{i} = \frac{\sum_{j:
 +
y_{j}=i}\mathbf{x}_{j}}{n_{i}}</math>.
 +
 
 +
However, the between-class covariance matrix
 +
<math>\mathbf{S}_{B}</math> is not easy to compute directly. To bypass this problem we use the following method. We know that the total covariance <math>\,\mathbf{S}_{T}</math> of a given set of data is constant and known, and we can also decompose this variance into two parts: the within-class variance <math>\mathbf{S}_{W}</math> and the between-class variance <math>\mathbf{S}_{B}</math> in a way that is similar to [http://en.wikipedia.org/wiki/Analysis_of_variance ANOVA]. We thus have:
 +
 
 
:<math>
 
:<math>
 
\begin{align}
 
\begin{align}
& \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} -
+
\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} -
 
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T}
 
\\&
 
= \sum_{i=1}^{k}\sum_{j:
 
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+
 
\sum_{i=1}^{k}\sum_{j:
 
y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}
 
\\&
 
= \mathbf{S}_{W} + \sum_{i=1}^{k}
 
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}
 
 
\end{align}
 
\end{align}
 
</math>
 
</math>
  
Since the total covariance <math>\mathbf{S}_{T}</math> is the sum of the within class covariance <math>\mathbf{S}_{W}</math>
+
where the total variance is given by
and the between class covariance <math>\mathbf{S}_{B}</math>, we can denote the second term as
 
the general between class covariance matrix <math>\mathbf{S}_{B}</math>, thus we obtain
 
  
 
:<math>
 
:<math>
 
\begin{align}
 
\begin{align}
\mathbf{S}_{B} = \sum_{i=1}^{k}
+
\mathbf{S}_{T} =  
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}
+
\frac{1}{n}
 +
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}
 
\end{align}
 
\end{align}
 
</math>
 
</math>
  
Therefore,
+
We can now get <math>\mathbf{S}_{B}</math> from the relationship:
 +
 
 
:<math>
 
:<math>
 
\begin{align}
 
\begin{align}
\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}
+
\mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W}
 
\end{align}
 
\end{align}
 
</math>
 
</math>
  
Recall that in the two class case problem, we have
+
 
 +
Actually, there is another way to obtain <math>\mathbf{S}_{B}</math>. Suppose the data contains <math>\, k </math> classes, and each class <math>\, j </math> contains <math>\, n_{j} </math> data points. We denote the overall mean vector by
 +
 
 
:<math>
 
:<math>
 
\begin{align}
 
\begin{align}
& \mathbf{S}_{B^{\ast}} =
+
\mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} =
(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}
+
\frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j}
\\ & =
+
\end{align}
(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})^{T}
 
\\ & =
 
((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))^{T}
 
\\ & =
 
(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}+(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}
 
\end{align}
 
 
</math>
 
</math>
  
From the general form,
+
Thus the total covariance matrix <math>\mathbf{S}_{T}</math> is
 
:<math>
 
:<math>
 
\begin{align}
 
\begin{align}
& \mathbf{S}_{B} =
+
\mathbf{S}_{T} =
n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}
+
\frac{1}{n} \sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}
+
 
n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}
 
 
\end{align}
 
\end{align}
 
</math>
 
</math>
Apparently, they are very similar.
 
  
{{Cleanup|date=October 2010|reason=Please confirm that the algebra for the calculation of <math>\mathbf{S}_{B}^{\ast}</math> and <math>\mathbf{S}_{W}^{\ast}</math> is correct}}
+
Thus we obtain
{{Cleanup|date=October 2010|reason=I Think it is not correct here are two terms missing with no explanations of why!!!!}}
 
 
 
Now, we are trying to find the optimal transformation. Basically, we have
 
 
:<math>
 
:<math>
 
\begin{align}
 
\begin{align}
\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i},
+
& \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} -
i=1,2,...,k-1
+
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} -
 +
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T}  
 +
\\&
 +
= \sum_{i=1}^{k}\sum_{j:
 +
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+
 +
\sum_{i=1}^{k}\sum_{j:
 +
y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}
 +
\\&
 +
= \mathbf{S}_{W} + \sum_{i=1}^{k}
 +
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}
 
\end{align}
 
\end{align}
 
</math>
 
</math>
  
where <math>\mathbf{z}_{i}</math> is a <math>(k-1)\times 1</math> vector, <math>\mathbf{W}</math>
+
Since the total covariance <math>\mathbf{S}_{T}</math> is the sum of the within-class covariance <math>\mathbf{S}_{W}</math>
is a <math>d\times (k-1)</math> transformation matrix, i.e. <math>\mathbf{W} =
+
and the between-class covariance <math>\mathbf{S}_{B}</math>, we can denote the second term in the final line of the derivation above as the between-class covariance matrix <math>\mathbf{S}_{B}</math>, thus we obtain
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>, and <math>\mathbf{x}_{i}</math>
+
 
is a <math>d\times 1</math> column vector.
+
:<math>
 
 
Thus we obtain
 
:<math>
 
 
\begin{align}
 
\begin{align}
& \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j:
+
\mathbf{S}_{B} = \sum_{i=1}^{k}
y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T}
+
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}
\\ & = \sum_{i=1}^{k}\sum_{j:
 
y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})\mathbf{W}
 
\\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j:
 
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})\right]\mathbf{W}
 
\\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}
 
 
\end{align}
 
\end{align}
 
</math>
 
</math>
Similarly, we obtain
+
 
 +
 
 +
Recall that in the two class case problem, we have
 
:<math>
 
:<math>
 
\begin{align}
 
\begin{align}
& \mathbf{S}_{B}^{\ast} =
+
& \mathbf{S}_{B}^* =
\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}
+
(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}
 +
\\ & =
 +
(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})^{T}
 +
\\ & =
 +
((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))^{T}
 +
\\ & =
 +
((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))((\mathbf{\mu}_{1}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{2}-\mathbf{\mu})^{T})
 
\\ & =
 
\\ & =
\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}
+
(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}+(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}
\\ & = \mathbf{W}^{T}\left[
 
\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W}
 
\\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}
 
 
\end{align}
 
\end{align}
 
</math>
 
</math>
  
Now, we use the determinant of the matrix, i.e. the product of the
 
eigenvalues of the matrix, as our measure.
 
  
{{Cleanup|date=September 2010|reason=There is no justification for using determinant. Moreover there is inconsistency here. Should we use Trace (as suggested below) or Determinant (as suggested here) }}
 
 
:<math>
 
:<math>
 
\begin{align}
 
\begin{align}
\phi(\mathbf{W}) =
+
& \mathbf{S}_{B} =
\frac{|\mathbf{S}_{B}^{\ast}|}{|\mathbf{S}_{W}^{\ast}|} =
+
n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}
\frac{|\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}|}{|\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}|}
+
+
 +
n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}
 
\end{align}
 
\end{align}
 
</math>
 
</math>
 +
Apparently, they are very similar.
  
The solution for this question is that the columns of the transformation matrix
+
Now, we are trying to find the optimal transformation. Basically, we have
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to largest <math>k-1</math>
 
eigenvalues with respect to
 
 
 
{{Cleanup|date=What if we encounter complex eigenvalues? Then the concept of being large does not make sense. What is the solution in that case? }}
 
 
 
 
 
 
:<math>
 
:<math>
 
\begin{align}
 
\begin{align}
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =
+
\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i},
\lambda_{i}\mathbf{w}_{i}
+
i=1,2,...,k-1
 
\end{align}
 
\end{align}
 
</math>
 
</math>
  
Also, note that we can use
+
where <math>\mathbf{z}_{i}</math> is a <math>(k-1)\times 1</math> vector, <math>\mathbf{W}</math>
 +
is a <math>d\times (k-1)</math> transformation matrix, i.e. <math>\mathbf{W} =
 +
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>, and <math>\mathbf{x}_{i}</math>
 +
is a <math>d\times 1</math> column vector.
 +
 
 +
Thus we obtain
 
:<math>
 
:<math>
 
\begin{align}
 
\begin{align}
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}
+
& \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j:
 +
y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T}
 +
\\ & = \sum_{i=1}^{k}\sum_{j:
 +
y_{j}=i}(\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i}))(\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i}))^{T}
 +
\\ & = \sum_{i=1}^{k}\sum_{j:
 +
y_{j}=i}(\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i}))((\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W})
 +
\\ & = \sum_{i=1}^{k}\sum_{j:
 +
y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W}
 +
\\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j:
 +
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\right]\mathbf{W}
 +
\\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}
 
\end{align}
 
\end{align}
 
</math>
 
</math>
as our measure.
+
Similarly, we obtain
 +
:<math>
 +
\begin{align}
 +
&  \mathbf{S}_{B}^{\ast} =
 +
\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}
 +
\\ & =
 +
\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}
 +
\\ & = \mathbf{W}^{T}\left[
 +
\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W}
 +
\\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}
 +
\end{align}
 +
</math>
 +
 
 +
Now, we use the following as our measure:
 +
:<math>
 +
\begin{align}
 +
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}
 +
\end{align}
 +
</math>
 +
 
 +
The solution for this question is that the columns of the transformation matrix
 +
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to the largest <math>k-1</math>
 +
eigenvalues with respect to
  
{{Cleanup|date=October 2010|reason=Please confirm that the identity below is true. Isn't the Euclidean matrix norm the square root of the largest eigenvalue of <math>X^*X</math>? Yes, this is true: the Euclidean matrix norm is the largest singular value of X. The square root of the trace of <math>X^*X</math> is the Frobenius norm. So really it should be as follows.}}
+
{{Cleanup|reason=What if we encounter complex eigenvalues? Then the concept of being large does not make sense. What is the solution in that case? }}
 +
{{Cleanup|date=December 2010|reason=Covariance matrices are positive semi-definite, and <math>\mathbf{S}_{W}</math> must be positive definite for its inverse to exist. The product <math>\mathbf{S}_{W}^{-1}\mathbf{S}_{B}</math> is not symmetric in general, but it is similar to the symmetric positive semi-definite matrix <math>\mathbf{S}_{W}^{-1/2}\mathbf{S}_{B}\mathbf{S}_{W}^{-1/2}</math> and therefore has the same eigenvalues. As a result, the eigenvalues of <math>\mathbf{S}_{W}^{-1}\mathbf{S}_{B}</math> will always be real, non-negative values.}}
  
Recall that
 
 
:<math>
 
:<math>
 
\begin{align}
 
\begin{align}
\|\mathbf{X}\|^2_{F} = Tr(\mathbf{X}^{T}\mathbf{X})
+
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =
 +
\lambda_{i}\mathbf{w}_{i}
 
\end{align}
 
\end{align}
 
</math>
 
</math>
  
Thus we obtain that
+
 
 +
Recall that the squared Frobenius norm of <math>\mathbf{X}</math> is
 +
:<math>
 +
\begin{align}
 +
\|\mathbf{X}\|^2_{F} = Tr(\mathbf{X}^{T}\mathbf{X})
 +
\end{align}
 +
</math>
 +
 
 +
 
 
:<math>
 
:<math>
 
\begin{align}
 
\begin{align}
Line 1,304: Line 1,544:
 
</math>
 
</math>
  
Similarly, we can get <math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]</math>. Thus we have following classic criterion function that Fisher used
+
Similarly, we can get <math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]</math>. Thus we have the following classic criterion function that Fisher used
 
:<math>
 
:<math>
 
\begin{align}
 
\begin{align}
Line 1,359: Line 1,599:
  
 
Therefore, the solution for this question is the same as in the previous case. The columns of the transformation matrix
 
Therefore, the solution for this question is the same as in the previous case. The columns of the transformation matrix
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to largest <math>k-1</math>
+
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to the largest <math>k-1</math>
 
eigenvalues with respect to
 
eigenvalues with respect to
 
:<math>
 
:<math>
Line 1,371: Line 1,611:
  
 
{{Cleanup|date=October 2010|reason=Would you please show how we could reconstruct our original data from the data whose dimensionality has been reduced by FDA?}}
 
{{Cleanup|date=October 2010|reason=Would you please show how we could reconstruct our original data from the data whose dimensionality has been reduced by FDA?}}
 +
{{Cleanup|date=October 2010|reason=When you reduce the dimensionality of data, in general you lose some features of the data, and you cannot reconstruct the data from the reduced space unless the data have special properties, such as sparsity, that help in reconstruction. In FDA, it seems that we cannot reconstruct the data in general from the reduced version of the data.}}
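The multi-class procedure derived in this section can be illustrated with a minimal MATLAB sketch. The toy data set (3 features, 3 classes, 4 points per class) and all variable names are assumptions made only for illustration. The sketch uses the unnormalized scatter matrices from the derivation (without the 1/n factors), checks numerically that the total scatter equals the within-class plus between-class scatter, and then takes the eigenvectors of <math>\mathbf{S}_{W}^{-1}\mathbf{S}_{B}</math> with the k-1 largest eigenvalues as the columns of <math>\mathbf{W}</math>.

```matlab
% Hypothetical toy data: columns are data points, d = 3 features, k = 3 classes.
base = [0 1 0 0; 0 0 1 0; 0 0 0 1];
X = [base, base + [4;0;0]*ones(1,4), base + [0;4;0]*ones(1,4)];
y = [ones(1,4), 2*ones(1,4), 3*ones(1,4)];
[d, n] = size(X);
k = max(y);

mu = mean(X, 2);                      % overall mean vector
Sw = zeros(d);                        % within-class scatter (unnormalized)
Sb = zeros(d);                        % between-class scatter (unnormalized)
for i = 1:k
    Xi  = X(:, y == i);               % points of class i
    ni  = size(Xi, 2);
    mui = mean(Xi, 2);                % class mean
    Xc  = Xi - mui*ones(1, ni);       % centre the class on its own mean
    Sw  = Sw + Xc*Xc';
    Sb  = Sb + ni*(mui - mu)*(mui - mu)';
end
Xt = X - mu*ones(1, n);
St = Xt*Xt';                          % total scatter

% The decomposition S_T = S_W + S_B should hold up to rounding error.
decomp_err = norm(St - (Sw + Sb), 'fro');

% eig does not sort eigenvalues of a non-symmetric matrix, so sort explicitly.
[V, D] = eig(Sw \ Sb);
[lambda, order] = sort(real(diag(D)), 'descend');
W = real(V(:, order(1:k-1)));         % d x (k-1) transformation matrix
Z = W'*X;                             % projected data, (k-1) x n
```

Because <math>\mathbf{S}_{B}</math> is built from k class means, it has rank at most k-1, so at most k-1 eigenvalues are non-zero; in this toy example the third sorted eigenvalue is zero up to rounding.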
  
===Generalization of Fisher's Linear Discriminant Analysis ===
+
====Advantages of FDA compared with PCA====
 +
 
 +
-PCA finds components that are useful for representing the data.
 +
 
 +
-However, there is no reason to assume that these components are also useful for discriminating between classes.
 +
 
 +
-In FDA, we use the class labels to find the components that are useful for discriminating between classes.
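The contrast in the list above can be seen in a small MATLAB sketch. The data are a hypothetical construction: two classes that are spread widely along the first axis but differ only along the second. Under that assumption, the first principal component follows the direction of largest spread, while the two-class Fisher direction <math>\mathbf{S}_{W}^{-1}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})</math> follows the direction that separates the classes.

```matlab
% Hypothetical 2-D data: large spread along axis 1, class offset along axis 2.
t  = -5:5;
s  = 0.1*(-1).^(1:11);                % small alternating jitter on axis 2
X1 = [t; s];                          % class 1
X2 = [t; 1 + s];                      % class 2
X  = [X1, X2];
mu1 = mean(X1, 2);  mu2 = mean(X2, 2);  mu = mean(X, 2);

% PCA: leading eigenvector of the total scatter matrix (labels ignored).
Xc = X - mu*ones(1, size(X, 2));
[V, D] = eig(Xc*Xc');
[~, imax] = max(diag(D));
w_pca = V(:, imax);

% FDA (two classes): direction proportional to Sw^{-1}(mu1 - mu2).
X1c = X1 - mu1*ones(1, 11);
X2c = X2 - mu2*ones(1, 11);
Sw  = X1c*X1c' + X2c*X2c';
w_fda = Sw \ (mu1 - mu2);
w_fda = w_fda / norm(w_fda);
```

Here `w_pca` is aligned with the first axis (up to sign) and `w_fda` with the second, so projecting onto `w_pca` would mix the two classes while projecting onto `w_fda` separates them.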
 +
 
 +
===Generalization of Fisher's Linear Discriminant Analysis ===
  
 
Fisher's Linear Discriminant Analysis (Fisher, 1936) is very popular among users of discriminant analysis. Some of the reasons for this are its simplicity
 
Fisher's Linear Discriminant Analysis (Fisher, 1936) is very popular among users of discriminant analysis. Some of the reasons for this are its simplicity
and lack of necessity for strict assumptions. However, it has optimality properties only if the underlying distributions of the groups are multivariate normal. It is also easy to verify that the resulting discriminant rule can be seriously harmed by only a small number of outlying observations. Outliers are very hard to detect in multivariate data sets, and even when they are detected, simply discarding them is not the most efficient way of handling the situation. Therefore, there is a need for robust procedures that can accommodate the outliers and are not strongly affected by them. For this purpose, a generalization of Fisher's linear discriminant algorithm [[http://www.math.ist.utl.pt/~apires/PDFs/APJB_RP96.pdf]]has been developed that leads easily to a very robust procedure.
+
and lack of necessity for strict assumptions. However, it has optimality properties only if the underlying distributions of the groups are multivariate normal. It is also easy to verify that the resulting discriminant rule can be seriously harmed by only a small number of outlying observations. Outliers are very hard to detect in multivariate data sets, and even when they are detected, simply discarding them is not the most efficient way of handling the situation. Therefore, there is a need for robust procedures that can accommodate the outliers and are not strongly affected by them. For this purpose, a generalization of Fisher's linear discriminant algorithm [[http://www.math.ist.utl.pt/~apires/PDFs/APJB_RP96.pdf]] has been developed that leads easily to a very robust procedure.
 +
 
 +
Also notice that LDA can be seen as a dimensionality reduction technique. In general k-class problems, we have k means which lie on a linear subspace with dimension k-1. Given a data point, we are looking for the closest class mean to this point. In LDA, we project the data point to the linear subspace and calculate distances within that subspace. If the dimensionality of the data, d, is much larger than the number of classes, k, then we have a considerable drop in dimensionality from d dimensions to k - 1 dimensions.
  
Also notice that LDA can be seen as a dimensionality reduction technique. In general k-class problems, we have k means which lie on a linear subspace with dimension k-1. Given a data point, we are looking for the closest class mean to this point. In LDA, we project the data point to the linear subspace and calculate distances within that subspace. If the dimensionality of the data, d, is much larger than the number of classes, k, then we have a considerable drop in dimension.
+
===Multiple Discriminant Analysis===
  
 +
Multiple Discriminant Analysis (MDA) is also termed Discriminant Factor Analysis and Canonical Discriminant Analysis. It adopts a similar perspective to PCA: the rows of the data matrix to be examined constitute points in a multidimensional space, as do the group mean vectors. Discriminating axes are determined in this space in such a way that optimal separation of the predefined groups is attained. As with PCA, the problem becomes mathematically the eigenreduction of a real, symmetric matrix. The eigenvalues represent the discriminating power of the associated eigenvectors. The <math>n_{Y}</math> groups lie in a space of dimension at most <math>n_{Y}-1</math>. This will be the number of discriminant axes or factors obtainable in the most common practical case, when <math>n > m > n_{Y}</math> (where n is the number of rows and m the number of columns of the input data matrix).
  
==Linear and Logistic Regression - October 12, 2010==
+
===Matlab Example: Multiple Discriminant Analysis for Face Recognition===
  
 +
% The following MATLAB code is an example of using MDA in face recognition. The dataset used can be found [http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html here]. It contains a set of face images taken between April 1992 and April 1994 at the lab. The database was used in the context of a face recognition project carried out in collaboration with the Speech, Vision and Robotics Group of the Cambridge University Engineering Department.
  
===Lecture Summary===
+
load orl_faces_112x92.mat
In this Lecture, Prof Ali Ghodsi reviews the LDA as a dimensionality reduction method and introduces 2 models for regression, linear and logistic regression.
+
u=(mean(faces'))';
 +
stfaces=faces-u*ones(1,400);
 +
S=stfaces'*stfaces;
 +
[V,E] = eig(S);
 +
U=zeros(length(stfaces),150);%%%%%%
 +
for i=400:-1:251
 +
    U(:,401-i)=stfaces*V(:,i)/sqrt(E(i,i));
 +
end
 +
 +
defaces=U'*stfaces;
 +
for i=1:40
 +
    for j=1:5
 +
        lsamfaces(:,j+5*i-5)=defaces(:,j+10*i-10);
 +
        ltesfaces(:,j+5*i-5)=defaces(:,j+10*i-5);
 +
    end
 +
end
 +
stlsamfaces=lsamfaces-lsamfaces*wdiag(ones(5,5),40)/5;
 +
Sw=stlsamfaces*stlsamfaces';
 +
zstlsamfaces=lsamfaces-(mean(lsamfaces'))'*ones(1,200);
 +
St=zstlsamfaces*zstlsamfaces';
 +
Sb=St-Sw;
 +
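[V D]=eig(Sw\Sb);
 +
% eig does not sort the eigenvalues of a non-symmetric matrix, so order them explicitly
 +
[~,order]=sort(real(diag(D)),'descend');
 +
U=real(V(:,order(1:39)));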
 +
desamfaces=U'*lsamfaces;
 +
detesfaces=U'*ltesfaces;
 +
rightnum=0;
 +
for i=1:200
 +
    mindis=10^10;minplace=1;
 +
    for j=1:200
 +
        distan=norm(desamfaces(:,i)-detesfaces(:,j));
 +
        if mindis>distan
 +
            mindis=distan;
 +
            minplace=j;
 +
        end
 +
    end
 +
    if floor(minplace/5-0.2)==floor(i/5-0.2)
 +
        rightnum=rightnum+1;
 +
    end
 +
end
 +
rightrate=rightnum/200
  
[http://en.wikipedia.org/wiki/Regression_analysis Regression analysis] is a general statistical technique for modelling and analyzing how a dependent variable changes according to changes in independent variables. In classification, we are interested in how a label, <math>\,y</math>, changes according to changes in <math>\,X</math>.
+
===K-NNs Discriminant Analysis===
  
General information on [http://en.wikipedia.org/wiki/Linear_regression linear regression] can be found at the [http://numericalmethods.eng.usf.edu/topics/linear_regression.html University of South Florida] and [http://academicearth.org/lectures/applications-to-linear-estimation-least-squares this MIT lecture].
+
Non-parametric (distribution-free) methods dispense with the need for assumptions regarding the probability density function. They have become very popular especially in the image processing area. The K-NNs method assigns an object of unknown affiliation to the group to which the majority of its K nearest neighbours belongs.
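The K-NN rule described above can be sketched in a few lines of MATLAB. The training points, labels, query point, and the choice K = 3 are hypothetical values chosen only for illustration, and Euclidean distance is an assumption, since the text does not fix a particular metric.

```matlab
% Hypothetical training set: columns are points, two well-separated groups.
Xtrain = [0 0; 0 1; 1 0; 1 1; 5 5; 5 6; 6 5; 6 6]';   % 2 x 8
ytrain = [1 1 1 1 2 2 2 2];
xnew   = [0.5; 0.8];                  % object of unknown affiliation
K      = 3;

% Squared Euclidean distance from xnew to every training point.
diffs = Xtrain - xnew*ones(1, size(Xtrain, 2));
dist2 = sum(diffs.^2, 1);

% Majority vote among the K nearest neighbours.
[~, order] = sort(dist2, 'ascend');
label = mode(ytrain(order(1:K)));
```

For this query point the three nearest neighbours all belong to group 1, so `label` is 1. With two groups, an odd K avoids ties in the vote.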
  
===Linear Regression===
+
There is no best discrimination method. A few remarks concerning the advantages and disadvantages of the methods studied are as follows.
We will start by considering a very simple regression model, the linear regression model.
 
According to Bayes Classification, <br/>
 
<math>P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\Sigma_{k}f_{k}(x)\pi_{k}}</math><br/>
 
  
For the purpose of classification, the linear regression model assumes
+
:1. Analytical simplicity or computational reasons may lead to initial consideration of linear discriminant analysis or the NN-rule.
that the regression function <math>\,E(Y|X)</math> is linear in the inputs
+
:2. Linear discrimination is the most widely used in practice. Often the 2-group method is used repeatedly for the analysis of pairs of multigroup data (yielding <math>\frac{k(k-1)}{2}</math> decision surfaces for k groups).
<math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{p}</math>.
+
:3. To estimate the parameters required in quadratic discrimination, more computation and data are required than in the case of linear discrimination. If there is not a great difference in the group covariance matrices, then the latter will perform as well as quadratic discrimination.
 +
:4. The k-NN rule is simple to define and implement, and it is especially useful if there are insufficient data to adequately define sample means and covariance matrices.
 +
:5. MDA is most appropriately used for feature selection. As in the case of PCA, we may want to focus on the variables used in order to investigate the differences between groups; to create synthetic variables which improve the grouping ability of the data; to arrive at a similar objective by discarding irrelevant variables; or to determine the most parsimonious variables for graphical representational purposes.
  
The simple linear regression model has the general form:
+
=== Fisher Score ===
 +
Fisher Discriminant Analysis should be distinguished from the Fisher score. The Fisher score is a means by which we can evaluate the importance of each feature in a binary classification task. The Fisher score, or <math>\ FS</math> for short, of feature <math>\ i</math> is
  
:<math>
+
<math>FS_i=\frac{(\mu_i^1-\mu_i)^2+(\mu_i^2-\mu_i)^2}{var_i^1+var_i^2}</math>
\begin{align}
 
y_i = \beta^{T}\mathbf{x}_{i}+\beta_{0}
 
\end{align}
 
</math>
 
and we can denote it as
 
:<math>
 
\begin{align}
 
\mathbf{y} = \beta^{T}\mathbf{X}
 
\end{align}
 
</math>
 
where <math>\,\beta^{T} = (
 
\beta_1,..., \beta_{d},\beta_0)</math> is a <math>1 \times (d+1)</math> vector and <math>\mathbf{X}=
 
\begin{pmatrix}
 
\mathbf{x}_{1}, \dots,\mathbf{x}_{n}\\
 
1, \dots, 1
 
\end{pmatrix}
 
</math> is a <math>(d+1) \times n</math> matrix; here <math>\mathbf{x}_{i} </math> is a <math>d \times 1</math> vector.
 
  
Given input data <math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{n}</math> and <math>\,y_{1}, ..., y_{n}</math> our goal is to find <math>\,\beta^{T}</math> such that the linear model fits the data while minimizing sum of squared errors using the [http://en.wikipedia.org/wiki/Least_squares Least Squares method].
+
where <math>\ \mu_i^1</math> and <math>\ \mu_i^2</math> are the averages of feature <math>\ i</math> in classes 1 and 2 respectively, and <math>\ \mu_i</math> is the average of feature <math>\ i</math> over both classes. Similarly, <math>\ var_i^1</math> and <math>\ var_i^2</math> are the variances of feature <math>\ i</math> in classes 1 and 2 respectively.
  
Note that vectors <math>\mathbf{x}_{i}</math> could be numerical inputs,
+
We can compute the FS for all of the features and then select those features with the highest FS. We want features that discriminate as much as possible between the two classes while describing each class as compactly as possible; this is exactly the criterion captured by the definition of the Fisher score.
transformations of the original data, i.e. <math>\log \mathbf{x}_{i}</math> or <math>\sin
 
\mathbf{x}_{i}</math>, or basis expansions, i.e. <math>\mathbf{x}_{i}^{2}</math> or
 
<math>\mathbf{x}_{i}\times \mathbf{x}_{j}</math>.
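The Fisher score formula given in the Fisher Score section can be evaluated feature by feature, as in the following MATLAB sketch. The two small data matrices are hypothetical, and the use of the sample variance (normalized by n-1) is an assumption, since the text does not say which variance estimate is intended.

```matlab
% Hypothetical data: rows are features, columns are observations.
X1 = [1.0 1.2 0.8 1.1 0.9;            % feature 1, class 1
      3.0 5.0 1.0 4.0 2.0];           % feature 2, class 1
X2 = [3.0 3.1 2.9 3.2 2.8;            % feature 1, class 2
      2.0 4.0 3.0 5.0 1.0];           % feature 2, class 2

mu1 = mean(X1, 2);                    % per-feature mean in class 1
mu2 = mean(X2, 2);                    % per-feature mean in class 2
mu  = mean([X1, X2], 2);              % per-feature mean over both classes
v1  = var(X1, 0, 2);                  % per-feature sample variance, class 1
v2  = var(X2, 0, 2);                  % per-feature sample variance, class 2

% Fisher score of every feature, then features ranked from best to worst.
FS = ((mu1 - mu).^2 + (mu2 - mu).^2) ./ (v1 + v2);
[~, ranking] = sort(FS, 'descend');
```

Feature 1 has well-separated class means and small within-class variance, so it receives a large score (40 for these numbers), whereas feature 2 has identical class means and scores 0.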
 
  
We then try to minimize the residual sum-of-squares
 
  
:<math>
+
===References===
\begin{align}
 
\mathrm{RSS}(\beta)=(\mathbf{y}-\beta^{T}\mathbf{X})(\mathbf{y}-\beta^{T}\mathbf{X})^{T}
 
\end{align}
 
</math>
 
  
This is a quadratic function in the <math>\,d+1</math> parameters. Differentiating
+
1. Optimal Fisher discriminant analysis using the rank decomposition
with respect to <math>\,\beta</math> we obtain
+
[http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6V14-48MPMK5-14R&_user=10&_coverDate=01%2F31%2F1992&_rdoc=1&_fmt=high&_orig=search&_origin=search&_sort=d&_docanchor=&view=c&_searchStrId=1550315473&_rerunOrigin=scholar.google&_acct=C000050221&_version=1&_urlVersion=0&_userid=10&md5=b8b00da9ab59b76a40eca456f5aa99b6&searchtype=a]
:<math>
 
\begin{align}
 
\frac{\partial \mathrm{RSS}}{\partial \beta} =
 
-2\mathbf{X}(\mathbf{y}-\beta^{T}\mathbf{X})^{T}
 
\end{align}
 
</math>
 
  
:<math>
+
2. Face recognition using Kernel-based Fisher Discriminant Analysis
\begin{align}
+
[http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1004157]
\frac{\partial^{2}\mathrm{RSS}}{\partial \beta \partial
 
\beta^{T}}=2\mathbf{X}\mathbf{X}^{T}
 
\end{align}
 
</math>
 
  
Set the first derivative to zero
+
3. Fisher discriminant analysis with kernels
:<math>
+
[http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=788121]
\begin{align}
 
\mathbf{X}(\mathbf{y}^{T}-\mathbf{X}^{T}\beta)=0
 
\end{align}
 
</math>
 
  
we obtain the solution
+
4. Fisher LDA and Kernel Fisher LDA [http://www.ics.uci.edu/~welling/classnotes/papers_class/Fisher-LDA.pdf]
:<math>
 
\begin{align}
 
\hat \beta = (\mathbf{X}\mathbf{X}^{T})^{-1}\mathbf{X}\mathbf{y}^{T}
 
\end{align}
 
</math>
 
  
{{Cleanup|date=12 Oct 2010|reason=we use :<math>\begin{align}
+
5. Previous STAT 841 notes. [http://www.math.uwaterloo.ca/~aghodsib/courses/f07stat841/notes/lecture7.pdf]
\mathbf{y} = \beta^{T}\mathbf{X}
 
\end{align}</math> in this course, but
 
:<math>\begin{align}
 
\mathbf{y} = \mathbf{X}\beta
 
\end{align}</math> was used in last year's notes, which leads to the result below}}
 
  
Thus the fitted values at the inputs are
+
6. Another useful pdf introducing FDA [http://www.cedar.buffalo.edu/~srihari/CSE555/Chap3.Part6.pdf]
:<math>
 
\begin{align}
 
\mathbf{\hat y} = \mathbf{X}\hat\beta = \mathbf{X}
 
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}
 
\end{align}
 
</math>
 
  
where <math>\mathbf{H} = \mathbf{X}
+
==Random Projection==
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}</math> is called the [http://en.wikipedia.org/wiki/Hat_matrix hat matrix].
+
Random Projection (RP) is an approach for projecting a point from a high-dimensional space to a lower-dimensional space. In general, a target subspace, represented as a uniform random orthogonal matrix, is determined first, and the projected vector is then v = c p u (the product of c, p and u), where u is a d-dimensional vector, p is the uniform random orthogonal matrix with d’ rows and d columns, v is the projected d’-dimensional vector, and c is a scaling factor chosen such that the expected squared length of v is equal to the squared length of u. Vectors projected by RP have two main properties:
 +
1. The distance between any two of the original vectors is approximately equal to the distance of their corresponding projected vectors by RP.
 +
2. If each of the entries in the random matrix is drawn independently from the distribution N(0,1), then the expected squared length of v is equal to the squared length of u.
 +
For more details of RP, please see The Random Projection Method by Santosh S. Vempala.
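The two properties above can be checked numerically with a minimal MATLAB sketch. It assumes the Gaussian variant from property 2 (independent N(0,1) entries rather than an exactly orthogonal matrix), for which the scaling factor is c = 1/sqrt(d'). The dimensions d = 1000 and d' = 200 are arbitrary illustrative choices, and the variable names are hypothetical.

```matlab
d  = 1000;                            % original dimension
dp = 200;                             % reduced dimension d'
u1 = randn(d, 1);                     % two arbitrary high-dimensional vectors
u2 = randn(d, 1);

P = randn(dp, d);                     % random matrix with N(0,1) entries
c = 1/sqrt(dp);                       % scaling so that E||v||^2 = ||u||^2

v1 = c*P*u1;                          % projected vectors
v2 = c*P*u2;

% Both ratios should be close to 1 (they are random, not exactly 1).
ratio_len  = norm(v1)^2 / norm(u1)^2;
ratio_dist = norm(v1 - v2) / norm(u1 - u2);
```

The ratios fluctuate from run to run; the fluctuation shrinks as d' grows, which is why the distance preservation in property 1 is only approximate.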
  
<br/>
 
*'''Note'''  For classification purposes, this is not a correct model.  Recall the following application of Bayes classifier:<br/>
 
<math>r(x)= P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\Sigma_{k}f_{k}(x)\pi_{k}}</math><br/>
 
It is clear that to make sense mathematically, <math>\displaystyle r(x)</math> must be a value between 0 and 1.  If this is estimated with the
 
regression function <math>\displaystyle r(x)=E(Y|X=x)</math> and <math>\mathbf{\hat\beta} </math> is learned as above, then there is nothing that would restrict <math>\displaystyle r(x)</math> to taking values between 0 and 1. On the other hand, this is a more direct approach to classification, since it does not need to estimate <math>\ f_k(x) </math> and <math>\ \pi_k </math>.
 
<math>\ 1 \times P(Y=1|X=x)+0 \times P(Y=0|X=x)=E(Y|X) </math>
 
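This model does not restrict its output to lie between 0 and 1, so it is not a good model of the posterior probability, although it can sometimes lead to a decent classifier. One possible coding of the targets is <math>\ y_i=\frac{1}{n_1} </math> for points in one class and <math>\ y_i=\frac{-1}{n_2} </math> for points in the other, where <math>\ n_1 </math> and <math>\ n_2 </math> are the class sizes.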
 
  
===Logistic Regression===
+
==Linear and Logistic Regression - October 12, 2010==
The [http://en.wikipedia.org/wiki/Logistic_regression logistic regression] model arises from the desire to model the posterior probabilities of the <math>\displaystyle K</math> classes via linear functions in <math>\displaystyle x</math>, while at the same time ensuring that they sum to one and remain in [0,1]. Logistic regression models are usually fit by maximum likelihood, using the conditional likelihood <math>\displaystyle Pr(Y|X)</math>.  Since  <math>\displaystyle Pr(Y|X)</math> completely specifies the conditional distribution, the multinomial distribution is appropriate. This model is widely used in biostatistical applications for two classes. For instance: people survive or die, have a disease or not, have a risk factor or not.
 
  
==== logistic function ====
+
===Linear Regression===
 +
Linear regression is an approach for modeling the response variable <math>\, y</math> under the assumption that <math>\, y</math> is  a [http://en.wikipedia.org/wiki/Linear_function linear function] of a set of [http://en.wikipedia.org/wiki/Regressor explanatory variables] <math>\,X</math>. Any observed deviation from this assumed linear relationship between <math>\, y</math> and <math>\,X</math> is attributed to an unobserved [http://en.wikipedia.org/wiki/Random_variable random variable] <math>\, \epsilon</math> that adds random noise.
  
A [http://en.wikipedia.org/wiki/Logistic_function logistic function] or logistic curve is the most common sigmoid curve.  
+
In linear regression, the goal is to use a set of training data <math>\{y_i,\, x_{i1}, \ldots, x_{id}\}, i=1, \ldots, n</math> to find a linear combination <math>\,\beta^T = \begin{pmatrix}\beta_1 & \cdots & \beta_d & \beta_0\end{pmatrix}</math> that best explains the variation in <math>\, y</math>. In <math>\,\beta</math>, <math>\,\beta_0</math> is the intercept of the fitted line that approximates the assumed linear relationship between <math>\, y</math> and <math>\,X</math>. <math>\,\beta_0</math> enables this fitted line to be situated away from the origin. In classification, the goal is to classify data into groups so that group members are more similar within groups than between groups.
  
:<math>y = \frac{1}{1+e^{-x}}</math>
+
If the data is 2-dimensional, a model of <math>\, y</math> as a function of <math>\,X</math> constructed using training data under the assumption of linear regression typically looks like the one in the following figure:
  
1. <math>\frac{dy}{dx} = y(1-y)=\frac{e^{x}}{(1+e^{x})^{2}}</math>
+
[[File: Linear_regression.png]]
  
2. <math>y(0) = \frac{1}{2}</math>
+
The linear regression model is a very simple regression model.
 +
According to Bayes Classification we estimate the posterior probability as<br/>
 +
<math>P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\Sigma_{l}f_{l}(x)\pi_{l}}</math><br/>
  
3. <math> \int y dx = ln(1 + e^{x})</math>
+
For the purpose of classification, the linear regression model assumes
 +
that the regression function <math>\,E(Y|X)</math> is a linear combination of the inputs
 +
<math>\,X</math>.
  
4. <math> y(x) = \frac{1}{2} + \frac{1}{4}x - \frac{1}{48}x^{3} + \frac{1}{480}x^{5} - \cdots </math>
+
That is, the full model under linear regression has the general form
  
5. The logistic curve shows early exponential growth for negative x, which slows to linear growth of slope 1/4 near x = 0, then approaches y = 1 with an exponentially decaying gap.
+
:<math>
 +
\begin{align}
 +
y_i = \beta_1 x_{i1} + \cdots + \beta_d x_{id} + \beta_0 + \varepsilon_i
 +
= \beta^T x_i + \varepsilon_i,
 +
\qquad i = 1, \ldots, n,
 +
\end{align}
 +
</math>
 +
and the fitted model that can be used to estimate the response <math>\, y</math> of any new data point has the form
 +
:<math>
 +
\begin{align}
 +
\hat y_i = \beta_1 x_{i1} + \cdots + \beta_d x_{id} + \beta_0
 +
= \beta^T x_i,
 +
\qquad i = 1, \ldots, n.
 +
\end{align}
 +
</math>.
  
====Intuition behind Logistic Regression====
+
In matrix form, the full model can be expressed as
Recall that, for classification purposes, the linear regression model presented in the above section is not correct because it does not force <math>\,r(x)</math> to be between 0 and 1 and sum to 1. Consider the following [http://en.wikipedia.org/wiki/Logit log odds] model (for two classes):
+
:<math>
 +
\begin{align}
 +
\mathbf{y} = \mathbf{X}^T \beta + \varepsilon
 +
\end{align}
 +
</math>
 +
and the fitted model can be expressed as
 +
:<math>
 +
\begin{align}
 +
\mathbf{\hat y} = \mathbf{X}^T \beta
 +
\end{align}
 +
</math>
  
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\beta^Tx</math>
+
Here, <math>\,\beta^T = \begin{pmatrix}\beta_1 & \cdots & \beta_d & \beta_0\end{pmatrix}</math> is a <math>1 \times (d+1)</math> vector and <math>\mathbf{X}=
 +
\begin{pmatrix}
 +
\mathbf{x}_{1} \cdots \mathbf{x}_{n}\\
 +
1 \cdots 1
 +
\end{pmatrix}
 +
</math> is a <math>(d+1) \times n</math> matrix. Here, <math>\mathbf{x}_{i} </math> is a <math>d \times 1</math> vector.
  
Calculating <math>\,P(Y=1|X=x)</math> leads us to the logistic regression model, which as opposed to the linear regression model, allows the modelling of the posterior probabilities of the classes through linear methods and at the same time ensures that they sum to one and are between 0 and 1. It is a type of [http://en.wikipedia.org/wiki/Generalized_linear_model Generalized Linear Model (GLM)].
+
Given the input data <math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{n}</math> and <math>\,y_{1}, ..., y_{n}</math>, our goal is to find <math>\,\beta^{T}</math> such that the linear model fits the data while minimizing sum of squared errors using the [http://en.wikipedia.org/wiki/Least_squares Least Squares method].
 +
Note that vectors <math>\mathbf{x}_{i}</math> could be numerical inputs,
 +
transformations of the original data, i.e. <math>\log \mathbf{x}_{i}</math> or <math>\sin
 +
\mathbf{x}_{i}</math>, or basis expansions, i.e. <math>\mathbf{x}_{i}^{2}</math> or
 +
<math>\mathbf{x}_{i}\times \mathbf{x}_{j}</math>.
  
====The Logistic Regression Model====
+
To determine the values for <math>\,\beta^{T}</math>, we minimize the residual sum-of-squares
The logistic regression model for the two class case is defined as
 
  
'''Class 1'''
+
:<math>
[[File:Picture1.png‎|150px|thumb|right|<math>P(Y=1 | X=x)</math>]]
+
\begin{align}
:<math>P(Y=1 | X=x) =\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=P(x;\underline{\beta})</math>  
+
\mathrm{RSS}(\beta)=(\mathbf{y}-\mathbf{X}^T \beta)^{T}(\mathbf{y}-\mathbf{X}^T \beta)
 +
\end{align}
 +
</math>
  
 +
This is a quadratic function in <math>\,d+1</math> parameters. The parameters that minimize the RSS can be determined by differentiating with respect to <math>\,\beta</math>.  We then obtain
  
Then we have that
+
:<math>
 +
\begin{align}
 +
\frac{\partial \mathrm{RSS}}{\partial \beta} =
 +
-2\mathbf{X}(\mathbf{y}-\mathbf{X}^T \beta)
 +
\end{align}
 +
</math>
  
'''Class 0'''
+
:<math>
[[File:Picture2.png‎ |150px|thumb|right|<math>P(Y=0 | X=x)</math>]]
+
\begin{align}
:<math>P(Y=0 | X=x) = 1-P(Y=1 | X=x)=1-\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math>
+
\frac{\partial^{2}\mathrm{RSS}}{\partial \beta \partial
 +
\beta^{T}}=2\mathbf{X}\mathbf{X}^{T}
 +
\end{align}
 +
</math>
  
====Fitting a Logistic Regression====
+
Setting the first derivative to zero,
Logistic regression tries to fit a distribution.  The fitting of logistic regression models is usually accomplished by [http://en.wikipedia.org/wiki/Maximum_likelihood maximum likelihood], using Pr(Y|X). The maximum likelihood of <math>\underline\beta</math> maximizes the probability of obtaining the data <math>\displaystyle{x_{1},...,x_{n}}</math> from the known distribution.  Combining <math>\displaystyle P(Y=1 | X=x)</math> and <math>\displaystyle P(Y=0 | X=x)</math> as follows, we can consider the two classes at the same time:
+
:<math>
 +
\begin{align}
 +
\mathbf{X}(\mathbf{y}-\mathbf{X}^{T}\hat{\beta})=0
 +
\end{align}
 +
</math>
  
:<math>p(\underline{x_{i}};\underline{\beta}) = \left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{y_i} \left(\frac{1}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{1-y_i}</math>
+
we obtain the solution
 +
:<math>
 +
\begin{align}
 +
\hat \beta = (\mathbf{X}\mathbf{X}^{T})^{-1}\mathbf{X}\mathbf{y}
 +
\end{align}
 +
</math>
 +
Thus the fitted values at the inputs are
 +
:<math>
 +
\begin{align}
 +
\mathbf{\hat y} = \mathbf{X}^{T}\hat{\beta} =
 +
\mathbf{X}^{T}(\mathbf{X}\mathbf{X}^{T})^{-1}\mathbf{X}\mathbf{y} =
 +
\mathbf{H}\mathbf{y}
 +
\end{align}
 +
</math>
  
Assuming the data <math>\displaystyle {x_{1},...,x_{n}}</math> is drawn independently, the likelihood function is
+
where <math>\mathbf{H} = \mathbf{X}^{T}(\mathbf{X}\mathbf{X}^{T})^{-1}\mathbf{X} </math> is called the [http://en.wikipedia.org/wiki/Hat_matrix hat matrix].
  
 +
A more efficient way to do this is by [http://en.wikipedia.org/wiki/QR_decomposition QR Factorization]
 +
 +
<math>
 +
X^T = QR </math> where Q is a matrix with orthonormal columns and R is an upper triangular matrix
 +
 +
<math>
 +
\begin{align}
 +
\hat{\beta} &= ((QR)^{T}(QR))^{-1}(QR)^{T}y \\
  &= (R^{T}Q^{T}QR)^{-1}R^{T}Q^{T}y \\
  &= (R^{T}R)^{-1}R^{T}Q^{T}y \\
  &= R^{-1}(R^{-T}R^{T})Q^{T}y \\
  &= R^{-1}Q^{T}y
\end{align}
</math>

Therefore <math>\hat{\beta}</math> can be found by solving the triangular system <math> R\hat{\beta} = Q^{T}y</math> by back substitution, without forming <math>(\mathbf{X}\mathbf{X}^{T})^{-1}</math> explicitly.
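
The two routes to the least-squares estimate can be compared numerically. The following is a minimal MATLAB sketch on synthetic data, using the convention of these notes that <math>\mathbf{X}</math> is <math>(d+1) \times n</math> with a final row of ones. The sample size, the coefficients in <code>beta_true</code> and the noise level are made up for illustration.

```matlab
% Least squares in the convention of these notes: X is (d+1)-by-n, one column per point.
n = 50;  d = 2;
x = randn(d, n);                       % illustrative inputs (synthetic data)
X = [x; ones(1, n)];                   % append the row of ones for the intercept
beta_true = [2; -1; 0.5];              % hypothetical coefficients [beta_1; beta_2; beta_0]
y = X' * beta_true + 0.1 * randn(n, 1);

% Normal equations: beta = (X X^T)^{-1} X y (solved without forming the inverse)
beta_ne = (X * X') \ (X * y);

% QR route: X^T = QR, then solve R beta = Q^T y
[Q, R] = qr(X', 0);                    % economy-size QR of the n-by-(d+1) matrix X^T
beta_qr = R \ (Q' * y);

% Hat matrix and fitted values
H = X' * ((X * X') \ X);
y_hat = H * y;

disp([beta_ne, beta_qr]);              % the two columns should agree
```

The QR route avoids forming <math>\mathbf{X}\mathbf{X}^{T}</math>, which is generally the better-conditioned choice when the rows of <math>\mathbf{X}</math> are close to linearly dependent.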
 +
 +
<br/>
 +
*'''Note'''  For classification purposes, this is not a correct model.  Recall the following application of Bayes classifier:<br/>
 +
<math>r(x)= P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\Sigma_{l}f_{l}(x)\pi_{l}}</math><br/>
 +
It is clear that to make sense mathematically, <math>\displaystyle r(x)</math> must be a value between 0 and 1 and must also sum up to 1.  If this is estimated with the
 +
regression function <math>\displaystyle r(x)=E(Y|X=x)</math> and <math>\mathbf{\hat\beta} </math> is learned as above, then there is nothing that would restrict <math>\displaystyle r(x)</math> to meet these two criteria. On the other hand, this is a more direct approach to classification, since it does not need to estimate <math>\ f_k(x) </math> and <math>\ \pi_k </math>.
 +
<math>\ 1 \times P(Y=1|X=x)+0 \times P(Y=0|X=x)=E(Y|X) </math>.
 +
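This model does not restrict its output to lie between 0 and 1, so it is not a good model of the posterior probability, although at times it can lead to a decent classifier. One possible coding of the targets is <math>\ y_i=\frac{1}{n_1} </math> for points in one class and <math>\ y_i=\frac{-1}{n_2} </math> for points in the other, where <math>\ n_1 </math> and <math>\ n_2 </math> are the class sizes.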
 +
[[File:Example.jpg]]
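
The point of the note above can be seen on a toy problem. This MATLAB sketch regresses 0/1 labels on a single input; the ten data points are invented for illustration. The fitted values leave the interval [0,1] at both ends, yet thresholding them at 0.5 still recovers the labels.

```matlab
% Regressing 0/1 labels on a single input: fitted values can leave [0,1].
x = 1:10;                               % illustrative one-dimensional inputs
y = [zeros(5, 1); ones(5, 1)];          % class labels coded as 0 and 1
X = [x; ones(1, 10)];                   % (d+1)-by-n matrix as in the notes
beta = (X * X') \ (X * y);              % least-squares estimate
y_hat = X' * beta;                      % fitted values, an estimate of E(Y|X=x)
fprintf('min fitted value: %.4f\n', min(y_hat));   % below 0
fprintf('max fitted value: %.4f\n', max(y_hat));   % above 1
labels = double(y_hat > 0.5);           % thresholding still gives a usable rule
```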
 +
 +
==== Recursive Linear Regression ====
 +
In some applications, the weights of the linear regression must be estimated online. For real-time applications, computational efficiency becomes very important. In such cases we have a batch of data, more samples are still being observed, and, based on all the points observed so far, we need to, for example, predict the class labels of upcoming samples. To do this in real time, we should reuse the computations already done up to any given sample point and estimate the new weights (after seeing the new sample point) from the previous weights (before seeing it). So we want to update the weights as follows:
 +
 +
<math>\ W_{new}=h(W_{old},x_{new},y_{new})</math>
 +
 +
Here <math>\ W_{new}</math> and <math>\ W_{old}</math> are the linear regression weights after and before observing the new sample pair <math>\ (x_{new},y_{new})</math>. The function <math>\ h</math> can be obtained using the following procedure.
 +
 +
<math>\begin{align}
 +
W_{old}&=(XX^T-x_{new}x_{new}^T)^{-1}(XY-x_{new}y_{new}) \\
 +
\rightarrow (XX^T-x_{new}x_{new}^T)W_{old}&=XY-x_{new}y_{new}  \\
 +
\rightarrow XX^TW_{old}&=XY-x_{new}y_{new}+x_{new}x_{new}^TW_{old}  \\
 +
\rightarrow W_{old}&=(XX^T)^{-1}(XY-x_{new}y_{new}+x_{new}x_{new}^TW_{old})  \\
 +
\rightarrow W_{old}&=W_{new}+(XX^T)^{-1}(-x_{new}y_{new}+x_{new}x_{new}^TW_{old}) \\
 +
\rightarrow W_{new}&=W_{old}-(XX^T)^{-1}(-x_{new}y_{new}+x_{new}x_{new}^TW_{old})
 +
\end{align}</math>
 +
 +
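Here, <math>\ X</math> and <math>\ Y</math> represent the whole set of sample pairs, including the most recently seen pair <math>\ (x_{new},y_{new})</math>. The fifth line uses <math>\ W_{new}=(XX^T)^{-1}XY</math>. Equivalently, <math>\ W_{new}=W_{old}+(XX^T)^{-1}x_{new}(y_{new}-x_{new}^TW_{old})</math>, so the old weights are corrected in proportion to the prediction error on the new sample.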
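
The update can be checked against a full refit. The MATLAB sketch below uses synthetic data, and the names <code>X_old</code>, <code>Y_old</code> and <code>W_batch</code> are introduced only for this example. It applies the last line of the derivation and compares the result with the weights obtained by solving the least-squares problem again on all samples.

```matlab
% Recursive update of least-squares weights when one new sample arrives.
d = 3;  n = 20;
X_old = [randn(d, n); ones(1, n)];          % previously seen inputs, (d+1)-by-n
Y_old = randn(n, 1);                        % previously seen responses (synthetic)
W_old = (X_old * X_old') \ (X_old * Y_old); % weights before the new sample

x_new = [randn(d, 1); 1];                   % new input, with the constant 1 appended
y_new = randn;                              % new response

A = X_old * X_old' + x_new * x_new';        % equals X X^T including the new sample
W_new = W_old + A \ (x_new * (y_new - x_new' * W_old));

% Check against refitting from scratch on all n+1 samples
X = [X_old, x_new];
Y = [Y_old; y_new];
W_batch = (X * X') \ (X * Y);
fprintf('difference from batch refit: %g\n', norm(W_new - W_batch));
```

This sketch still solves a linear system with the full matrix at every step. A common refinement, which is not covered in these notes, is to update the inverse itself with the Sherman-Morrison formula so that no new system needs to be solved.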
 +
 +
====Comments about Linear regression model====
 +
 +
The linear regression model is among the simplest and most popular ways to analyze the relationship between variables. However, it has disadvantages as well as advantages, and we should be clear about them before applying the model.
 +
 +
''Advantages'': Linear least squares regression has earned its place as the primary tool for process modeling because of its effectiveness and completeness. Though there are types of data that are better described by functions that are nonlinear in the parameters, many processes in science and engineering are well-described by linear models. This is because either the processes are inherently linear or because, over short ranges, any process can be well-approximated by a linear model. The estimates of the unknown parameters obtained from linear least squares regression are the optimal estimates from a broad class of possible parameter estimates under the usual assumptions used for process modeling. Practically speaking, linear least squares regression makes very efficient use of the data. Good results can be obtained with relatively small data sets. Finally, the theory associated with linear regression is well-understood and allows for construction of different types of easily-interpretable statistical intervals for predictions, calibrations, and optimizations. These statistical intervals can then be used to give clear answers to scientific and engineering questions.
 +
 +
''Disadvantages'': The main disadvantages of linear least squares are limitations in the shapes that linear models can assume over long ranges, possibly poor extrapolation properties, and sensitivity to outliers. Linear models with nonlinear terms in the predictor variables curve relatively slowly, so for inherently nonlinear processes it becomes increasingly difficult to find a linear model that fits the data well as the range of the data increases. As the explanatory variables become extreme, the output of the linear model will also always be more extreme. This means that linear models may not be effective for extrapolating the results of a process for which data cannot be collected in the region of interest. Of course, extrapolation is potentially dangerous regardless of the model type. Finally, while the method of least squares often gives optimal estimates of the unknown parameters, it is very sensitive to the presence of unusual data points in the data used to fit a model. One or two outliers can sometimes seriously skew the results of a least squares analysis. This makes model validation, especially with respect to outliers, critical to obtaining sound answers to the questions motivating the construction of the model.
 +
 +
=====Inverse-Computation Trick for Matrices that are Nearly-Singular=====
 +
 +
The calculation of <math>\, \underline{\beta}</math> in linear regression and in logistic regression (described in the next lecture) requires the calculation of a matrix inverse.  For linear regression, <math>\, (\mathbf{X}\mathbf{X}^T)^{-1} </math> must be calculated.  Likewise, <math>\, (XWX^T)^{-1}</math> must be produced during the iterative method used for logistic regression. When the matrix <math>\, \mathbf{X}\mathbf{X}^T </math> or <math>\, XWX^T</math> is nearly singular, the error resulting from numerical roundoff can be very large.  In the case of logistic regression, it may not be possible to determine a solution because the iterative method relies on convergence; with such large error in the calculation of the inverse, the entries of <math>\, \underline{\beta}</math> may grow without bound.  To improve the conditioning of the nearly singular matrix prior to calculating its inverse, one trick is to add to it a very small multiple of the identity matrix, such as <math>\, (10^{-10})I</math>.  This modification has very little effect on the exact result for the inverse matrix, but it improves the numerical calculation considerably.  The inverses to be calculated are then <math>\, (\mathbf{X}\mathbf{X}^T + (10^{-10})I)^{-1} </math> and <math>\, (XWX^T + (10^{-10})I)^{-1}</math>.
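
The effect of this trick on the condition number can be seen in a small MATLAB sketch. The matrix below is invented for illustration: two of its rows are almost identical, so <math>\, \mathbf{X}\mathbf{X}^T </math> is singular to working precision.

```matlab
% Effect of adding a small multiple of the identity to a nearly singular matrix.
X = [1 2 3 4;
     1 2 3 4 + 1e-9;     % almost a copy of the first row
     1 1 1 1];           % row of ones for the intercept
A     = X * X';                       % nearly singular (d+1)-by-(d+1) matrix
A_reg = A + 1e-10 * eye(size(A, 1));  % the modification described above
fprintf('cond(A)     = %g\n', cond(A));
fprintf('cond(A_reg) = %g\n', cond(A_reg));
```

The condition number of the modified matrix is bounded by roughly the largest eigenvalue divided by <math>\, 10^{-10}</math>, whereas that of the original matrix is limited only by roundoff.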
 +
 +
====Multiple Linear Regression Analysis====
 +
Multiple linear regression is a statistical analysis similar to simple linear regression, except that there can be more than one predictor variable. The usual assumptions regarding outliers, linearity and constant variance need to be met. One additional issue that needs to be examined is multicollinearity, the extent to which the predictor variables are related to each other. Multicollinearity can be assessed with the Variance Inflation Factor (VIF), which statistical packages such as SPSS can report. While different researchers have different criteria for what constitutes too high a VIF, a VIF of 10 or greater is certainly reason for pause. If the VIF is 10 or greater, consider collapsing the variables.
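
Outside SPSS, the VIF can also be computed directly. The following MATLAB sketch relies on the identity that the VIF of predictor <math>\,j</math>, defined as <math>\,1/(1-R_j^2)</math> with <math>\,R_j^2</math> the coefficient of determination from regressing predictor <math>\,j</math> on the others, equals the <math>\,j</math>-th diagonal entry of the inverse correlation matrix of the predictors. The three synthetic predictors are made up so that the third is nearly the sum of the first two.

```matlab
% Variance inflation factors from the correlation matrix of the predictors.
n  = 200;
z1 = randn(n, 1);
z2 = randn(n, 1);
z3 = z1 + z2 + 0.1 * randn(n, 1);   % nearly a linear combination of z1 and z2
Z  = [z1, z2, z3];                  % n-by-p matrix, one predictor per column
R  = corrcoef(Z);                   % p-by-p correlation matrix
vif = diag(inv(R));                 % VIF_j = 1 / (1 - R_j^2)
disp(vif');                         % values far above 10 signal multicollinearity
```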
 +
 +
===Logistic Regression===
 +
The [http://en.wikipedia.org/wiki/Logistic_regression logistic regression] model arises from the desire to model the posterior probabilities of the <math>\displaystyle K</math> classes via linear functions in <math>\displaystyle x</math>, while at the same time ensuring that they sum to one and remain in [0,1]. Logistic regression models are usually fit by [http://mathworld.wolfram.com/MaximumLikelihood.html maximum likelihood], using the conditional probabilities <math>\displaystyle Pr(Y|X)</math>.  Since  <math>\displaystyle Pr(Y|X)</math> completely specifies the conditional distribution, the [http://mathworld.wolfram.com/MultinomialDistribution.html multinomial distribution] is appropriate. This model is widely used in biostatistical applications for two classes. For instance: people survive or die, have a disease or not, have a risk factor or not.
 +
 +
==== Logistic Function ====
 +
[[File:200px-Logistic-curve.svg.png | Logistic Sigmoid Function]]
 +
 +
 +
 +
A [http://en.wikipedia.org/wiki/Logistic_function logistic function] or logistic curve is the most common of the [http://en.wikipedia.org/wiki/Sigmoid_function sigmoid] functions. Given below are the logistic function and four of its properties.
 +
 +
1. <math>y = \frac{1}{1+e^{-x}}</math>
 +
 +
2. <math>\frac{dy}{dx} = y(1-y)=\frac{e^{-x}}{(1+e^{-x})^{2}}</math>
 +
 +
3. <math>y(0) = \frac{1}{2}</math>
 +
 +
4. <math> \int y \, dx = \ln(1 + e^{x}) + C</math>
 +
 +
5. <math> y(x) = \frac{1}{2} + \frac{1}{4}x - \frac{1}{48}x^{3} + \frac{1}{480}x^{5} - \cdots </math>
 +
 +
The logistic curve shows early exponential growth for negative x, which slows to linear growth of slope 1/4 near x = 0, then approaches y = 1 with an exponentially decaying gap.
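
Some of these properties are easy to check numerically. The MATLAB sketch below compares the derivative <math>y(1-y)</math> with a central finite difference, evaluates the function and its slope at 0, and compares a trapezoidal integral over [-6, 6] with the antiderivative <math>\ln(1+e^{x})</math>. The grid, the step size <code>h</code> and the integration interval are arbitrary choices for this example.

```matlab
% Numerical check of the listed properties of the logistic function.
sigma = @(x) 1 ./ (1 + exp(-x));          % logistic function
x  = linspace(-6, 6, 1201);
h  = 1e-5;
dy_numeric  = (sigma(x + h) - sigma(x - h)) / (2 * h);   % central difference
dy_analytic = sigma(x) .* (1 - sigma(x));                % y(1-y)
max_err = max(abs(dy_numeric - dy_analytic));

area_numeric  = trapz(x, sigma(x));                       % integral over [-6, 6]
area_analytic = log(1 + exp(6)) - log(1 + exp(-6));       % from ln(1 + e^x)

fprintf('y(0) = %.2f, slope at 0 = %.2f\n', sigma(0), sigma(0) * (1 - sigma(0)));
fprintf('max derivative error: %g\n', max_err);
fprintf('integral: numeric %.6f, analytic %.6f\n', area_numeric, area_analytic);
```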
 +
 +
An early application of the logistic function was due to [http://en.wikipedia.org/wiki/Pierre_Fran%C3%A7ois_Verhulst Pierre-François Verhulst] who, in 1838, used the logistic function to derive a logistic equation now known as the ''Verhulst equation'' to model population growth. Verhulst was inspired by [http://en.wikipedia.org/wiki/Thomas_Malthus Thomas Malthus]'s work [http://en.wikipedia.org/wiki/An_Essay_on_the_Principle_of_Population An Essay on the Principle of Population], and his own work was published after reading Malthus' work. Independently of Verhulst, in 1925, [http://en.wikipedia.org/wiki/Alfred_J._Lotka Alfred J. Lotka] again used the logistic function to derive a logistic equation to model population growth, and he referred to his equation as the ''law of population growth''.
 +
 +
====Intuition behind Logistic Regression====
 +
Recall that, for classification purposes, the linear regression model presented in the above section is not correct because it does not force <math>\,r(x)</math> to be between 0 and 1 and also sum to 1. Consider the following [http://en.wikipedia.org/wiki/Logit log odds] model (for two classes):
 +
 +
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\beta^Tx</math>
 +
 +
Calculating <math>\,P(Y=1|X=x)</math> leads us to the logistic regression model, which as opposed to the linear regression model, allows the modelling of the posterior probabilities of the classes through linear methods and at the same time ensures that they sum to one and are between 0 and 1. It is a type of [http://en.wikipedia.org/wiki/Generalized_linear_model Generalized Linear Model (GLM)].
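
The step from the log odds to the class probability can be made explicit. Writing <math>\,p = P(Y=1|X=x)</math> as a shorthand introduced only for this derivation, so that <math>\,P(Y=0|X=x) = 1-p</math>, and solving for <math>\,p</math>:

:<math>
\begin{align}
\log\left(\frac{p}{1-p}\right) &= \beta^Tx \\
\frac{p}{1-p} &= \exp(\beta^Tx) \\
p &= (1-p)\exp(\beta^Tx) \\
p\left(1+\exp(\beta^Tx)\right) &= \exp(\beta^Tx) \\
p &= \frac{\exp(\beta^Tx)}{1+\exp(\beta^Tx)}
\end{align}
</math>

Since <math>\,\exp(\beta^Tx) > 0</math>, this value lies strictly between 0 and 1 for any <math>\,\beta</math> and <math>\,x</math>, and the two class probabilities sum to 1 by construction.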
 +
 +
====The Logistic Regression Model====
 +
 +
The logistic regression model for the two class case is defined as
 +
 +
'''Class 1'''
 +
 +
We have that
 +
[[File:Logit1.jpg‎|right|<math>P(Y=1|X=x)</math>]]
 +
:<math>P(Y=1 | X=x) =\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=P(x;\underline{\beta})</math>
 +
 +
 +
This is shown as the top figure on the right.
 +
 +
 +
 +
'''Class 0'''
 +
 +
We have that
 +
[[File:Logit0.jpg|right|<math>P(Y=0|X=x)</math>]]
 +
:<math>P(Y=0 | X=x) = 1-P(Y=1 | X=x)=1-\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math>
 +
 +
 +
This is shown as the bottom figure on the right.
 +
 +
====Fitting a Logistic Regression====
 +
Logistic regression tries to fit a distribution to the data.  The common practice in statistics is to fit a density function to data using [http://en.wikipedia.org/wiki/Maximum_likelihood maximum likelihood]. The maximum likelihood estimate of <math>\underline\beta</math>, denoted <math>\hat \beta_{ML}</math>, is the value of <math>\underline\beta</math> that maximizes the probability of observing the training data <math>\{y_i,\, x_{i1}, \ldots, x_{id}\}, i=1, \ldots, n</math> under the model.  Combining <math>\displaystyle P(Y=1 | X=x)</math> and <math>\displaystyle P(Y=0 | X=x)</math> as follows, we can consider the two classes at the same time (this is a useful trick, since <math> y_i \in \{0, 1\}</math>):
 +
 +
:<math>p(\underline{x_{i}};\underline{\beta}) = \left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{y_i} \left(\frac{1}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{1-y_i}</math>
 +
 +
Assuming the data <math>\displaystyle {x_{1},...,x_{n}}</math> is drawn independently, the likelihood function is
 +
 +
:<math>
\begin{align}
\mathcal{L}(\theta)&=p({x_{1},...,x_{n}};\theta)\\
&=p(x_{1};\theta)\, p(x_{2};\theta) \cdots p(x_{n};\theta) \quad \mbox{(by independence and identical distribution)}\\
&= \prod_{i=1}^n p(x_{i};\theta)
\end{align}
</math>
 +
 +
Since it is more convenient to work with the log-likelihood function, we take the log of both sides to get
 +
:<math>\displaystyle l(\theta)=\displaystyle \sum_{i=1}^n \log p(x_{i};\theta)</math>
 +
 +
So,

:<math>
\begin{align}
l(\underline\beta)&=\sum_{i=1}^n \left[ y_{i}\log\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)+(1-y_{i})\log\left(\frac{1}{1+\exp(\underline{\beta}^T\underline{x_i})}\right) \right]\\
&= \sum_{i=1}^n \left[ y_{i}\left(\underline{\beta}^T\underline{x_i}-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\right)+(1-y_{i})\left(-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\right) \right]\\
&= \sum_{i=1}^n \left[ y_{i}\underline{\beta}^T\underline{x_i}-y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))- \log(1+\exp(\underline{\beta}^T\underline{x_i}))+y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i})) \right]\\
&=\sum_{i=1}^n \left[ y_{i}\underline{\beta}^T\underline{x_i}- \log(1+\exp(\underline{\beta}^T\underline{x_i})) \right]
\end{align}
</math>

'''Note:''' The reader may find it useful to review [http://fourier.eng.hmc.edu/e161/lectures/algebra/node7.html vector derivatives] before continuing.

To maximize the log-likelihood, set its derivative to 0.

:<math>
\begin{align}
\frac{\partial l}{\partial \underline{\beta}} &= \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right]\\
&=\sum_{i=1}^n \left[{y_i} \underline{x}_i - p(\underline{x}_i;\underline{\beta})\underline{x}_i\right]
\end{align}
</math>
+
These are <math>\,d+1</math> equations that are nonlinear in <math> \beta </math>. Because each <math>\underline{x}_i</math> includes a constant component equal to 1 (for the intercept), the corresponding equation reads <math>\ \sum_{i=1}^n {y_i} =\sum_{i=1}^n p(\underline{x}_i;\underline{\beta})  </math>, i.e. the expected number of class ones matches the observed number.
 +
 
 +
To solve this equation, the [http://numericalmethods.eng.usf.edu/topics/newton_raphson.html Newton-Raphson algorithm] is used which requires the second derivative of the log-likelihood <math>\,l(\beta)</math> with respect to <math>\,\beta</math> in addition to the first derivative of <math>\,l(\beta)</math> with respect to <math>\,\beta</math>. This is demonstrated in the next section.
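
Before turning to Newton-Raphson, the log-likelihood and its first derivative can be written down and checked numerically. The MATLAB sketch below uses synthetic two-class data. The parameter vector <code>beta_true</code>, the test point <code>b0</code> and the step <code>h</code> are assumptions made for this example only. It evaluates <math>\,l(\beta)</math> and the gradient <math>\sum_{i}(y_i - p(\underline{x}_i;\underline{\beta}))\underline{x}_i</math>, then compares the gradient with central finite differences.

```matlab
% Log-likelihood and gradient for two-class logistic regression (synthetic data).
n = 100;  d = 2;
X = [randn(d, n); ones(1, n)];             % (d+1)-by-n, one column per point
beta_true = [1.5; -2; 0.3];                % hypothetical parameters for the simulation
p_true = 1 ./ (1 + exp(-(X' * beta_true)));
y = double(rand(n, 1) < p_true);           % labels in {0, 1}

loglik = @(b) sum(y .* (X' * b) - log(1 + exp(X' * b)));   % l(beta)
prob   = @(b) 1 ./ (1 + exp(-(X' * b)));                   % p(x_i; beta)
grad   = @(b) X * (y - prob(b));                           % sum_i (y_i - p_i) x_i

% Compare the analytic gradient with central finite differences at a test point
b0 = [0.2; -0.1; 0.4];
h  = 1e-6;
g_fd = zeros(d + 1, 1);
for j = 1:(d + 1)
  e = zeros(d + 1, 1);  e(j) = h;
  g_fd(j) = (loglik(b0 + e) - loglik(b0 - e)) / (2 * h);
end
fprintf('max gradient discrepancy: %g\n', max(abs(g_fd - grad(b0))));
```

At the maximum likelihood estimate this gradient vanishes, which is the system of equations that Newton-Raphson is used to solve.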
 +
 
 +
=== Example: Logistic Regression in MATLAB ===
 +
 
 +
% function x = logistic(a, y, w, ridge, param)
 +
%
 +
% Logistic regression.  Design matrix A, targets Y, optional instance
 +
% weights W, optional ridge term RIDGE, optional parameters object PARAM.
 +
%
 +
% W is a vector with length equal to the number of training examples; RIDGE
 +
% can be either a vector with length equal to the number of regressors, or
 +
% a scalar (the latter being synonymous to a vector with all entries the
 +
% same).
 +
%
 +
% PARAM has fields PARAM.MAXITER (an iteration limit), PARAM.VERBOSE
 +
% (whether to print diagnostic information), PARAM.EPSILON (used to test
 +
% convergence), and PARAM.MAXPRINT (how many regression coefficients to
 +
% print if VERBOSE==1).
 +
%
 +
% Model is
 +
%
 +
%  E(Y) = 1 ./ (1+exp(-A*X))
 +
%
 +
% Outputs are regression coefficients X.
 +
 
 +
function x = logistic(a, y, w, ridge, param)
 +
 
 +
% process parameters
 +
 
 +
[n, m] = size(a);
 +
if ((nargin < 3) || (isempty(w)))
 +
  w = ones(n, 1);
 +
end
 +
if ((nargin < 4) || (isempty(ridge)))
 +
  ridge = 1e-5;
 +
end
 +
if (nargin < 5)
 +
  param = [];
 +
end
 +
if (length(ridge) == 1)
 +
    ridgemat = speye(m) * ridge;
 +
elseif (length(ridge(:)) == m)
 +
    ridgemat = spdiags(ridge(:), 0, m, m);
 +
else
 +
    error('ridge weight vector should be length 1 or %d', m);
 +
end
 +
if (~isfield(param, 'maxiter'))
 +
  param.maxiter = 200;
 +
end
 +
if (~isfield(param, 'verbose'))
 +
  param.verbose = 0;
 +
end
 +
if (~isfield(param, 'epsilon'))
 +
  param.epsilon = 1e-10;
 +
end
 +
if (~isfield(param, 'maxprint'))
 +
  param.maxprint = 5;
 +
end
 +
 
 +
% do the regression
 +
x = zeros(m,1);
 +
oldexpy = -ones(size(y));
 +
for iter = 1:param.maxiter
 +
  adjy = a * x;
 +
  expy = 1 ./ (1 + exp(-adjy));
 +
  deriv = expy .* (1-expy);
 +
  wadjy = w .* (deriv .* adjy + (y-expy));
 +
  weights = spdiags(deriv .* w, 0, n, n);
 +
  x = (a' * weights * a + ridgemat) \ (a' * wadjy);
 +
  if (param.verbose)
 +
    len = min(param.maxprint, length(x));
 +
    fprintf('%3d: [',iter);
 +
    fprintf(' %g', x(1:len));
 +
    if (len < length(x))
 +
      fprintf(' ... ');
 +
    end
 +
    fprintf(' ]\n');
 +
  end
 +
  if (sum(abs(expy-oldexpy)) < n*param.epsilon)
 +
    if (param.verbose)
 +
      fprintf('Converged.\n');
 +
    end
 +
    return;
 +
  end
 +
    oldexpy = expy;
 +
end
 +
warning('logistic:notconverged', 'Failed to converge');
 +
 
 +
 
 +
====Extension====
 +
 
 +
* When we are dealing with a problem with more than two classes, we need to generalize our logistic regression to a [http://en.wikipedia.org/wiki/Multinomial_logit Multinomial Logit model].
 +
*An extension of the logistic model to sets of interdependent variables is the [http://en.wikipedia.org/wiki/Conditional_random_field Conditional random field].
 +
 
 +
* Advantages and Limitations of Linear Regression Model:
 +
:1. Linear regression implements a statistical model that, when relationships between the independent variables and the dependent variable are almost linear, shows optimal results.
 +
:2. Linear regression is often inappropriately used to model non-linear relationships.
 +
:3. Linear regression is limited to predicting numeric output.
 +
:4. A lack of explanation about what has been learned can be a problem.
 +
 
 +
* Limitations of Logistic Regression:
 +
:1. No assumptions are made about the distributions of the features of the data (i.e. the explanatory variables). However, the features should not be highly correlated with one another, because this could cause problems with estimation.
 +
:2. A large number of data points (i.e. a large sample size) is required for logistic regression to provide adequate estimates of the parameters in both classes. The more features/dimensions the data has, the larger the sample size required.
 +
:3. According to [http://www.google.ca/url?sa=t&source=web&cd=3&ved=0CC0QFjAC&url=http%3A%2F%2Fwww.csun.edu%2F~ata20315%2Fpsy524%2Fdocs%2FPsy524%2520lecture%252018%2520logistic.ppt&rct=j&q=logistic%20regression%20limitations&ei=mN7RTOC5HcWOnwfP0eho&usg=AFQjCNFBQ8BNxnc7xVArBgVgVWJOnDLMlw&sig2=_6j0mR3r92_xVGtzEJl7oA&cad=rja this source] however, the only real limitation of logistic regression as compared to other types of regression such as linear regression is that the response variable <math>\,y</math> can only take discrete values.
 +
 
 +
====Further reading ====
 +
Some supplemental readings on linear and logistic regression:
 +
 
 +
1- A simple method of sample size calculation for linear and logistic regression [http://onlinelibrary.wiley.com/doi/10.1002/%28SICI%291097-0258%2819980730%2917:14%3C1623::AID-SIM871%3E3.0.CO;2-S/pdf here]
 +
 
 +
2- Choosing Between Logistic Regression and Discriminant Analysis [http://www.jstor.org/stable/pdfplus/2286261.pdf?acceptTC=true here]
 +
 
 +
3- On the existence of maximum likelihood estimates in logistic regression models [http://biomet.oxfordjournals.org/content/71/1/1.full.pdf+html here]
 +
 
 +
==Lecture summary==
 +
 
 +
This lecture introduced logistic regression as a classification technique by using linear regression as a stepping-stone.  Classification using models found by linear regression is discouraged, but linear regression provides insight into other forms of regression.  However, one important difference between linear and logistic regression is that the former uses the Least-Squares technique to estimate parameters while the latter uses Maximum Likelihood Estimation for this task.  Maximum Likelihood Estimation works by fitting a density function (in this case, a logistic function) that maximizes the probability of observing the training data.  The lecture finishes by noting some caveats of using logistic regression.
 +
 
 +
== Logistic Regression Cont. - October 14, 2010  ==
 +
 
 +
===Logistic Regression Model===
 +
 
 +
In statistics, '''logistic regression''' (sometimes called the '''logistic model''' or '''logit model''') is used to predict the probability of occurrence of an event by fitting data to a logistic curve. It is a generalized linear model used for binomial regression. Like many forms of regression analysis, it makes use of several predictor variables that may be either numerical or categorical. For example, the probability that a person has a heart attack within a specified time period might be predicted from knowledge of the person's age, sex and body mass index. Logistic regression is used extensively in the medical and social sciences, as well as in marketing applications such as predicting a customer's propensity to purchase a product or cancel a subscription.
 +
 
 +
 
 +
Recall that in the last lecture, we learned the logistic regression model:
 +
 
 +
* <math>P(Y=1 | X=x)=P(\underline{x};\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}</math>
 +
* <math>P(Y=0 | X=x)=1-P(\underline{x};\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x})}</math>
 +
 
 +
===Estimating Parameters <math>\underline{\beta}</math> ===
 +
 
 +
'''Criterion''': find a <math>\underline{\beta}</math> that maximizes the conditional likelihood of Y given X using the training data.
 +
 
 +
From above, we have the first derivative of the log-likelihood:
 +
 
 +
<math>\frac{\partial l}{\partial \underline{\beta}} = \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right] </math>
 +
<math>=\sum_{i=1}^n \left[{y_i} \underline{x}_i - P(\underline{x}_i;\underline{\beta})\underline{x}_i\right]</math>
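As a minimal sketch, the gradient formula above can be evaluated directly. The example is written in Python rather than the Matlab used elsewhere on this page, and the function names and the toy dataset are illustrative assumptions, not part of the lecture.

```python
import math

def prob(x, beta):
    """P(Y=1 | X=x) under the logistic model; x carries a leading 1 for the intercept."""
    a = sum(b * xj for b, xj in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-a))

def gradient(xs, ys, beta):
    """First derivative of the log-likelihood: sum_i (y_i - P(x_i; beta)) * x_i."""
    grad = [0.0] * len(beta)
    for x, y in zip(xs, ys):
        r = y - prob(x, beta)
        for j, xj in enumerate(x):
            grad[j] += r * xj
    return grad

# Hypothetical toy data: each x_i is (1, feature)
xs = [(1.0, -2.0), (1.0, -1.0), (1.0, 1.0), (1.0, 2.0)]
ys = [0, 0, 1, 1]
g = gradient(xs, ys, [0.0, 0.0])
```

At <math>\underline{\beta}=0</math> every <math>P(\underline{x}_i;\underline{\beta})</math> equals 0.5, so each residual is <math>\pm 0.5</math> and the gradient points in the direction that increases the slope coefficient.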
 +
 
 +
'''Newton-Raphson Algorithm:'''<br />
 +
 
 +
If we want to find <math>\ x^* </math> such that <math>\ f(x^*)=0</math>, we proceed by first arbitrarily picking a starting point <math>\,x^{old}</math> and then iterating the following two steps until convergence, i.e. until <math>\, x^{new}</math> is sufficiently close to <math>\, x^{old}</math> according to a chosen criterion of closeness:
 +
<br />
 +
Step 1:
 +
<math>\, x^{new} \leftarrow x^{old}-\frac {f(x^{old})}{f'(x^{old})} </math><br />
 +
<br />
 +
Step 2:
 +
<math>\, x^{old} \leftarrow x^{new}</math> <br />
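The two steps can be sketched as a short Python function. This is only an illustration: the tolerance, the iteration cap and the test function <math>f(x)=x^2-2</math> are assumptions chosen for the example.

```python
def newton_root(f, fprime, x_old, tol=1e-10, max_iter=100):
    """Iterate x_new = x_old - f(x_old)/f'(x_old) until successive iterates agree within tol."""
    for _ in range(max_iter):
        x_new = x_old - f(x_old) / fprime(x_old)   # Step 1
        if abs(x_new - x_old) < tol:               # closeness criterion
            return x_new
        x_old = x_new                              # Step 2
    raise RuntimeError("Newton-Raphson did not converge")

# Illustrative problem: the positive root of f(x) = x^2 - 2, i.e. sqrt(2)
root = newton_root(lambda x: x * x - 2.0, lambda x: 2.0 * x, x_old=1.0)
```

The iteration cap guards against starting points from which the method does not converge, which can happen when <math>f'</math> is close to zero along the way.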
 +
 
 +
 
 +
If we instead want to find a point at which <math>\ f'(x)=0</math>, we apply the same procedure to <math>\ f'</math> and replace the two steps above by the following two steps:
 +
<br />
 +
Step 1: <math>\ x^{new} \leftarrow x^{old}-\frac {f'(x^{old})}{f''(x^{old})} </math> <br />
 +
<br />
 +
Step 2:
 +
<math> \ x^{old} \leftarrow x^{new}</math> <br />
 +
 
 +
If we want to maximize or minimize <math>\ f(x) </math>, then we solve for the value of <math>\,x</math> at which <math>\ f'(x)=0 </math> using the following iterative updating rule that generates <math>\ x^{new}</math> from <math>\ x^{old}</math>:
 +
<br /><math>\ x^{new} \leftarrow x^{old}-\frac {f'(x^{old})}{f''(x^{old})} </math><br />
 +
 
 +
Using vector notation, the above rule can be written as <br />
 +
 
 +
<math>
 +
X^{new} \leftarrow X^{old} - H^{-1}(f)(X^{old})\nabla f(X^{old})
 +
</math>
 +
<br />
 +
where <math>\,H</math> is the [http://en.wikipedia.org/wiki/Hessian_matrix Hessian matrix] or second derivative matrix and <math>\,\nabla</math> is the [http://en.wikipedia.org/wiki/Gradient gradient] or first derivative vector.
 +
<br />
 +
 
 +
'''note:''' If the Hessian is not invertible the [http://en.wikipedia.org/wiki/Generalized_inverse generalized inverse] or pseudo inverse can be used
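A minimal Python sketch of the vector update, assuming a two-dimensional problem so that the Hessian system can be solved by hand. The objective <math>f(x,y)=e^{x+y}+x^2+y^2</math> is a made-up convex example, not one from the lecture, and the helper names are hypothetical.

```python
import math

def grad(v):
    """Gradient of f(x, y) = exp(x + y) + x^2 + y^2."""
    x, y = v
    e = math.exp(x + y)
    return [e + 2.0 * x, e + 2.0 * y]

def hessian(v):
    """Hessian (second-derivative matrix) of the same f."""
    x, y = v
    e = math.exp(x + y)
    return [[e + 2.0, e], [e, e + 2.0]]

def solve2(H, g):
    """Solve the 2x2 system H d = g by Cramer's rule (assumes H is invertible)."""
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    return [(H[1][1] * g[0] - H[0][1] * g[1]) / det,
            (H[0][0] * g[1] - H[1][0] * g[0]) / det]

def newton_minimize(v, tol=1e-12, max_iter=50):
    for _ in range(max_iter):
        d = solve2(hessian(v), grad(v))        # d = H^{-1} * gradient
        v_new = [v[0] - d[0], v[1] - d[1]]     # X_new = X_old - H^{-1} grad f(X_old)
        if max(abs(v_new[0] - v[0]), abs(v_new[1] - v[1])) < tol:
            return v_new
        v = v_new
    raise RuntimeError("did not converge")

v_star = newton_minimize([0.0, 0.0])
```

Solving the linear system instead of forming the inverse explicitly is a common design choice; for larger problems a linear-algebra library would replace the hand-written 2x2 solver.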
 +
<br />
 +
<br />
 +
 
 +
 
 +
As shown above, the [http://en.wikipedia.org/wiki/Newton%27s_method Newton-Raphson algorithm] requires the second derivative, or Hessian, of the log-likelihood:
 +
 
 +
 
 +
 
 +
<math>\frac{\partial^{2} l}{\partial \underline{\beta} \partial \underline{\beta}^T }=
 +
\sum_{i=1}^n - \underline{x_i} \frac{(exp(\underline{\beta}^T\underline{x}_i) \underline{x}_i^T)(1+exp(\underline{\beta}^T \underline{x}_i))- exp(\underline{\beta}^T\underline{x}_i)\underline{x}_i^Texp(\underline{\beta}^T\underline{x}_i)}{(1+exp(\underline{\beta}^T \underline{x}_i))^2}</math>
 +
 
 +
('''note''': <math>\frac{\partial\underline{\beta}^T\underline{x}_i}{\partial \underline{\beta}^T}=\underline{x}_i^T</math>. This can be checked [http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html here], in the Matrix Reference Manual, which collects information about linear algebra and the properties of real and complex matrices.)
 +
 
 +
 
 +
::<math>=\sum_{i=1}^n \frac{(-\underline{x}_i exp(\underline{\beta}^T\underline{x}_i) \underline{x}_i^T)}{(1+exp(\underline{\beta}^T \underline{x}_i))(1+exp(\underline{\beta}^T \underline{x}_i))}</math> (by cancellation)
 +
 
 +
::<math>=\sum_{i=1}^n - \underline{x}_i \underline{x}_i^T P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> (since <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> and <math>1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}</math>)
 +
 
 +
The same second derivative can be obtained if we reduce the occurrences of <math>\underline{\beta}</math> to one by using the identity <math>\frac{a}{1+a}=1-\frac{1}{1+a}</math>
 +
 
 +
and then evaluating <math>\frac{\partial}{\partial \underline{\beta}^T}\sum_{i=1}^n \left[{y_i} \underline{x}_i-\left[1-\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}\right]\underline{x}_i\right] </math>
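One way to gain confidence in the Hessian formula is to compare it with finite differences of the gradient. The Python sketch below does this on a small made-up dataset; the data, the value of <math>\underline{\beta}</math> and the step size are assumptions chosen only for illustration.

```python
import math

def prob(x, beta):
    a = sum(b * xj for b, xj in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-a))

def gradient(xs, ys, beta):
    """sum_i (y_i - P_i) x_i"""
    g = [0.0] * len(beta)
    for x, y in zip(xs, ys):
        r = y - prob(x, beta)
        for j, xj in enumerate(x):
            g[j] += r * xj
    return g

def hessian(xs, beta):
    """Analytic Hessian: -sum_i x_i x_i^T P_i (1 - P_i)."""
    d = len(beta)
    H = [[0.0] * d for _ in range(d)]
    for x in xs:
        p = prob(x, beta)
        w = p * (1.0 - p)
        for j in range(d):
            for k in range(d):
                H[j][k] -= w * x[j] * x[k]
    return H

def numeric_hessian(xs, ys, beta, h=1e-6):
    """Central differences of the gradient, one column per coordinate of beta."""
    d = len(beta)
    H = [[0.0] * d for _ in range(d)]
    for k in range(d):
        up = list(beta)
        up[k] += h
        dn = list(beta)
        dn[k] -= h
        gu, gd = gradient(xs, ys, up), gradient(xs, ys, dn)
        for j in range(d):
            H[j][k] = (gu[j] - gd[j]) / (2.0 * h)
    return H

# Hypothetical data: each x_i is (1, feature)
xs = [(1.0, -2.0), (1.0, -0.5), (1.0, 0.3), (1.0, 1.7)]
ys = [0, 1, 0, 1]
beta = [0.2, -0.4]
H_exact = hessian(xs, beta)
H_num = numeric_hessian(xs, ys, beta)
max_gap = max(abs(H_exact[j][k] - H_num[j][k]) for j in range(2) for k in range(2))
```

Note that the labels <math>y_i</math> do not appear in the analytic Hessian, which matches the formula above.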
 +
 
 +
 
 +
In each of the iterative steps, starting with the existing <math>\,\underline{\beta}^{old}</math> which is initialized with an arbitrarily chosen value, the [http://en.wikipedia.org/wiki/Newton-Raphson Newton-Raphson] updating rule for obtaining <math>\,\underline{\beta}^{new}</math> is
 +
 
 +
<math>\,\underline{\beta}^{new}\leftarrow \,\underline{\beta}^{old}- (\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T})^{-1}(\frac{\partial l}{\partial \underline{\beta}})</math> where the derivatives are evaluated at <math>\,\underline{\beta}^{old}</math>
 +
 
 +
The iterations terminate when <math>\underline{\beta}^{new}</math> is very close to <math>\underline{\beta}^{old}</math> according to an arbitrarily defined criterion.
 +
 
 +
Each iteration can be described in matrix form.
 +
 
 +
* Let <math>\,\underline{Y}</math> be the column vector of <math>\,y_i</math>.  (<math>n\times1</math>)
 +
* Let <math>\,X</math> be the <math>{(d+1)}\times{n}</math> input matrix.
 +
* Let <math>\,\underline{P}</math> be the <math>{n}\times{1}</math> vector with <math>\,i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})</math>.
 +
* Let <math>\,W</math> be an <math>{n}\times{n}</math> diagonal matrix with <math>\,i,i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})[1-P(\underline{x}_i;\underline{\beta}^{old})]</math>
 +
 
 +
then
 +
 
 +
<math>\frac{\partial l}{\partial \underline{\beta}} = X(\underline{Y}-\underline{P})</math>
 +
 
 +
<math>\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T} = -XWX^T</math>
 +
 
 +
The [http://en.wikipedia.org/wiki/Newton-Raphson Newton-Raphson] step is
 +
 
 +
<math>\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math>
 +
 
 +
This equation is sufficient for computation of the logistic regression model. However, we can simplify further to uncover an interesting feature of this equation.
 +
 
 +
<math>
 +
\begin{align}
 +
\underline{\beta}^{new} &= \,\underline{\beta}^{old}- (\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T})^{-1}(\frac{\partial l}{\partial \underline{\beta}})\\
 +
&= \,\underline{\beta}^{old}- (-XWX^T)^{-1}X(\underline{Y}-\underline{P})\\
 +
&= \,(XWX^T)^{-1}(XWX^T)\underline{\beta}^{old}- (XWX^T)^{-1}(XWX^T)(-XWX^T)^{-1}X(\underline{Y}-\underline{P})\\
 +
&= (XWX^T)^{-1}(XWX^T)\underline{\beta}^{old}+(XWX^T)^{-1}XWW^{-1}(\underline{Y}-\underline{P})\\
 +
&=(XWX^T)^{-1}XW[X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})]\\
 +
&=(XWX^T)^{-1}XWZ
 +
\end{align}</math>
 +
 
 +
where <math>Z=X^{T}\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})</math>
 +
 
 +
This is an adjusted response and it is solved repeatedly as <math>\, P </math>, <math>\, W </math>,  and <math>\, Z </math> are iteratively updated during the steps until convergence is achieved. This algorithm is called [http://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares iteratively reweighted least squares] because it solves the weighted least squares problem iteratively.
 +
 
 +
Recall that linear regression by least squares finds the following minimum: <math>\min_{\underline{\beta}}(\underline{y}-X^T \underline{\beta})^T(\underline{y}-X^T \underline{\beta})</math>
 +
 
 +
The minimizer is <math>\hat{\underline{\beta}}=(XX^T)^{-1}X\underline{y}</math>
 +
 
 +
Similarly, we can say that <math>\underline{\beta}^{new}</math> is the solution of a weighted least square problem:
 +
 
 +
<math>\underline{\beta}^{new} \leftarrow \arg \min_{\underline{\beta}}(Z-X^T \underline{\beta})^T W(Z-X^T \underline{\beta})</math>
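
To see why this weighted least squares problem reproduces the Newton-Raphson update, one can set the gradient of the weighted objective to zero. The short derivation below assumes that <math>\,W</math> is symmetric (it is diagonal here) and that <math>\,XWX^T</math> is invertible:

<math>
\begin{align}
\frac{\partial}{\partial \underline{\beta}}(Z-X^T \underline{\beta})^T W(Z-X^T \underline{\beta}) &= -2XW(Z-X^T\underline{\beta}) = 0\\
\Rightarrow \quad XWX^T\underline{\beta} &= XWZ\\
\Rightarrow \quad \underline{\beta} &= (XWX^T)^{-1}XWZ
\end{align}</math>

This is the same expression as the last line of the simplification of the Newton-Raphson step.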
 +
 
 +
====Pseudo Code====
 +
First, initialize <math>\,\underline{\beta}^{old} \leftarrow 0</math> and set <math>\,\underline{Y}</math>, the labels associated with the observations <math>\,i=1...n</math>.
 +
Then, in each iterative step, perform the following:
 +
#Compute <math>\,\underline{P}</math> according to the equation <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> for all <math>\,i=1...n</math>.
 +
#Compute the diagonal matrix <math>\,W</math> by setting <math>\,W_{i,i}</math> to <math>P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> for all <math>\,i=1...n</math>.
 +
#Compute <math>Z \leftarrow X^T\underline{\beta}+W^{-1}(\underline{Y}-\underline{P})</math>.
 +
#<math>\underline{\beta}^{new} \leftarrow (XWX^T)^{-1}XWZ</math>.
 +
#If <math>\underline{\beta}^{new}</math> is sufficiently close to <math>\underline{\beta}^{old}</math> according to a chosen criterion, then stop; otherwise, set <math>\,\underline{\beta}^{old} \leftarrow \underline{\beta}^{new}</math> and perform another iteration.
 +
 
 +
The following Matlab code implements the method above:
 +
 
 +
  Error = 0.01;
 +
 
 +
  %Initialize logistic variables (Xnew holds one training point per column; ytrain is n-by-1)
  [m,n]=size(Xnew);
 +
  B_old=0.1*ones(m,1); %beta
 +
  W=0.5*eye(n); %weights (W must be diagonal, so off-diagonal entries start at zero)
 +
  P=zeros(n,1);
 +
  Norm=1;
 +
 
 +
  while Norm>Error %while the change in Beta (represented by the norm between B_new and B_old) is higher than the threshold, iterate
 +
      for i=1:n
 +
          P(i,1)=exp(B_old'*Xnew(:,i))/(1+exp(B_old'*Xnew(:,i)));
 +
          W(i,i)=P(i,1)*(1-P(i,1));
 +
      end
 +
      z = Xnew'*B_old + pinv(W)*(ytrain-P);
 +
      B_new = pinv(Xnew*W*Xnew')*Xnew*W*z;
 +
      Norm=sqrt((B_new-B_old)'*(B_new-B_old));
 +
      B_old = B_new;
 +
  end
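
For readers without Matlab, the same iteration can be sketched in plain Python. This is a simplified illustration under stated assumptions: it handles only an intercept and one feature (so that the <math>2\times 2</math> system can be solved by hand), and the dataset, tolerance and function name are hypothetical.

```python
import math

def irls_logistic(features, labels, tol=1e-8, max_iter=50):
    """Two-parameter (intercept + slope) logistic regression by IRLS."""
    b0, b1 = 0.0, 0.0                                  # beta_old <- 0
    for _ in range(max_iter):
        s00 = s01 = s11 = t0 = t1 = 0.0
        for x, y in zip(features, labels):
            a = b0 + b1 * x                            # beta^T x_i
            p = 1.0 / (1.0 + math.exp(-a))             # step 1: P_i
            w = p * (1.0 - p)                          # step 2: W_ii
            z = a + (y - p) / w                        # step 3: Z_i
            # accumulate X W X^T (2x2, symmetric) and X W Z (2x1)
            s00 += w
            s01 += w * x
            s11 += w * x * x
            t0 += w * z
            t1 += w * z * x
        det = s00 * s11 - s01 * s01
        n0 = (s11 * t0 - s01 * t1) / det               # step 4: solve (X W X^T) beta = X W Z
        n1 = (s00 * t1 - s01 * t0) / det
        if math.hypot(n0 - b0, n1 - b1) < tol:         # step 5: convergence check
            return n0, n1
        b0, b1 = n0, n1
    raise RuntimeError("IRLS did not converge")

# Hypothetical, non-separable toy data
features = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]
labels = [0, 0, 1, 0, 1, 1, 1]
b0_hat, b1_hat = irls_logistic(features, labels)
```

The toy data are deliberately not linearly separable; with perfectly separable classes the maximum likelihood estimate does not exist and the weights <math>W_{i,i}</math> shrink towards zero, so the division in step 3 becomes unstable.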
 +
 
 +
====Classification====
 +
To implement classification, we compute <math> \underline{\beta}^{T} x</math>. If <math> \underline{\beta}^{T} x <0 </math>, then <math>\, x </math> is assigned to class 0; otherwise it is assigned to class 1.
 +
 
 +
===Comparison with Linear Regression===
 +
*'''Similarities'''
 +
#They both attempt to estimate <math>\,P(Y=k|X=x)</math> (for logistic regression, we have so far only discussed the case <math>\,k=0</math> or <math>\,k=1</math>).
 +
#They both have linear boundaries.
 +
:'''note:''' For linear regression, we assume the model is linear. The boundary is <math>P(Y=k|X=x)=\underline{\beta}^T\underline{x}+\beta_0=0.5</math> (linear)
 +
 
 +
::For logistic regression, the boundary is <math>P(Y=k|X=x)=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}=0.5 \Rightarrow exp(\underline{\beta}^T \underline{x})=1\Rightarrow \underline{\beta}^T \underline{x}=0</math> (also linear, although <math>\,P(Y=k|X=x)</math> itself is a nonlinear function of <math>\,x</math>)
 +
 
 +
*'''Differences'''
 +
 
 +
 
 +
#Linear regression: <math>\,P(Y=k|X=x)</math> is a linear function of <math>\,x</math>; <math>\,P(Y=k|X=x)</math> is not guaranteed to fall between 0 and 1 or to sum to 1. There exists a closed form solution for least squares.
 +
#Logistic regression: <math>\,P(Y=k|X=x)</math> is a nonlinear function of <math>\,x</math>, and it is guaranteed to range from 0 to 1 and to sum  up to 1. No closed form solution exists, so the Newton-Raphson algorithm is typically used to arrive at an estimate for the parameters.
 +
 
 +
===Comparison with LDA===
 +
#The linear logistic model only considers the conditional distribution <math>\,P(Y=k|X=x)</math>. No assumption is made about  <math>\,P(X=x)</math>.
 +
#The LDA model specifies the joint distribution of <math>\,X</math> and <math>\,Y</math>.
 +
#Logistic regression maximizes the conditional likelihood of <math>\,Y</math> given <math>\,X</math>: <math>\,P(Y=k|X=x)</math>
 +
#LDA maximizes the joint likelihood of <math>\,Y</math> and <math>\,X</math>: <math>\,P(Y=k,X=x)</math>.
 +
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in logistic regression is <math>\,d+1</math> (including the intercept). The number of parameters grows linearly w.r.t. dimension.
 +
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in LDA is <math>\,(2d)+d(d+1)/2+2=(d^2+5d+4)/2</math>. The number of parameters grows quadratically w.r.t. dimension.
 +
#LDA estimates parameters more efficiently by using more information about the data; samples without class labels can also be used in LDA.
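
The difference in growth rates is easy to tabulate. The Python sketch below counts parameters for the two-class case; counting the intercept as one of the logistic regression parameters is an assumed convention, and the function names are hypothetical.

```python
def logistic_params(d):
    """Two-class logistic regression: d slope coefficients plus one intercept (assumed convention)."""
    return d + 1

def lda_params(d):
    """Two-class LDA: two mean vectors (2d), one shared symmetric covariance (d(d+1)/2), two priors (2)."""
    return 2 * d + d * (d + 1) // 2 + 2

# (logistic, LDA) parameter counts for a few dimensions
counts = {d: (logistic_params(d), lda_params(d)) for d in (2, 10, 100)}
```

For <math>\,d=100</math> this gives 101 parameters for logistic regression against 5252 for LDA.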
 +
 
 +
Robustness:
 +
#Logistic regression relies on fewer assumptions, so it is generally felt to be more robust [http://www-stat.stanford.edu/~tibs/ElemStatLearn/ (Hastie, T., et al., 2009, p. 128)]. For high-dimensionality data, logistic regression is more accommodating.
 +
#Logistic regression is also more robust because it down-weights outliers, unlike LDA [http://www-stat.stanford.edu/~tibs/ElemStatLearn/ (Hastie, T., et al., 2009, p. 128)].
 +
#In practice, Logistic regression and LDA often give similar results [http://www-stat.stanford.edu/~tibs/ElemStatLearn/ (Hastie, T., et al., 2009, p. 128)].
 +
Also, to compare the results obtained by LDA, QDA and logistic regression, the following link can be used:
 +
http://www.cs.uwaterloo.ca/~a2curtis/courses/2005/ML-classification.pdf.
 +
 
 +
Many other advantages of logistic regression are explained [http://www.statgun.com/tutorials/logistic-regression.html here].
 +
 
 +
 
 +
====By example====
 +
 
 +
Now we compare LDA and Logistic regression by an example. Again, we use them on the 2_3 data.
 +
  >>load 2_3;
 +
  >>[U, sample] = princomp(X');
 +
  >>sample = sample(:,1:2);
 +
  >>plot (sample(1:200,1), sample(1:200,2), '.');
 +
  >>hold on;
 +
  >>plot (sample(201:400,1), sample(201:400,2), 'r.');
 +
:First, we do PCA on the data and plot the data points that represent 2 or 3 in different colors. See the previous example for more details.
 +
 
 +
  >>group = ones(400,1);
 +
  >>group(201:400) = 2;
 +
:Group the data points.
 +
 
 +
  >>[B,dev,stats] = mnrfit(sample,group);
 +
  >>x=[ones(1,400); sample'];
 +
:Now we use [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/mnrfit.html&http://www.google.cn/search?hl=zh-CN&q=mnrfit+matlab&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= mnrfit] to fit a logistic regression model and classify the data. This function returns <math>\underline{\beta}</math>, which is a <math>\,(d+1)</math><math>\,\times</math><math>\,(k-1)</math> matrix of estimates, where each column corresponds to the estimated intercept term and predictor coefficients. In this case, <math>\underline{\beta}</math> is a <math>3\times{1}</math> matrix.
 +
 
 +
  >> B
 +
  B =0.1861
 +
    -5.5917
 +
    -3.0547
 +
 
 +
:This is our <math>\underline{\beta}</math>. So the posterior probabilities are:
 +
:<math>P(Y=1 | X=x)=\frac{exp(0.1861-5.5917X_1-3.0547X_2)}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math>.
 +
:<math>P(Y=2 | X=x)=\frac{1}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math>
 +
 
 +
:The classification rule is:
 +
:<math>\hat Y = 1</math>,    if <math>\,0.1861-5.5917X_1-3.0547X_2 \geq 0</math>
 +
:<math>\hat Y = 2</math>,    if <math>\,0.1861-5.5917X_1-3.0547X_2<0</math>
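
As an illustration, the fitted coefficients can be plugged into the posterior and the classification rule. The Python sketch below uses the three values of <math>\underline{\beta}</math> printed above; the two query points are hypothetical points in the plane of the first two principal components, not observations from the 2_3 data.

```python
import math

B = (0.1861, -5.5917, -3.0547)   # intercept and two coefficients as printed by mnrfit above

def posterior_class1(x1, x2):
    """P(Y=1 | X=x) for the fitted two-class model."""
    a = B[0] + B[1] * x1 + B[2] * x2
    return 1.0 / (1.0 + math.exp(-a))

def classify(x1, x2):
    """Class 1 if the linear score is non-negative, class 2 otherwise."""
    return 1 if B[0] + B[1] * x1 + B[2] * x2 >= 0 else 2

p_origin = posterior_class1(0.0, 0.0)    # hypothetical query point (0, 0)
label_origin = classify(0.0, 0.0)
label_far = classify(1.0, 1.0)           # hypothetical query point (1, 1)
```

A point exactly on the decision boundary has posterior 0.5 for both classes; the rule above assigns it to class 1 by convention.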
 +
 
 +
  >>f = sprintf('0 = %g+%g*x+%g*y', B(1), B(2), B(3));
 +
  >>ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))])
 +
:Plot the decision boundary by logistic regression.
 +
[[File:Boundary-lr.png|frame|center|Decision boundary found by logistic regression. The line shows how the two classes are split.]]
 +
 
 +
  >>[class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');
 +
  >>k = coeff(1,2).const;
 +
  >>l = coeff(1,2).linear;
 +
  >>f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));
 +
  >>h=ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);
 +
:Plot the decision boundary by LDA. See the previous example for more information about LDA in matlab.
 +
 
 +
[[File:Boundary-lda.png‎|frame|center| From this figure, we can see that the results of Logistic Regression and LDA are very similar.]]
 +
 
 +
=== Extra Matlab Examples ===
 +
 
 +
==== Example 1 ====
 +
 
 +
% This Matlab code provides a function that uses the Newton-Raphson algorithm
 +
% to calculate ML estimates of a simple logistic regression. Most of the
 +
% code comes from Anders Swensen, "Non-linear regression." There are two
 +
% elements in the beta vector, which we wish to estimate.
 +
 +
function [beta,J_bar] = NR_logistic(data,beta_start)
 +
x=data(:,1); % x is first column of data
 +
y=data(:,2); % y is second column of data
 +
n=length(x)
 +
diff = 1; beta = beta_start; % initial values
 +
while diff>0.0001 % convergence criterion
 +
  beta_old = beta;
 +
  p = exp(beta(1)+beta(2)*x)./(1+exp(beta(1)+beta(2)*x));
 +
  l = sum(y.*log(p)+(1-y).*log(1-p))
 +
  s = [sum(y-p); % scoring function
 +
  sum((y-p).*x)];
 +
  J_bar = [sum(p.*(1-p)) sum(p.*(1-p).*x); % information matrix
 +
  sum(p.*(1-p).*x) sum(p.*(1-p).*x.*x)]
 +
  beta = beta_old + J_bar\s % new value of beta
 +
  diff = sum(abs(beta-beta_old)); % sum of absolute differences
 +
end
 +
 
 +
==== Example 2 ====
 +
 
 +
% This Matlab program illustrates the use of the Newton-Raphson algorithm
 +
% to obtain maximum likelihood estimates of a logistic regression. The data
 +
% and much of the code are taken from Anders Swensen, "Non-linear regression,"
 +
% www.math.uio_no/avdc/kurs/ST110/materiale/opti_30.ps.
 +
% First, load and transform data:
 +
load 'beetle.dat'; % load data
 +
m=length(beetle(:,1)) % count the rows in the data matrix
 +
x=[]; % create empty vectors
 +
y=[];
 +
for j=1:m % expand group data into individual data
 +
  x=[x,beetle(j,1)*ones(1,beetle(j,2))];
 +
  y=[y,ones(1,beetle(j,3)),zeros(1,beetle(j,2)-beetle(j,3))];
 +
end
 +
beetle2=[x;y]';
 +
 +
% Next, specify starting points for iteration on parameter values:
 +
beta0 = [0; 0]
 +
 +
% Finally, call the function NR_logistic and use its output
 +
[betaml,Jbar] = NR_logistic(beetle2,beta0)
 +
covmat = inv(Jbar)
 +
stderr = sqrt(diag(covmat))
 +
 
 +
==== Example 3 ====
 +
 
 +
% function x = logistic(a, y, w)
 +
% Logistic regression.  Design matrix A, targets Y, optional
 +
% instance weights W.  Model is E(Y) = 1 ./ (1+exp(-A*X)).
 +
% Outputs are regression coefficients X.
 +
function x = logistic(a, y, w)
 +
epsilon = 1e-10;
 +
ridge = 1e-5;
 +
maxiter = 200;
 +
[n, m] = size(a);
 +
if nargin < 3
 +
  w = ones(n, 1);
 +
end
 +
x = zeros(m,1);
 +
oldexpy = -ones(size(y));
 +
for iter = 1:maxiter
 +
  adjy = a * x;
 +
  expy = 1 ./ (1 + exp(-adjy));
 +
  deriv = max(epsilon*0.001, expy .* (1-expy));
 +
  adjy = adjy + (y-expy) ./ deriv;
 +
  weights = spdiags(deriv .* w, 0, n, n);
 +
  x = inv(a' * weights * a + ridge*speye(m)) * a' * weights * adjy;
 +
  fprintf('%3d: [',iter);
 +
  fprintf(' %g', x);
 +
  fprintf(' ]\n');
 +
  if (sum(abs(expy-oldexpy)) < n*epsilon)
 +
  fprintf('Converged.\n');
 +
  break;
 +
  end
 +
  oldexpy = expy;
 +
end
 +
 
 +
===Lecture Summary===
 +
 
 +
Traditionally, regression parameters are estimated using maximum likelihood. However, other optimization techniques may be used as well.
 +
<br />
 +
In the case of logistic regression, since there is no closed-form solution for the zero of the first derivative of the log-likelihood function, the Newton-Raphson algorithm is typically used to estimate parameters. The log-likelihood is concave (equivalently, the negative log-likelihood is convex), so any stationary point found by the Newton-Raphson algorithm is a global optimum.
 +
<br />
 +
Logistic regression requires fewer parameters than LDA or QDA, which makes it favorable for high-dimensional data.
 +
 
 +
===Supplements===
 +
 
 +
A detailed proof that the logistic regression objective is convex is available [http://people.csail.mit.edu/jrennie/writing/convexLR.pdf here]. See '1 Binary LR' for the case we discussed in lecture.
 +
 
 +
 
 +
===[http://komarix.org/ac/lr Applications]===
 +
 
 +
1. Collaborative filtering.
 +
 
 +
2. Link Analysis.
 +
 
 +
3. Time Series with Logistic Regression.
 +
 
 +
4. Alias Detection.
 +
 
 +
===References===
 +
 
 +
1. Applied logistic regression
 +
[http://books.google.ca/books?hl=en&lr=&id=Po0RLQ7USIMC&oi=fnd&pg=PA1&dq=Logistic+Regression&ots=DmdTni_oGX&sig=PDYTPVdy3T115RtFbBN3_SzX5Vc#v=onepage&q&f=false]
 +
 
 +
2. External validity of predictive models: a comparison of logistic regression, classification trees, and neural networks
 +
[http://www.jclinepi.com/article/S0895-4356%2803%2900120-3/abstract]
 +
 
 +
3. Logistic Regression: A Self-Learning Text by David G. Kleinbaum, Mitchel Klein [http://books.google.ca/books?id=J7E0JQweHkoC&printsec=frontcover&dq=logistic+regression&hl=en&ei=7WECTcvqMp-KnAeaq6HlDQ&sa=X&oi=book_result&ct=result&resnum=3&ved=0CD8Q6AEwAg#v=onepage&q&f=false]
 +
 
 +
4. Two useful ppt files introducing concepts of logistic regression
 +
[http://www.csun.edu/~ata20315/psy524/docs/Psy524%20lecture%2018%20logistic.pdf] [http://www.daniel-wiechmann.eu/downloads/logreg1.pdf]
 +
 
 +
== '''Multi-Class Logistic Regression & Perceptron - October 19, 2010''' ==
 +
 
 +
=== Multi-Class Logistic Regression ===
 +
Recall that in two-class logistic regression, the class-conditional probability of one of the classes (say class 1) is modeled by a function in the form shown in figure 1.
 +
 
 +
The class-conditional probability of the other class (class 0) is the complement of that of the first class (class 1). <br /><br />
 +
<math>\displaystyle P(Y=0 | X=x) = 1 - P(Y=1 | X=x)</math><br />
 +
 
 +
This function is called the logistic sigmoid function, which is why this algorithm is called "logistic regression".
 +
[[File:Picture1.png‎|150px|thumb|right|<math>Fig.1:  P(Y=1 | X=x)</math>]]
 +
 
 +
<math>\displaystyle \sigma\,\!(a) = \frac {e^a}{1+e^a} = \frac {1}{1+e^{-a}}</math><br /><br />
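
The two algebraic forms of <math>\sigma(a)</math> are equal, but in floating-point arithmetic each one is safe only on part of the real line. The Python sketch below picks the form that avoids overflow; the function name is a hypothetical choice for this example.

```python
import math

def sigmoid(a):
    """Logistic sigmoid, using whichever algebraic form avoids overflow in exp."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))      # 1 / (1 + e^{-a}); e^{-a} <= 1 here
    e = math.exp(a)                            # e^{a} < 1 here
    return e / (1.0 + e)                       # e^{a} / (1 + e^{a})
```

In Python, evaluating <code>math.exp(1000)</code> raises an <code>OverflowError</code>, so the form <math>\frac{e^a}{1+e^a}</math> cannot be used naively for large positive <math>a</math>.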
 +
 
 +
In two-class logistic regression, we compare the class-conditional probability of one class to the other using this ratio:<br />
 +
 
 +
:<math> \frac{P(Y=1|X=x)}{P(Y=0|X=x)}</math><br />
 +
 
 +
If we look at the natural logarithm of this ratio, we find that it is always a linear function in <math>\,x</math>:<br />
 +
 
 +
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\underline{\beta}^T\underline{x} \quad  \rightarrow (*)</math> <br /><br />
 +
 
 +
What if we have more than two classes?<br />
 +
 
 +
Using (*), we can extend the notion of logistic regression for the cases where we have more than two classes.<br />
 +
 
 +
Assume we have <math>\,k</math> classes, where <math>\,k</math> is greater than two. Putting an arbitrarily chosen class  (which for simplicity we shall assume is class <math>\,k</math>) aside, and then looking at the logarithm of the ratio of the class-conditional probability of each of the other classes and the class-conditional probability of class <math>\,k</math>, we have: <br />
 +
 
 +
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=k|X=x)}\right)=\underline{\beta}_1^T\underline{x} </math> <br />
 +
:<math>\log\left(\frac{P(Y=2|X=x)}{P(Y=k|X=x)}\right)=\underline{\beta}_2^T\underline{x} </math> <br />
 +
::::<math> \vdots</math><br />
 +
:<math>\log\left(\frac{P(Y=k-1|X=x)}{P(Y=k|X=x)}\right)=\underline{\beta}_{k-1}^T\underline{x} </math> <br />
 +
 
 +
 
 +
Although the denominator in the above class-conditional probability ratios is chosen to be the class-conditional probability of the last class (class <math>\,k</math>), the choice of the denominator is arbitrary in that the class-conditional probability estimates are equivariant under this choice - [http://www.springerlink.com/content/t45k620382733r71/ Linear Methods for Classification].<br /><br />
 +
 
 +
Each of these functions is linear in <math>\,x</math>. However, each has its own <math>\underline{\,\beta}_{i}</math>, and we have to make sure that the probabilities assigned to all of the different classes sum to one.<br /><br />
 +
 
 +
In general, we can write:
 +
<br /><math>P(Y=c | X=x) = \frac{e^{\underline{\beta}_c^T \underline{x}}}{1+\sum_{l=1}^{k-1}e^{\underline{\beta}_l^T \underline{x}}},\quad c \in \{1,\dots,k-1\} </math><br />
 +
<br /><math>P(Y=k | X=x) = \frac{1}{1+\sum_{l=1}^{k-1}e^{\underline{\beta}_l^T \underline{x}}}</math><br />
 +
These class-conditional probabilities clearly sum to one. <br /><br />
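
The two formulas can be sketched in a few lines of Python. The coefficient values and the input below are made up for illustration, and the function name is hypothetical; class <math>\,k</math> plays the role of the reference class, as in the text.

```python
import math

def class_probabilities(betas, x):
    """betas: list of k-1 coefficient vectors (classes 1..k-1); class k is the reference.
    Returns [P(Y=1|x), ..., P(Y=k|x)]."""
    scores = [math.exp(sum(b * xj for b, xj in zip(beta, x))) for beta in betas]
    denom = 1.0 + sum(scores)
    return [s / denom for s in scores] + [1.0 / denom]

# Hypothetical 3-class example with x = (1, x1), so the first coefficient is an intercept
betas = [(0.5, 1.0), (-0.2, 0.3)]
probs = class_probabilities(betas, (1.0, 2.0))

# With k = 2 and a zero score the model reduces to the two-class case with probability 0.5 each
two_class = class_probabilities([(0.0,)], (1.0,))
```

Because every probability shares the same denominator, the sum over all <math>\,k</math> classes is one by construction.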
 +
 
 +
In the two-class problem, it is fairly simple to find the parameter <math>\,\underline{\beta}</math> (which has dimension <math>\,(d+1)\times1</math>), as described in previous lectures. In the multi-class case the iterative Newton method can still be used, but here <math>\,\underline{\beta}</math> has dimension <math>\ (d+1)\times(k-1)</math> and the weight matrix <math>\  W</math> is dense rather than diagonal. The resulting algorithm is feasible but computationally inefficient. One remedy is to re-parametrize the logistic regression problem by suitably expanding the input vector <math>\,x</math>, the vector of parameters <math>\,\beta</math>, the vector of responses <math>\,y</math>, as well as the <math>\,\underline{P}</math> vector and the <math>\,W</math> matrix used in the Newton-Raphson updating rule. For interested readers, details regarding this re-parametrization can be found in [http://www.stat.psu.edu/~jiali/course/stat597e/notes2/logit.pdf Jia Li's "Logistic Regression" slides]. Another major difference is that two-class logistic regression uses the logistic sigmoid function, whereas multi-class logistic regression uses the softmax function. Details regarding the softmax function can be found in [http://www.cedar.buffalo.edu/~srihari/CSE574/Chap4/Chap4-Part3.pdf Sargur N. Srihari's "Logistic Regression" slides]. 
 +
The Newton-Raphson updating rule, however, remains the same as in the two-class case, i.e. it is still <math>\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math>. This key point is also addressed in [http://www.stat.psu.edu/~jiali/course/stat597e/notes2/logit.pdf Jia Li's slides] given above.
 +
<br /><br />
 +
 
 +
Note that logistic regression does not assume a distribution for the prior, whereas LDA assumes the prior to be Bernoulli. <br /><br />
 +
 
 +
[http://en.wikipedia.org/wiki/Random_multinomial_logit Random multinomial logit] models combine a random ensemble of multinomial logit models for use as a classifier.
 +
 
 +
=== Multiple Linear Regression in Matlab ===
 +
 
 +
 
 +
% Examples: Multiple linear regression in Matlab
 +
 
 +
% Load data on cars; identify weight and horsepower as predictors, mileage as the response:
 +
load carsmall
 +
x1 = Weight;
 +
x2 = Horsepower; % Contains NaN data
 +
y = MPG;
 +
 
 +
% Compute regression coefficients for a linear model with an interaction term:
 +
 
 +
X = [ones(size(x1)) x1 x2 x1.*x2];
 +
b = regress(y,X); % Removes NaN data
 +
 
 +
[[File:mra1.jpg]]
 +
 
 +
% Plot the data and the model:
 +
 
 +
scatter3(x1,x2,y,'filled','r')
 +
hold on
 +
x1fit = min(x1):100:max(x1);
 +
x2fit = min(x2):10:max(x2);
 +
[X1FIT,X2FIT] = meshgrid(x1fit,x2fit);
 +
YFIT = b(1) + b(2)*X1FIT + b(3)*X2FIT + b(4)*X1FIT.*X2FIT;
 +
mesh(X1FIT,X2FIT,YFIT);
 +
xlabel('Weight');
 +
ylabel('Horsepower');
 +
zlabel('MPG');
 +
view(50,10);
 +
 
 +
=== Matlab Code for Multi-Class Logistic Regression ===
 +
% Calculation of gradient and objective for Logistic
 +
% Multi-Class Classification.
 +
%
 +
% function [obj,grad] = mcclogistic(v,Y,V,lambda,l,varargin)
 +
% v - vector of parameters [n*p*l,1]
 +
% Y - rating matrix (labels) [n,m]
 +
% V - the feature matrix [m,p]
 +
% lambda - regularization parameter [scalar]
 +
% l - # of labels (1..l)
 +
% obj - value of objective at v [scalar]
 +
% grad - gradient at v [n*p*l,1]
 +
%
 +
% Written by Jason Rennie, April 2005
 +
% Last modified: Tue Jul 25 15:08:38 2006
 +
function [obj,grad] = mcclogistic(v,Y,V,lambda,l,varargin)
 +
  fn = mfilename;
 +
  if nargin < 5
 +
    error('insufficient parameters')
 +
  end
 +
  % Parameters that can be set via varargin
 +
  verbose = 1;
 +
  % Process varargin
 +
  paramgt;
 +
 
 +
  t0 = clock;
 +
  [n,m] = size(Y);
 +
  p = length(v)./n./l;
 +
  if p ~= floor(p) | p < 1
 +
    error('dimensions of v and Y don''t match l');
 +
  end
 +
  U = reshape(v,n,p,l);
 +
  Z = zeros(n,m,l);
 +
  for i=1:l
 +
    Z(:,:,i) = U(:,:,i)*V';
 +
  end
 +
  obj = lambda.*sum(sum(sum(U.^2)))./2;
 +
  dU = zeros(n,p,l);
 +
  YY = full(Y==0) + Y;
 +
  YI = sub2ind(size(Z),(1:n)'*ones(1,m),ones(n,1)*(1:m),YY);
 +
  ZY = Z(YI);
 +
  for i=1:l
 +
    obj = obj + sum(sum(h(ZY-Z(:,:,i)).*(Y~=i).*(Y>0)));
 +
  end
 +
  ZHP = zeros(n,m);
 +
  for i=1:l
 +
    ZHP = ZHP + hprime(ZY-Z(:,:,i)).*(Y~=i).*(Y>0);
 +
  end
 +
  for i=1:l
 +
    dU(:,:,i) = ((Y==i).*ZHP - (Y~=i).*(Y>0).*hprime(ZY-Z(:,:,i)))*V + lambda.*U(:,:,i);
 +
  end
 +
  grad = dU(:);
 +
  if verbose
 +
    fprintf(1,'lambda=%.2e obj=%.4e grad''*grad=%.4e time=%.1f\n',lambda,obj,grad'*grad,etime(clock,t0));
 +
  end
 +
 
 +
function [ret] = h(z)
 +
  ret = log(1+exp(-z));
 +
 
 +
function [ret] = hprime(z)
 +
  ret = -(exp(-z)./(1+exp(-z)));
 +
 
 +
% ChangeLog
 +
% 7/25/06 - Added varargin, verbose
 +
% 3/23/05 - made calculations take better advantage of sparseness
 +
% 3/18/05 - fixed bug in objective (wasn't squaring fro norms)
 +
% 3/1/05 - added objective calculation
 +
% 2/23/05 - fixed bug in hprime()
 +
 
 +
The code is from [http://people.csail.mit.edu/jrennie/matlab/mcclogistic.m here].
 +
Click [http://people.csail.mit.edu/jrennie/matlab/ here] for more information.
 +
 
 +
===Neural Network Concept[http://en.wikipedia.org/wiki/Neural_network]===
 +
The concept of the artificial neural network came from scientists who were interested in simulating the human neural network on computers; they were trying to create computer programs that could learn like people. A neural network is a method in artificial intelligence that was originally thought to be a simplified model of neural processing in the brain. Later studies showed that the human neural network is much more complicated, and the structure described here is not a good model for the biological architecture of the brain. Thus, although neural networks were developed in an attempt to imitate the human brain, in practice they bear little resemblance to the human neural system.
 +
 
 +
=== Perceptron ===
 +
 
 +
==== Content ====
 +
 
 +
The [http://en.wikipedia.org/wiki/Perceptron Perceptron] was invented in 1957 by [http://en.wikipedia.org/wiki/Frank_Rosenblatt Frank Rosenblatt]. It is the basic building block of Feed-Forward neural networks. The perceptron quickly became very popular after it was introduced, because it was shown to be able to solve many classes of useful problems. However, in 1969, [http://en.wikipedia.org/wiki/Marvin_Minsky Marvin Minsky] and [http://en.wikipedia.org/wiki/Seymour_Papert Seymour Papert] published their book [http://en.wikipedia.org/wiki/Perceptrons_%28book%29 ''Perceptrons'' (1969)], in which they strongly criticized the perceptron for its inability to solve simple [http://en.wikipedia.org/wiki/XOR exclusive-or (XOR)] problems, which are not linearly separable. Indeed, the simple perceptron and the single-layer perceptron neural network [http://homepages.gold.ac.uk/nikolaev/311perc.htm] are not able to solve any problem that is not linearly separable. However, it was known to the authors of this book that multi-layer perceptron neural networks can in fact solve problems that are not linearly separable, such as exclusive-or problems, although no efficient learning algorithm was available at that time for this type of neural network. Because of the book ''Perceptrons'', interest in perceptrons and in neural networks in general declined sharply, and it stayed low until 1986, when the [http://en.wikipedia.org/wiki/Back-propagation back-propagation] learning algorithm for neural networks (which is discussed in detail below) was popularized. <br /><br />
 +
 
 +
We know that regressing a response variable <math>\displaystyle Y</math> coded as -1/1 on the observations <math>\displaystyle x</math> by least squares leads to the same coefficients as LDA, up to a scalar multiple (recall that LDA minimizes the distance between the discriminant function (decision boundary) and the data points). The least-squares classifier returns the sign of the linear combination of features as the class label (Figure 2). This concept was called the Perceptron in the engineering literature of the 1950s. <br /><br />
 +
 
 +
[[File:Perceptron.jpg|371px|thumb|right| Fig.2 Diagram of a linear perceptron ]]
 +
 
 +
There is a cost function <math>\,\displaystyle D</math> that the Perceptron tries to minimize:<br />
 +
 
 +
<math>D(\underline{\beta},\beta_0)=-\sum_{i \in M}y_{i}(\underline{\beta}^T \underline{x_{i}}+\beta_0)</math><br />
 +
 
 +
where <math>\,\displaystyle M</math> is the set of misclassified points. <br><br />
 +
 
 +
By minimizing D, we minimize the sum of the distances between the misclassified points and the decision boundary.<br /><br />
 +
 
 +
'''Derivation''':'' The distances between the misclassified points and the decision boundary''.<br /><br />
 +
 
 +
Consider points <math>\underline{x_1}</math>, <math>\underline{x_2}</math> and a decision boundary defined as <math>\underline{\beta}^T\underline{x}+\beta_0=0</math> as shown in Figure 3.<br><br />
 +
 
 +
[[File:DB.jpg|248px|thumb|right| Fig.3 Distance from the decision boundary ]]
 +
 
 +
Both <math>\underline{x_1}</math> and <math>\underline{x_2}</math> lie on the decision boundary, thus:<br />
 +
<math>\underline{\beta}^T\underline{x_1}+\beta_0=0 \rightarrow (1)</math><br />
 +
<math>\underline{\beta}^T\underline{x_2}+\beta_0=0 \rightarrow (2)</math><br><br />
 +
 
 +
Consider (2) - (1):<br />
 +
<math>\underline{\beta}^T(\underline{x_2}-\underline{x_1})=0</math><br><br />
 +
 
 +
We see that <math>\,\displaystyle \underline{\beta}</math> is orthogonal to <math>\underline{x_2}-\underline{x_1}</math>, which lies along the decision boundary. This means that <math>\,\displaystyle \underline{\beta}</math> is orthogonal to the decision boundary. <br><br />
 +
 
 +
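Then the signed distance of a point <math>\,\underline{x_0}</math> from the decision boundary is, up to the constant factor <math>1/\|\underline{\beta}\|</math> (or exactly, if <math>\underline{\beta}</math> is scaled to unit length), the projection of <math>\underline{x_0}-\underline{x_2}</math> onto <math>\underline{\beta}</math>: <br />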
 +
 
 +
<math>\underline{\beta}^T(\underline{x_0}-\underline{x_2})</math><br><br />
 +
 
 +
From (2): <br />
 +
 
 +
<math>\underline{\beta}^T\underline{x_2}= -\beta_0</math>. <br />
 +
<math>\underline{\beta}^T(\underline{x_0}-\underline{x_2})=\underline{\beta}^T\underline{x_0}-\underline{\beta}^T\underline{x_2}=\underline{\beta}^T\underline{x_0}+\beta_0</math><br />
 +
 
 +
Therefore, the signed distance from any point <math>\underline{x_{i}}</math> to the discriminant hyperplane is proportional to <math>\underline{\beta}^T\underline{x_{i}}+\beta_0</math> (and equal to it when <math>\|\underline{\beta}\|=1</math>).<br /><br />
 +
 
 +
However, this quantity is not always positive. Consider <math>\,y_{i}(\underline{\beta}^T \underline{x_{i}}+\beta_0)</math>. If <math>\underline{x_{i}}</math> is classified ''correctly'', then this product is positive, since <math>(\underline{\beta}^T\underline{x_{i}}+\beta_0)</math> and <math>\displaystyle y_{i}</math> are either both positive or both negative. However, if <math>\underline{x_{i}}</math> is classified ''incorrectly'', then one of <math>(\underline{\beta}^T\underline{x_{i}}+\beta_0)</math> and <math>\displaystyle y_{i}</math> is positive and the other one is negative; hence, the product <math>y_{i}(\underline{\beta}^T \underline{x_{i}}+\beta_0)</math> will be negative for a misclassified point. The "-" sign in <math>D(\underline{\beta},\beta_0)</math> makes this cost function non-negative (since only misclassified points enter the sum in D). <br /><br />
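As a minimal sketch of the cost function above, the following Python snippet computes <math>D(\underline{\beta},\beta_0)</math> and the set <math>\,M</math> for a small hand-made data set. The function name <code>perceptron_cost</code>, the toy data, and the choice to count a point lying exactly on the boundary as misclassified are assumptions made for illustration; they are not part of the lecture.

```python
def perceptron_cost(beta, beta0, X, y):
    """Return (D, M): the perceptron cost and the indices of misclassified points."""
    D = 0.0
    M = []
    for i, (x_i, y_i) in enumerate(zip(X, y)):
        # beta^T x_i + beta_0, the (unnormalized) signed distance to the boundary
        score = sum(b * x for b, x in zip(beta, x_i)) + beta0
        # Assumption: a point exactly on the boundary counts as misclassified.
        if y_i * score <= 0:
            M.append(i)
            D -= y_i * score  # the "-" sign makes each contribution non-negative
    return D, M


# Hypothetical toy data with labels in {-1, +1}.
X = [(2.0, 1.0), (1.0, 3.0), (-1.0, -2.0), (-2.0, 0.5)]
y = [1, -1, -1, 1]
D, M = perceptron_cost((1.0, 0.0), 0.0, X, y)
print(D, M)
```

Only the points whose label disagrees with the sign of <math>\underline{\beta}^T\underline{x_{i}}+\beta_0</math> contribute to the sum, so the cost is zero exactly when <math>\,M</math> is empty.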
 +
 
 +
=== Perceptron in Action ===
 +
Here is a Java applet [http://lcn.epfl.ch/tutorial/english/perceptron/html/index.html] that may help in understanding how the Perceptron learns. This applet was developed in the Laboratory of Computational Neuroscience, EPFL, Lausanne, Switzerland.
 +
 
 +
This second applet [http://www.eee.metu.edu.tr/~alatan/Courses/Demo/AppletPerceptron.html] is developed in the Department of Electrical and Electronics Engineering, Middle East Technical University, Ankara, Turkey.
 +
 
 +
This third Java applet [http://neuron.eng.wayne.edu/java/Perceptron/New38.html] has been provided by the Computation and Neural Networks Laboratory, College of Engineering, Wayne State University, Detroit, Michigan.
 +
 
 +
This fourth applet [http://husky.if.uidaho.edu/nn/jdemos/05/Fred%20Corbett/www.etimage.com/java/appletNN/NeuronTyper/MultiLayerPerceptron/perceptron.html] is provided on the official website of the University of Idaho at Idaho Falls.
 +
 
 +
=== Further Reading for Perceptron ===
 +
 
 +
1. Neural Network Classifiers Estimate Bayesian a posteriori Probabilities
 +
[http://www.mitpressjournals.org/doi/abs/10.1162/neco.1991.3.4.461]
 +
 
 +
2. A perceptron network for functional identification and control of nonlinear systems
 +
[http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=286893]
 +
 
 +
 +
 
 +
==Perceptron Learning Algorithm and Feed Forward Neural Networks - October 21, 2010 ==
 +
===Lecture Summary===
 +
In this lecture, we finalize our discussion of the Perceptron by reviewing its learning algorithm, which is based on [http://en.wikipedia.org/wiki/Gradient_descent gradient descent]. We then begin the next topic, Neural Networks (NN), and we focus on a NN that is useful for classification: the Feed Forward Neural Network ([http://www.learnartificialneuralnetworks.com/robotcontrol.html#aproach1 FFNN]). The mathematical model for the FFNN is shown, and we review one of its most popular learning algorithms: Back-Propagation.
 +
 
 +
To open the Neural Network discussion, we present a formulation of the [http://en.wikipedia.org/wiki/Universal_approximation_theorem universal function approximator]. The mathematical model for Neural Networks is then built upon this formulation. We also discuss the trade-off between training error and testing error -- known as the generalization problem -- under the universal function approximator section.
 +
 
 +
There is useful information in [http://page.mi.fu-berlin.de/rojas/neural/chapter/K4.pdf] by R. Rojas about Perceptron learning.
 +
 
 +
===Perceptron===
 +
The last lecture introduced the Perceptron and showed how it can suggest a solution for the 2-class classification problem. We saw that the solution requires minimization of a cost function, which is basically a summation of the distances of the misclassified data points to the separating hyperplane. This cost function is
 +
 
 +
<math>D(\underline{\beta},\beta_0)=-\sum_{i \in M}y_{i}(\underline{\beta}^T \underline{x}_i+\beta_0),</math>
 +
 
 +
in which <math>\,M</math> is the set of misclassified points. Thus, the objective is to find <math>\arg\min_{\underline{\beta},\beta_0} D(\underline{\beta},\beta_0)</math>.
 +
 
 +
====Perceptron Learning Algorithm====
 +
To minimize <math>D(\underline{\beta},\beta_0)</math>, an algorithm that uses gradient descent has been suggested. Gradient descent, also known as steepest descent, is a numerical optimization technique that starts from an initial value for <math>(\underline{\beta},\beta_0)</math> and iteratively approaches an optimal solution. Each iteration updates <math>(\underline{\beta},\beta_0)</math> by subtracting from it a multiple of the gradient of <math>D(\underline{\beta},\beta_0)</math>. Mathematically, this gradient is
 +
 
 +
<math>\nabla D(\underline{\beta},\beta_0)
 +
= \left( \begin{array}{c}\cfrac{\partial D}{\partial \underline{\beta}} \\ \\
 +
    \cfrac{\partial D}{\partial \beta_0} \end{array} \right)
 +
= \left( \begin{array}{c} -\displaystyle\sum_{i \in M}y_{i}\underline{x}_i^T \\   
 +
                          -\displaystyle\sum_{i \in M}y_{i} \end{array} \right)</math>
 +
 +
However, the perceptron learning algorithm does not use the sum of the contributions from all of the observations to calculate the gradient in each step.  Instead, each step uses the gradient contribution from only a single observation, and each successive step uses a different observation. This slight modification is called stochastic gradient descent. That is, instead of subtracting some factor of <math>\nabla D(\underline{\beta},\beta_0)</math> at each step, we subtract a factor of
 +
 
 +
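<math>\left( \begin{array}{c} -y_{i}\underline{x}_i^T \\   
 +
                          -y_{i} \end{array} \right)</math>, which is the contribution of a single misclassified observation <math>\,i \in M</math> to the gradient.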
 +
 
 +
As a result, the pseudo code for the Perceptron Learning Algorithm is as follows:
 +
 
 +
:1) Choose a random initial value <math>\begin{pmatrix}
 +
\underline{\beta}^0\\
 +
\beta_0^0
 +
\end{pmatrix}</math> for <math>(\underline{\beta},\beta_0)</math>.
 +
 
 +
:2) <math>\begin{pmatrix}
 +
\underline{\beta}^{\mathrm{old}}\\
 +
\beta_0^{\mathrm{old}}
 +
\end{pmatrix}
 +
\leftarrow
 +
\begin{pmatrix}
 +
\underline{\beta}^0\\
 +
\beta_0^0
 +
\end{pmatrix}</math>
 +
 
 +
:3) <math>\begin{pmatrix}
 +
\underline{\beta}^{\mathrm{new}}\\
 +
\beta_0^{\mathrm{new}}
 +
\end{pmatrix}
 +
\leftarrow
 +
\begin{pmatrix}
 +
\underline{\beta}^{\mathrm{old}}\\
 +
\beta_0^{\mathrm{old}}
 +
\end{pmatrix}
 +
+\rho
 +
\begin{pmatrix}
 +
y_i \underline{x_i^T}\\
 +
y_i
 +
\end{pmatrix}</math> for some <math>\,i \in M</math>.
 +
 
 +
:4) If the termination criterion has not been met, go back to step 3 and use a different observation datapoint (i.e. a different <math>\,i</math>).
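The four steps above can be sketched in Python as follows. This is an illustrative sketch rather than a reference implementation: the function name <code>train_perceptron</code>, the <code>max_iter</code> cap, the fixed random seed, the toy data, and the rule that a point on the boundary counts as misclassified are all assumptions introduced here.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_perceptron(X, y, rho=1.0, max_iter=1000, seed=0):
    """Return (beta, beta0, steps) after running the update rule of step 3."""
    rng = random.Random(seed)
    # Steps 1-2: random initial value for (beta, beta0).
    beta = [rng.uniform(-1.0, 1.0) for _ in X[0]]
    beta0 = rng.uniform(-1.0, 1.0)
    for step in range(max_iter):
        # M: indices of the points that are currently misclassified.
        M = [i for i in range(len(X)) if y[i] * (dot(beta, X[i]) + beta0) <= 0]
        if not M:  # termination criterion: no misclassified points remain
            return beta, beta0, step
        i = rng.choice(M)  # step 4: use some observation i in M
        # Step 3: add rho * (y_i x_i, y_i) to the old parameters.
        beta = [b + rho * y[i] * x for b, x in zip(beta, X[i])]
        beta0 += rho * y[i]
    return beta, beta0, max_iter


# Hypothetical linearly separable toy data with labels in {-1, +1}.
X = [(2.0, 2.0), (3.0, 1.0), (2.5, 3.0), (-1.0, -1.0), (-2.0, -0.5), (-1.5, -2.0)]
y = [1, 1, 1, -1, -1, -1]
beta, beta0, steps = train_perceptron(X, y)
print(beta, beta0, steps)
```

The <code>max_iter</code> cap is a safeguard: on data that are not linearly separable the set <math>\,M</math> never becomes empty, so some other stopping rule is needed.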
 +
 
 +
The learning rate <math>\,\rho</math> controls the step size of convergence toward <math>\min_{\underline{\beta},\beta_0} D(\underline{\beta},\beta_0)</math>.  A larger value for <math>\,\rho</math> causes the steps to be larger.  If <math>\,\rho</math> is set to be too large, however, then the minimum could be missed (over-stepped).
 +
In practice, <math>\,\rho</math> can be adaptive rather than fixed: <math>\,\rho</math> can be larger in the first steps than in the last steps, gradually declining as the steps progress towards convergence. At the beginning, a larger <math>\,\rho</math> helps to find an approximate answer sooner, and a smaller <math>\,\rho</math> towards the last steps helps to tune the final answer more accurately. Much work has been done on adaptive learning rates. For interested readers, examples are [http://www.math.upatras.gr/~dgs/papers/reports/tr98-02.pdf this paper] by ''Plagianakos et al.'' and [http://cnl.salk.edu/~schraudo/pubs/Schraudolph99c.pdf this paper] by ''Schraudolph''.
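One simple way to obtain a declining learning rate is a fixed decay schedule. The sketch below uses <math>\rho_t = \rho_0 / (1 + t/\tau)</math>; this particular formula and the names <code>decaying_rate</code>, <code>rho0</code> and <code>tau</code> are assumptions chosen for illustration, and they are not the methods of the papers cited above.

```python
def decaying_rate(step, rho0=1.0, tau=10.0):
    """Learning rate at a given step: starts at rho0 and shrinks as step grows."""
    return rho0 / (1.0 + step / tau)


# The rate is large in the early steps and small in the later ones.
rates = [decaying_rate(t) for t in (0, 10, 90)]
print(rates)
```

In a perceptron update, the value <code>decaying_rate(step)</code> would take the place of the fixed <math>\,\rho</math>.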
 +
 
 +
 
 +
As mentioned earlier, the learning algorithm uses just one of the data points at each iteration; this is the common practice when dealing with online applications. In an online application, datapoints are accessed one-at-a-time because training data is not available in batch form. The learning algorithm does not require the derivative of the cost function with respect to the previously seen points; instead, we just have to take into consideration the effect of each new point.
 +
 
 +
One way that the algorithm could terminate is if there are no more misclassified points (i.e. if the set <math>\,M</math> is empty). Alternatively, the algorithm could continue until some other termination criterion is reached, even if there are still points in <math>\,M</math>. The termination criterion for an optimization algorithm is usually convergence, but for numerical methods this is not well-defined. In theory, convergence is realized when the gradient of the cost function is zero; in numerical methods, a gradient that is within some margin of error of zero is accepted instead.
 +
 
 +
If the data are linearly separable, the algorithm is theoretically guaranteed to converge in a finite number of iterations.  This number of iterations depends on the
 +
 
 +
* learning rate <math>\,\rho</math>
 +
 
 +
* initial value <math>\begin{pmatrix}
 +
\underline{\beta}^0\\
 +
\beta_0^0
 +
\end{pmatrix}</math>
 +
 
 +
* difficulty of the problem. The problem is more difficult if the gap between the classes of data is very small.
 +
 
 +
Note that we treat the offset term <math>\,\beta_0</math> separately from <math>\underline{\beta}</math> to distinguish this formulation from those in which only the direction of the hyperplane (<math>\underline{\beta}</math>) is considered.
 +
 
 +
A major concern about gradient descent is that it may get trapped in locally optimal solutions. Several works, such as [http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=00298667 this paper] by ''Cetin et al.'' and [http://indian.cp.eng.chula.ac.th/cpdb/pdf/research/fullpaper/847.pdf this paper] by ''Atakulreka et al.'', address this issue.
 +
 
 +
====Some notes on the Perceptron Learning Algorithm====
 +
 
 +
* If the training data points are available in batch form, it is better to take advantage of a closed-form optimization technique like least-squares or maximum-likelihood estimation for linear classifiers. (These closed-form solutions had been around for many years before the invention of the Perceptron.)
 +
 
 +
* Just like a linear classifier, a Perceptron can discriminate between only two classes at a time, and one can generalize its performance for multi-class problems by using one of the <math>k-1</math>, <math>k</math>, or <math>k(k-1)/2</math>-hyperplane methods.
 +
 
 +
* If the two classes are linearly separable, the algorithm will converge in a finite number of iterations to a hyperplane that achieves zero error on the training data. The convergence is guaranteed if the learning rate is set adequately.
 +
 
 +
* If the two classes are not linearly separable, the algorithm will never converge. So, one may think of a termination criterion in these cases (e.g. a maximum number of iterations in which convergence is expected, or the rate of changes in both a cost function and its derivative).
 +
 
 +
* In the case of linearly separable classes, the final solution and the number of iterations will be dependent on the initial values (which are arbitrarily chosen), the learning rate (for example, fixed or adaptive), and the gap between the two classes. In general, a smaller gap between classes requires a greater number of iterations for the algorithm to converge.
 +
 
 +
* Learning rate --or updating step-- has a direct impact on both the number of iterations and the accuracy of the solution for the optimization problem. Smaller quantities of this factor make convergence slower, even though we will end up with a more accurate solution. In the opposite way, larger values of the learning rate make the process faster, even though we may lose some precision. So, one may make a balance for this trade-off in order to get to an accurate enough solution fast enough (exploration vs. exploitation). In addition, an adaptive learning rate that starts off with a large value and then gradually decreases to a small value over the steps toward convergence can be used in place of a fixed learning rate.
 +
 
 +
In the upcoming lectures, we introduce Support Vector Machines (SVM), which use an iterative optimization scheme similar to the one the Perceptron suggests, but have a different definition for the cost function.
 +
 
 +
===An example of the determination on learning rate===
 +
( Based on J. Amini  Optimum Learning Rate in Back-Propagation Neural Network for Classification
 +
of Satellite Images (IRS-1D) Scientia Iranica, Vol. 15, No. 6, pp. 558-567 )
 +
 
 +
The learning rate plays an important role in the application of a Neural Network (NN). Choosing an optimum learning rate helps us obtain the best regression model as quickly as possible. When NNs are trained with different algorithms, the optimum learning rate tends to be determined differently. In the paper ''Optimum Learning Rate in Back-Propagation Neural Network for Classification of Satellite Images (IRS-1D)'', the author applied networks with one hidden layer and with two hidden layers to satellite images using Variable Learning Rate (VLR) algorithms, and compared the optimum learning rates of the various networks. In practice, the number of neurons should be neither very small nor very large: a network with too few neurons does not have enough degrees of freedom to fit the data, while a network with too many neurons is more likely to over-fit. For this reason, the number of neurons in the experiment ranges from 3 to 40. The optimum learning rate found across the various cases stays within 0.001-0.006. In practice, we could use a similar approach to estimate the optimum learning rate and improve our models. For more details, please see the article mentioned above.
 +
 
 +
===Universal Function Approximator===
 +
In mathematics, the [http://en.academic.ru/dic.nsf/enwiki/10694320 Universal Approximation Theorem] states that the standard multilayer feed-forward neural network with a single hidden layer that contains a finite and sufficient number of hidden neurons and having an arbitrary activation function for each neuron is a universal approximator on a compact subset of <math>\mathbb{R}^n</math> under the assumption that the output units are always linear. George Cybenko first proved this theorem in 1989 for a sigmoid activation function, and thus the Universal Approximation Theorem is also called Cybenko's Theorem. For interested readers, a detailed proof of Cybenko's Theorem is given in [http://cs.haifa.ac.il/~hhazan01/Advance%20Seminar%20on%20Neuro-Computation/2010/nn1.pdf this presentation] by Yousef Shajrawi and Fadi Abboud. In 1991, Kurt Hornik showed that the potential of a particular neural network of being a universal approximator does not depend on the specific choice of the activation function used by the neurons, rather it depends on the multilayer feedforward architecture itself that is used by that neural network.
 +
 
 +
 
 +
The universal function approximator is a mathematical formulation for a group of estimation techniques. The usual formulation for it is
 +
 
 +
<math>\hat{Y}(x)=\sum\limits_{i=1}^{n}\alpha_i\sigma(\omega_i^Tx+b_i),</math>
 +
 
 +
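where <math>\hat{Y}(x)</math> is an estimate of a function <math>\,Y(x)</math>. According to the universal approximation theorem, for any <math>\,\epsilon > 0</math> there exist a number of terms <math>\,n</math> and parameters <math>\,\alpha_i</math>, <math>\,\omega_i</math>, <math>\,b_i</math> such that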
 +
 
 +
<math>|\hat{Y}(x) - Y(x)|<\epsilon,</math>
 +
 
 +
which means that <math>\hat{Y}(x)</math> can get as close to <math>\,Y(x)</math> as necessary.
 +
 
 +
This formulation assumes that the estimate <math>\hat{Y}(x)</math> is a linear combination of a set of functions of the form <math>\,\sigma(.)</math>, where <math>\,\sigma(.)</math> is a nonlinear function applied to a linear combination of the inputs (the <math>\,x_i</math>'s).
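To make the formulation concrete, the sketch below evaluates <math>\hat{Y}(x)=\sum_i\alpha_i\sigma(\omega_i^Tx+b_i)</math> in Python, taking <math>\,\sigma</math> to be the logistic sigmoid. The function name <code>y_hat</code> and the hand-picked parameters are assumptions for illustration: with only two terms they approximate a "bump" that is close to 1 on the interval (0.25, 0.75) and close to 0 outside it.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def y_hat(x, alphas, omegas, biases):
    """Evaluate sum_i alpha_i * sigmoid(omega_i^T x + b_i) for one input vector x."""
    return sum(
        alpha * sigmoid(sum(w * x_j for w, x_j in zip(omega, x)) + b)
        for alpha, omega, b in zip(alphas, omegas, biases)
    )


# Two steep sigmoids of opposite sign: sigma(50(x - 0.25)) - sigma(50(x - 0.75)).
alphas = [1.0, -1.0]
omegas = [(50.0,), (50.0,)]
biases = [-12.5, -37.5]

for x in (0.0, 0.5, 1.0):
    print(x, y_hat((x,), alphas, omegas, biases))
```

Adding more terms with different offsets and heights lets the sum follow more complicated functions, which is the intuition behind the theorem.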
 +
 
 +
====Generalization Factors====
 +
Even though this formulation represents a universal function approximator, which means that it can be fitted to a set of data as closely as demanded, the closeness of fit must be carefully decided upon. In many cases, the purpose of the model is to target unseen data. However, the fit to this unseen data is impossible to determine before it arrives.
 +
 
 +
To overcome this dilemma, a common practice is to divide the set of available data points into two sets: training data and validation (test) data.  We use the training data to estimate the fixed parameters for the model, and then use the validation data to find values for the construction-dependent parameters. How these construction-dependent parameters vary depends on the model. In the case of a polynomial, the construction-dependent parameter would be its highest degree, and for a neural network, the construction-dependent parameter could be the number of hidden layers and the number of neurons in each layer.
 +
 
 +
These matters of model generalization vs. complexity will be discussed in more detail in the lectures to follow.
 +
 
 +
===Feed-Forward Neural Network===
 +
A Neural Network (NN) is one instance of the universal function approximator. It can be thought of as a system of Perceptrons linked together as units of a network.  One particular NN useful for classification is the Feed-Forward Neural Network ([http://www.learnartificialneuralnetworks.com/robotcontrol.html#aproach1 FFNN]), which consists of multiple "hidden layers" of Perceptron units (also known as neurons). Our discussion here is based around the FFNN, which has a topology shown in Figure 1. The neurons in the input layer take the original features (the <math>\,x_i</math>'s) as their inputs and pass them unchanged as their outputs to the first hidden layer. From the first layer (the input layer) to the last hidden layer, connections from each neuron are always directed to the neurons in the next adjacent layer.  In the output layer, which receives input only from the last hidden layer, each neuron produces a target measurement for a distinct class. <math>\,K</math> classes typically require <math>\,K</math> output neurons in the output layer. In the case where the target variable has two values, it suffices to have one output node in the output layer, although it is generally necessary for the single output node to have a sigmoid activation function so as to restrict the output of the neural network to be a value between 0 and 1. As shown in Figure 1, the neurons in a single layer are typically distributed vertically, and the inputs and outputs of the network are shown as the far left layer and the far right layer, respectively. Furthermore, as shown in Figure 1, it is often useful to add an extra hidden node to each hidden layer that represents the bias term (or the intercept term) of that hidden layer's hyperplane. Each bias node usually outputs a constant value of -1. The purpose of adding a bias node to each hidden layer is to ensure that the hyperplane of that hidden layer does not necessarily have to pass through the origin. 
In Figure 1, the bias node in the single hidden layer is the topmost hidden node in that layer. 
 +
 
 +
[[File:FFNN.png|300px|thumb|right|Fig.1 A common architecture for the FFNN]]
 +
 
 +
====Mathematical Model of the FFNN with One Hidden Layer====
 +
 
 +
The FFNN with one hidden layer for a <math>\,K</math>-class problem is defined as follows:<br /> Let <math>\,d</math> be the number of input features, <math>\,p</math> be the number of neurons in the hidden layer, and <math>\,K</math> be the number of classes which is also typically the number of neurons in the output layer in the case where <math>\,K</math> is greater than 2.
 +
 
 +
Each neuron calculates its derived feature (i.e. output) using a linear combination of its inputs. Suppose <math>\,\underline{x}</math> is the <math>\,d</math>-dimensional vector of input features. Then, each hidden neuron uses a <math>\,d</math>-dimensional vector of weights to combine these input features. For the <math>\,i</math>th hidden neuron, let <math>\underline{u}_i</math> be this neuron's vector of weights. The linear combination calculated by the <math>\,i</math>th hidden neuron is then given by
 +
 
 +
<math>a_i = \underline{u}_i^T\underline{x} = \sum_{j=1}^{d}u_{ij}x_j, \quad i=1,\dots,p</math>
 +
 
 +
 
 +
However, we want the derived feature of each hidden neuron and each output neuron to lie between 0 and 1, so we apply an ''activation function'' <math>\,\sigma(a)</math> to each hidden or output neuron. The derived feature of each hidden or output neuron <math>\,i</math> is then given by
 +
 
 +
<math>\,z_i = \sigma(a_i)</math> where <math>\,\sigma</math> is typically the logistic sigmoid function <math>\sigma(a) = \cfrac{1}{1+e^{-a}}</math>.
 +
 
 +
 
 +
Now, we place each of the derived features <math>\,z_i</math> from the hidden layer into a <math>\,p</math>-dimensional vector:
 +
 
 +
<math>\underline{z} = \left[ \begin{array}{c} z_1 \\ z_2 \\ \vdots \\ z_p \end{array}\right]</math>
 +
 
 +
As in the hidden layer, each neuron in the output layer calculates its derived feature using a linear combination of its inputs which are the elements of <math>\underline{z}</math>. Each output neuron uses a <math>\,p</math>-dimensional vector of weights to combine its inputs derived from the hidden layer. Let <math>\,\underline{w}_k</math> be the vector of weights used by the <math>\,k</math>th output neuron. The linear combination calculated by the <math>\,k</math>th output neuron is then given by
 +
<math>\hat{y}_k = \underline{w}_k^T\underline{z} = \sum_{j=1}^{p}w_{kj}z_j, \quad k=1,\dots,K</math>.
 +
 
 +
<math>\,\hat y_k</math> is thus the target measurement for the <math>\,k</math>th class. In the case of regression, it is not necessary to apply an activation function <math>\,\sigma</math> to the output neurons, since the target values are continuous and need not be bounded (the hidden neurons still need a nonlinear activation function; otherwise the whole network reduces to a linear model). In the case of classification, an activation function <math>\,\sigma</math> is applied to the output neurons as well, so as to ensure that the outputs are in the <math> [0, 1]</math> interval.
 +
{{Cleanup|date=December 2010|reason=The sentence above is misleading, I think. The outputs will not be discrete, we need the activation function in order to keep them in the {0,1} interval. Please correct me if I'm wrong.}}
 +
 
 +
Notice that in each neuron, two operations take place one after the other:
 +
 
 +
* a linear combination of the neuron's inputs is calculated using corresponding weights
 +
 
 +
* a nonlinear operation on the linear combination is performed.
 +
 
 +
These two calculations are shown in Figure 2.
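The two operations performed by each neuron, and the resulting forward pass through a network with one hidden layer, can be sketched in Python as follows. This is a minimal sketch under stated assumptions: bias nodes are omitted, the output neurons are left linear as in the formula for <math>\,\hat y_k</math> above, and the function name <code>forward</code> and the example weights are hypothetical.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def forward(x, U, W):
    """Forward pass: U holds p hidden weight vectors (length d), W holds K output weight vectors (length p)."""
    # Hidden layer, operation 1: linear combination a_i = u_i^T x.
    a = [sum(u_ij * x_j for u_ij, x_j in zip(u_i, x)) for u_i in U]
    # Hidden layer, operation 2: nonlinear activation z_i = sigma(a_i).
    z = [sigmoid(a_i) for a_i in a]
    # Output layer: linear combination y_hat_k = w_k^T z (no activation applied here).
    y_hat = [sum(w_kj * z_j for w_kj, z_j in zip(w_k, z)) for w_k in W]
    return a, z, y_hat


# Hypothetical example with d = 2 inputs, p = 2 hidden neurons, K = 2 outputs.
x = (1.0, 2.0)
U = [(0.0, 0.0), (1.0, -0.5)]
W = [(1.0, 1.0), (2.0, -2.0)]
a, z, y_hat = forward(x, U, W)
print(a, z, y_hat)
```

For classification, one would additionally pass each <code>y_hat</code> value through an activation function such as <code>sigmoid</code> to keep the outputs between 0 and 1.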
 +
 
 +
The nonlinear function <math>\,\sigma(.)</math> is called the activation function. Activation functions, like the logistic function shown earlier, are usually continuous and usually have a finite range with regard to their outputs. Another common activation function used in neural networks is the hyperbolic tangent function <math>\,\sigma(a) = \tanh(a)</math> (Figure 3). The logistic sigmoid activation function <math>\sigma(a) = \cfrac{1}{1+e^{-a}}</math> and the hyperbolic tangent activation function are very similar to each other. One major difference between them is that, as shown in their illustrations, the output range of the logistic sigmoid activation function is <math>\,(0,1)</math> while that of the hyperbolic tangent activation function is <math>\,(-1,1)</math>. Typically, in a neural network used for classification tasks, the logistic sigmoid activation function is used rather than any other type of activation function. The reason is that, as explained in detail in [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=809075&tag=1 this paper] by ''Helmbold et al.'', the logistic sigmoid activation function results in the least [http://www.soe.ucsc.edu/classes/cmps290c/Spring09/lect/7/pap_slides.pdf matching loss] as compared to other types of activation functions.
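The similarity between the two activation functions can be made precise: the hyperbolic tangent is a shifted and rescaled logistic sigmoid, <math>\tanh(a) = 2\sigma(2a) - 1</math>. The short Python check below illustrates this identity numerically; the helper name <code>logistic</code> and the sample points are chosen here only for illustration.

```python
import math


def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))


# Compare tanh(a) with 2 * logistic(2a) - 1 at a few sample points.
samples = (-2.0, 0.0, 0.5, 3.0)
gaps = [abs(math.tanh(a) - (2.0 * logistic(2.0 * a) - 1.0)) for a in samples]
print(gaps)
```

The gaps are zero up to floating-point rounding, so the two functions differ only in the range and steepness of their outputs.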
 +
 
 +
[[File:neuron2.png|300px|thumb|right|Fig.2 A general construction for a single neuron]]
 +
[[File:actfcn.png|300px|thumb|right|Fig.3 <math>tanh</math> as activation function]]
 +
 
 +
The NN can be applied as a regression method or as a classifier, and the output layer differs depending on the application. The major difference between regression and classification is in the output space of the model, which is continuous in the case of regression and discrete in the case of classification. For a regression task, no consideration is needed beyond what has already been mentioned earlier, since the outputs of the network would already be continuous. However, to use the neural network as a classifier, as mentioned above, it is necessary to have a threshold stage for each of the hidden and output neurons using an activation function.
 +
 
 +
====Mathematical Model of the FFNN with Multiple Hidden Layers====
 +
In the FFNN model with a single hidden layer, the derived features were represented as elements of the vector <math>\underline{z}</math>, and the original features were represented as elements of the vector <math>\underline{x}</math>.  In the FFNN model with more than one hidden layer, <math>\underline{z}</math> is processed by the second hidden layer in the same way that <math>\underline{x}</math> was processed by the first hidden layer. Perceptrons in the second hidden layer each use their own combination of weights to calculate a new set of derived features.  These new derived features are processed by the third hidden layer in a similar way, and the cycle repeats for each additional hidden layer. This progression of processing is depicted in Figure 4.
 +
 
 +
====Back-Propagation Learning Algorithm====
 +
 
 +
[[File:bpl.png|300px|thumb|right|Fig.4 Labels for weights and derived features in the FFNN.]]
 +
 
 +
Every linear-combination calculation in the FFNN involves weights that need to be updated after they are initialized to be small random values, and these weights are updated using an algorithm called Back-Propagation when each data point in the training data-set is fed into the neural network. This algorithm is similar to the gradient-descent algorithm introduced in the discussion of the Perceptron. The primary difference is that the gradient used in Back-Propagation is calculated in a more complicated way.
 +
 
 +
First of all, we want to minimize the error between the estimated target measurement and the true target measurement of each input from the training data-set. That is, if <math>\,U</math> is the set of all weights in the FFNN, then we want to determine
 +
 
 +
<math>\arg\min_U \left|y - \hat{y}\right|^2</math> for each data point in the training data-set.
 +
 
 +
Now, suppose the hidden layers of the FFNN are labelled as in Figure 4. Then, we want to determine the derivative of <math>\left|y - \hat{y}\right|^2</math> with respect to each weight in the hidden layers of the FFNN. For weights <math>\,u_{jl}</math> this means we will need to find
 +
 
 +
<math>
 +
\cfrac{\partial \left|y - \hat{y}\right|^2}{\partial u_{jl}}
 +
= \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial a_j}\cdot
 +
\cfrac{\partial a_j}{\partial u_{jl}} = \delta_{j}z_l
 +
</math>
 +
 
 +
However, the closed-form solution for <math>\,\delta_{j}</math> is unknown, so we develop a recursive definition (<math>\,\delta_{j}</math> in terms of <math>\,\delta_{i}</math>):
 +
 
 +
<math>
 +
\delta_j = \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial a_j}
 +
= \sum_{i=1}^p \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial a_i}\cdot
 +
  \cfrac{\partial a_i}{\partial a_j}
 +
= \sum_{i=1}^p \delta_i\cdot u_{ij} \cdot \sigma'(a_j)
 +
= \sigma'(a_j)\sum_{i=1}^p \delta_i \cdot u_{ij}
 +
</math>
 +
 
 +
We also need to determine the derivative of <math>\left|y - \hat{y}\right|^2</math> with respect to each weight in the ''output layer'' <math>\,k</math> of the FFNN (this layer is not shown in Figure 4, but it would be the next layer to the right of the rightmost layer shown). For weights <math>\,u_{ki}</math> this means
 +
 
 +
<math>
 +
\cfrac{\partial \left|y - \hat{y}\right|^2}{\partial u_{ki}}
 +
= \cfrac{\partial \left|y - \sum_i u_{ki}z_i\right|^2}{\partial u_{ki}}
 +
= -2(y - \sum_i u_{ki}z_i)z_i
 +
= -2(y - \hat{y})z_i
 +
</math>
 +
 
 +
Analogous to our computation of <math>\,\delta_j</math>, we define
 +
 
 +
<math>\delta_k = \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial a_k}</math>
 +
 
 +
{{Cleanup|date=November 2 2010|reason= It is true that an activation function is not applied to each output neuron if the neural network is used for regression. But, if the neural network is used for classification, I think it is necessary to apply an activation function to each output neuron. I believe that this is correct. In Chapter 5.2 of Pattern Recognition and Machine Learning by Christopher Bishop  , it is written that for 2 class classification sigmoid output functions are used and for multi-class the [http://en.wikipedia.org/wiki/Softmax_activation_function Softmax]function is used.}}
 +
 
 +
{{Cleanup|date=November 2 2010|reason= To avoid an extra stage of thresholding, it is suggested for classification task to use the outputs of the hidden units themselves, instead of a linear combination of them. This does not make any sense to me. It is likely that there are more hidden units than output units , so how would you use these to do the classification? }}
 +
 
 +
However, <math>\,a_k = \hat{y}</math> because an activation function is not applied in the output layer. So, our calculation becomes
 +
 
 +
<math>\delta_k = \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial \hat{y}}
 +
= -2(y - \hat{y})</math>
 +
 
 +
Now that we have <math>\,\delta_k</math> and a recursive definition for <math>\,\delta_j</math>, it is clear that our weights can be deduced by starting from the output layer and working leftwards through the hidden layers one layer at a time towards the input layer.
 +
 
 +
Based on the above derivation, our algorithm for determining weights in the FFNN is as follows:
 +
 
 +
First, choose small random values to initialize the network weights. Then, during each epoch (a single pass through all of the training data points), the training data points are fed into the FFNN one at a time. The network weights are updated using the back-propagation algorithm each time a training data point <math>\underline{x}</math> is fed into the FFNN. This update is done using the following steps:
 +
 
 +
 
 +
* Apply <math>\underline{x}</math> to the FFNN's input layer, and calculate the outputs of all input neurons.
 +
 
 +
 
 +
* Propagate the outputs of each hidden layer forward, one hidden layer at a time, and calculate the outputs of all hidden neurons.
 +
 
 +
 
 +
* Once <math>\underline{x}</math> reaches the output layer, calculate the output(s) of all output neuron(s) given the outputs of the previous hidden layer.
 +
 
 +
 
 +
* At the output layer, compute <math>\,\delta_k = -2(y_k - \hat{y}_k)</math> for each output neuron, then compute <math>\cfrac{\partial \left|y - \hat{y}\right|^2}{\partial u_{ki}} = \delta_{k}z_i</math> for all output-layer weights <math>\,u_{ki}</math>, and then update <math>u_{ki}^{\mathrm{new}} \leftarrow u_{ki}^{\mathrm{old}} - \rho \cdot \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial u_{ki}} </math> for all weights <math>\,u_{ki}</math>. Here, <math>\,\rho</math> is the learning rate.
 +
 
 +
 
 +
* Starting from the last hidden layer, back-propagate layer-by-layer to the first hidden layer. At each hidden layer, compute <math>\delta_j = \sigma'(a_j)\sum_{i=1}^p \delta_i \cdot u_{ij}</math> for all hidden neurons in that layer, then compute <math>\cfrac{\partial \left|y - \hat{y}\right|^2}{\partial u_{jl}} = \delta_{j}z_l</math> for all weights <math>\,u_{jl}</math>, and then update <math>u_{jl}^{\mathrm{new}} \leftarrow u_{jl}^{\mathrm{old}} - \rho \cdot \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial u_{jl}} </math> for all weights <math>\,u_{jl}</math>. Here, <math>\,\rho</math> is the learning rate.
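The steps above can be sketched in Python for a single training point. This is a minimal illustration rather than the lecture's implementation: it assumes <math>\,\tanh</math> as the activation function, a linear output layer, no bias terms, and it computes every gradient with the old weights before any weight is updated. All function names are hypothetical.

```python
import math

def sigma(a):
    return math.tanh(a)

def sigma_prime(a):
    return 1.0 - math.tanh(a) ** 2

def forward(layers, x):
    """Return the pre-activations a and the outputs z of every layer; the last layer is linear."""
    zs, pre = [list(x)], []
    for depth, U in enumerate(layers):
        a = [sum(u * z for u, z in zip(row, zs[-1])) for row in U]
        pre.append(a)
        is_output = depth == len(layers) - 1
        zs.append(a if is_output else [sigma(v) for v in a])
    return pre, zs

def loss(layers, x, y):
    """Squared error |y - y_hat|^2 for one data point."""
    _, zs = forward(layers, x)
    return sum((yk - yk_hat) ** 2 for yk, yk_hat in zip(y, zs[-1]))

def gradients(layers, x, y):
    """Back-propagate the deltas; grads[d][j][l] = delta_j * z_l for weight layer d."""
    pre, zs = forward(layers, x)
    # output layer: delta_k = -2 (y_k - y_hat_k)
    delta = [-2.0 * (yk - yk_hat) for yk, yk_hat in zip(y, zs[-1])]
    grads = [None] * len(layers)
    for d in range(len(layers) - 1, -1, -1):
        grads[d] = [[dj * zl for zl in zs[d]] for dj in delta]
        if d > 0:
            U = layers[d]
            # hidden layer: delta_j = sigma'(a_j) * sum_i delta_i * u_ij
            delta = [
                sigma_prime(pre[d - 1][j]) * sum(delta[i] * U[i][j] for i in range(len(U)))
                for j in range(len(U[0]))
            ]
    return grads

def sgd_step(layers, x, y, rho):
    """Update every weight: u_new = u_old - rho * gradient."""
    grads = gradients(layers, x, y)
    return [
        [[u - rho * g for u, g in zip(row, grow)] for row, grow in zip(U, G)]
        for U, G in zip(layers, grads)
    ]

# Hypothetical network: 2 inputs -> 2 hidden neurons -> 1 linear output.
layers = [[[0.1, -0.2], [0.3, 0.4]], [[0.5, -0.6]]]
x, y = [1.0, 2.0], [0.5]
new_layers = sgd_step(layers, x, y, rho=0.05)
```

A useful sanity check for any such implementation is to compare one entry of <code>gradients</code> with a finite-difference estimate of the same derivative; the two should agree closely.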
 +
 
 +
 
 +
Usually, a fairly large number of epochs is necessary to train the FFNN so that the network weights are close to their optimal values. The learning rate <math> \,\rho </math> should be chosen carefully. Usually, <math> \,\rho </math> should satisfy <math> \,\rho \rightarrow 0 </math> as the iteration count <math> i \rightarrow \infty </math>. [http://www.youtube.com/watch?v=fJ7eH0Y7xEM This] is an interesting video depicting the training procedure of the weights of an FFNN using the back-propagation algorithm.
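One simple way to make the learning rate tend to zero is an inverse-decay schedule. The form <math>\,\rho_i = \rho_0 / (1 + c\,i)</math> and the constants below are assumptions chosen for illustration; the lecture does not prescribe a particular schedule.

```python
def learning_rate(t, rho0=0.1, decay=0.01):
    """Hypothetical schedule rho_t = rho0 / (1 + decay * t), which tends to 0 as t grows."""
    return rho0 / (1.0 + decay * t)

# Learning rate at the start, after 100 updates, and after 10,000 updates.
rates = [learning_rate(t) for t in (0, 100, 10_000)]
```

Larger values of <code>decay</code> shrink the step size sooner, which trades speed in the early iterations for stability later on.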
 +
 
 +
A Matlab implementation of the pseudocode above is given as an example in the Weight Decay subsection under the [[Regularization for Neural Network - November 4, 2010|Regularization]] title.
 +
 
 +
====Alternative Description of the Back-Propagation Algorithm====
 +
Label the inputs and outputs of the <math>\,i</math>th hidden layer <math>\underline{x}_i</math> and <math>\underline{y}_i</math> respectively, and let <math>\,\sigma(.)</math> be the activation function for all neurons.  We now have
 +
 
 +
<math>\begin{align}
 +
\begin{cases}
 +
\underline{y}_1=\sigma(W_1.\underline{x}_1),\\
 +
\underline{y}_2=\sigma(W_2.\underline{x}_2),\\
 +
\underline{y}_3=\sigma(W_3.\underline{x}_3),
 +
\end{cases}
 +
\end{align}</math>
 +
 
 +
where <math>\,W_i</math> is the matrix of connection weights between layers <math>\,i</math> and <math>\,i+1</math>; it has <math>\,n_i</math> columns and <math>\,n_{i+1}</math> rows, where <math>\,n_i</math> is the number of neurons in the <math>\,i^{th}</math> layer.
 +
 
 +
Considering these matrix equations, one can write a closed form for the derivative of the error with respect to the weights of the network. For a neural network with two hidden layers, the equations are as follows:
 +
 
 +
<math>\begin{align}
 +
\frac{\partial E}{\partial W_3}=&diag(e).\sigma'(W_3.\underline{x}_3).(\underline{x}_3)^T,\\
 +
\frac{\partial E}{\partial W_2}=&\sigma'(W_2.\underline{x}_2).(\underline{x}_2)^T.diag\{\sum rows\{diag(e).diag(\sigma'(W_3.\underline{x}_3)).W_3\}\},\\
 +
\frac{\partial E}{\partial W_1}=&\sigma'(W_1.\underline{x}_1).(\underline{x}_1)^T.diag\{\sum rows\{diag(e).diag(\sigma'(W_3.\underline{x}_3)).W_3.diag(\sigma'(W_2.\underline{x}_2)).W_2\}\},
 +
\end{align}</math>
 +
 
 +
where <math>\,\sigma'(.)</math> is the derivative of the activation function <math>\,\sigma(.)</math>.
 +
 
 +
Using this closed-form derivative, it is possible to code the procedure for any number of layers and neurons. Given below is the Matlab code for the back-propagation algorithm (<math>\,\tanh</math> is used as the activation function).
 +
 
 +
{{Cleanup|date=November 2 2010|reason= This MATLAB code is not clear (no description for the variable and steps is provided). I am not sure, if the code in its current version, which is provided here is of any use.}}
 +
 
 +
{{Cleanup|date=November 2 2010|reason= This code might be more useful, if one consider it along with the above approach for taking derivatives of the error in respect to the weights.}}
 +
 
 +
{{Cleanup|date=November 2 2010|reason= I also think that some descriptions or comments should be added to the code to make it more clear.}}
 +
 
 +
 
 +
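% This code might be used to train a neural network using the backpropagation algorithm.
% The variables below must be defined before the loop; their roles are inferred from how the code uses them.
% ep: maximum number of epochs
% data: training data, one column per data point (rows 1:f are the features, rows f+1:dd are the targets)
% Q: number of training data points
% n: vector of layer sizes, from n(1) inputs to n(l+1) outputs; l: number of weight layers
% w: w is the weights matrix; w(:,:,k) connects layer k to layer k+1 (first row/column belongs to the bias unit)
% ro: the learning rate
% io: matrix of all the inputs and outputs of the network's layers, given the weights matrix, w.
% gp: is the derivatives matrix
% wg: diag(gp)*w for each layer, used to propagate the error backwards
% shuffle: a function for changing the permutation of the data
%
i = 0;             % epoch counter (initialization added; the original snippet assumes it)
E = zeros(1,ep);   % error accumulated in each epoch (initialization added)
while i < ep
    i = i + 1;
    data = shuffle(data,2);                  % new random order of the data points for this epoch
    for j = 1:Q                              % one weight update per training point
        io = zeros(max(n)+1,length(n));
        gp = io;
        io(1:n(1)+1,1) = [1;data(1:f,j)];    % input layer: bias unit followed by the features
        for k = 1:l                          % forward pass
            io(2:n(k+1)+1,k+1) = w(2:n(k+1)+1,1:n(k)+1,k)*io(1:n(k)+1,k);   % pre-activations of layer k+1
            gp(1:n(k+1)+1,k) = [0;1./(cosh(io(2:n(k+1)+1,k+1))).^2];        % tanh'(a) = 1/cosh(a)^2
            io(1:n(k+1)+1,k+1) = [1;tanh(io(2:n(k+1)+1,k+1))];              % outputs of layer k+1 with bias unit
            wg(1:n(k+1)+1,1:n(k)+1,k) = diag(gp(1:n(k+1)+1,k))*w(1:n(k+1)+1,1:n(k)+1,k);
        end
        e = [0;io(2:n(l+1)+1,l+1) - data(f+1:dd,j)];   % output error (zero in the bias slot)
        wg(1:n(l+1)+1,1:n(l)+1,l) = diag(e)*wg(1:n(l+1)+1,1:n(l)+1,l);
        gp(1:n(l+1)+1,l) = diag(e)*gp(1:n(l+1)+1,l);
        d = eye(n(l+1)+1);
        E(i) = E(i) + 0.5*norm(e)^2;         % accumulate the squared error of this epoch
        for k = l:-1:1                       % backward pass: update the weights layer by layer
            w(1:n(k+1)+1,1:n(k)+1,k) = w(1:n(k+1)+1,1:n(k)+1,k) - ro*diag(sum(d,1))*gp(1:n(k+1)+1,k)*(io(1:n(k)+1,k)');
            d = d*wg(1:n(k+1)+1,1:n(k)+1,k); % propagate the error one layer further back
        end
    end
end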
 +
 
 +
=== The Neural Network Toolbox in Matlab ===
 +
 
 +
% Here is a problem consisting of inputs P and targets T that we would like to solve with a network.
 +
P = [0 1 2 3 4 5 6 7 8 9 10];
 +
T = [0 1 2 3 4 3 2 1 2 3 4];
 +
 
 +
% Here a network is created with one hidden layer of 5 neurons.
 +
net = newff(P,T,5);
 +
 
 +
% Here the network is simulated and its output plotted against the targets.
 +
Y = sim(net,P);
 +
plot(P,T,P,Y,'o')
 +
 
 +
[[File:nn1.jpg]]
 +
 
 +
% Here the network is trained for 50 epochs. Again the network's output is plotted.
 +
net.trainParam.epochs = 50;
 +
net = train(net,P,T);
 +
Y = sim(net,P);
 +
plot(P,T,P,Y,'o')
 +
 +
[[File:nn2.jpg]]
 +
 
 +
====Some notes on the neural network and its learning algorithm====
 +
 
 +
* The activation functions are usually approximately linear around the origin. If this is the case, choosing random weights between <math>\,-0.5</math> and <math>\,0.5</math> and normalizing the data may speed up the algorithm in the very first steps of the procedure, as the linear combination of the inputs and weights falls within the linear region of the activation function.
 +
 
 +
* Learning of the neural network using the back-propagation algorithm takes place in epochs. An epoch is a single pass through the entire training set.
 +
 
 +
* It is a common practice to randomly change the permutation of the training data in each one of the epochs, to make the learning independent of the data permutation.
 +
 
 +
* Given a set of data for training a neural network, one should set aside a portion of it as the validation dataset, which is used to choose a suitable number of layers and number of neurons in each layer. The best architecture may be the one which leads to the least error on the validation dataset. Validation data must not be used as training data for the network (refer to cross-validation and k-fold validation explained in the next lecture).
 +
 
 +
* We can also use the validation-training scheme to estimate how many epochs are enough for training the network.
 +
 
 +
* It is also common to use other optimization algorithms, such as steepest descent and conjugate gradient, in batch form.
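The first and third notes above can be illustrated with a short Python sketch. It assumes <math>\,\tanh</math> as the activation function and uses made-up input values; the helper names <code>init_weights</code> and <code>epochs</code> are hypothetical.

```python
import math
import random

def init_weights(n_out, n_in, rng, limit=0.5):
    """Draw each weight uniformly from [-limit, limit]."""
    return [[rng.uniform(-limit, limit) for _ in range(n_in)] for _ in range(n_out)]

def epochs(data, n_epochs, rng):
    """Yield the training set once per epoch, in a fresh random order each time."""
    for _ in range(n_epochs):
        order = list(data)
        rng.shuffle(order)
        yield order

rng = random.Random(0)  # fixed seed only so that the sketch is repeatable
W = init_weights(3, 2, rng)
x = [0.4, -0.3]  # a normalized input (made-up values)
a = [sum(w * v for w, v in zip(row, x)) for row in W]
# With small weights and normalized inputs, tanh(a) stays close to a (the linear region).
gap = max(abs(math.tanh(v) - v) for v in a)
```

Here <code>gap</code> measures how far the neurons are from behaving linearly at initialization; with larger initial weights or unnormalized inputs it would grow, and the neurons would start closer to saturation.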
 +
 
 +
=== Deep Neural Network ===
 +
Back-propagation in practice may not work well when there are too many hidden layers, since the <math>\,\delta</math> terms may become negligible and the errors vanish. This is a numerical problem, where it is difficult to estimate the errors. So in practice, configuring a Neural Network with Back-propagation faces some subtleties.
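The shrinking of <math>\,\delta</math> can be seen in a toy example. The sketch below assumes a chain of hidden layers with one logistic neuron each, all weights equal to 1, and a hypothetical output weight of 1; it applies the recursion <math>\delta_j = \sigma'(a_j)\,\delta_i\,u_{ij}</math> from the output side towards the input.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def delta_magnitudes(weights, x, delta_out=1.0):
    """Chain of one-neuron sigmoid layers: return |delta| per layer, output side first."""
    # forward pass, remembering each pre-activation a
    pre, z = [], x
    for u in weights:
        a = u * z
        pre.append(a)
        z = sigmoid(a)
    # backward pass: delta_j = sigma'(a_j) * delta_i * u_ij
    deltas, delta = [], delta_out
    for depth in range(len(weights) - 1, -1, -1):
        s = sigmoid(pre[depth])
        # weight leading out of this neuron (1.0 stands in for the output-layer weight)
        upstream = weights[depth + 1] if depth + 1 < len(weights) else 1.0
        delta = s * (1.0 - s) * delta * upstream
        deltas.append(abs(delta))
    return deltas

deltas = delta_magnitudes([1.0] * 10, 0.5)  # ten hidden layers
```

Because the derivative of the logistic function never exceeds 0.25, each layer in this example multiplies <math>\,\delta</math> by at most 0.25, so after ten layers the value reaching the first hidden layer is below <math>\,10^{-5}</math> and the corresponding weight updates are negligible.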
 +
 
 +
Deep Neural Networks became popular two or three years ago, when they were introduced by Dr. Geoffrey E. Hinton, a Professor of computer science at the University of Toronto. The Deep Neural Network training algorithm [http://www.cs.toronto.edu/~hinton/absps/ncfast.pdf] deals with the training of a Neural Network with a large number of layers.
 +
 
 +
The approach of training the deep network is to assume the network has only two layers first and train these two layers. After that we train the next two layers, so on and so forth.
 +
 
 +
Although we know the input and we expect a particular output, we do not know the correct output of the hidden layers, and this will be the issue that the algorithm mainly deals with.
 +
There are two major techniques to resolve this problem: using a Boltzmann machine to minimize the energy function, which is inspired by the theory in atomic physics concerning the most stable condition; or somehow finding out what output of the second layer is most likely to lead us to the expected output at the output layer.
 +
 
 +
==== Difficulties of training deep architecture <ref>H. Larochelle, Y. Bengio, J. Louradour, P. Lamblin, Exploring Strategies for Training Deep Neural Networks [http://jmlr.csail.mit.edu/papers/volume10/larochelle09a/larochelle09a.pdf], year = 2009, Journal of Machine Learning Research, vol. 10, pp 1-40. </ref> ====
 +
 
 +
Given a particular task, a natural way to train a deep network is to frame it as an optimization
 +
problem by specifying a supervised cost function on the output layer with respect to the desired
 +
target and use a gradient-based optimization algorithm in order to adjust the weights and biases
 +
of the network so that its output has low cost on samples in the training set. Unfortunately, deep
 +
networks trained in that manner have generally been found to perform worse than neural networks
 +
with one or two hidden layers.
 +
 
 +
We discuss two hypotheses that may explain this difficulty. The first one is that gradient descent
 +
can easily get stuck in poor local minima (Auer et al., 1996) or plateaus of the non-convex training
 +
criterion. The number and quality of these local minima and plateaus (Fukumizu and Amari, 2000)
 +
clearly also influence the chances for random initialization to be in the basin of attraction (via
 +
gradient descent) of a poor solution. It may be that with more layers, the number or the width
 +
of such poor basins increases. To reduce the difficulty, it has been suggested to train a neural
 +
network in a constructive manner in order to divide the hard optimization problem into several
 +
greedy but simpler ones, either by adding one neuron (e.g., see Fahlman and Lebiere, 1990) or one
 +
layer (e.g., see Lengellé and Denoeux, 1996) at a time. These two approaches have demonstrated to
 +
be very effective for learning particularly complex functions, such as a very non-linear classification
 +
problem in 2 dimensions. However, these are exceptionally hard problems, and for learning tasks
 +
usually found in practice, this approach commonly overfits.
 +
 
 +
This observation leads to a second hypothesis. For high capacity and highly flexible deep networks,
 +
there actually exists many basins of attraction in its parameter space (i.e., yielding different
 +
solutions with gradient descent) that can give low training error but that can have very different generalization
 +
errors. So even when gradient descent is able to find a (possibly local) good minimum
 +
in terms of training error, there are no guarantees that the associated parameter configuration will
 +
provide good generalization. Of course, model selection (e.g., by cross-validation) will partly correct
 +
this issue, but if the number of good generalization configurations is very small in comparison
 +
to good training configurations, as seems to be the case in practice, then it is likely that the training
 +
procedure will not find any of them. But, as we will see in this paper, it appears that the type of
 +
unsupervised initialization discussed here can help to select basins of attraction (for the supervised
 +
fine-tuning optimization phase) from which learning good solutions is easier both from the point of
 +
view of the training set and of a test set.
 +
 
 +
===Neural Networks in Practice===
 +
Now that we know so much about Neural Networks, what are suitable real world applications? Neural Networks have already been successfully applied in many industries.
 +
 
 +
Since neural networks are good at identifying patterns or trends in data, they are well suited for prediction or forecasting needs, such as customer research, sales forecasting, risk management and so on.
 +
 
 +
Take a specific marketing case for example. A feedforward neural network was trained using back-propagation to assist the marketing control of airline seat allocations. The neural approach was adaptive to the rule. The system is used to monitor and recommend booking advice for each departure.
 +
 
 +
Neural networks have been applied to almost every field that one can think of. For the interested reader, a detailed description with links that discusses some of the many applications of neural networks is available [http://www.faqs.org/faqs/ai-faq/neural-nets/part7/section-2.html here].
 +
 
 +
=== Issues with Neural Network ===
 +
When Neural Networks were first introduced, they were thought to model the human brain, hence the fancy name "Neural Network". But now we know that they are just logistic regression layers stacked on top of each other and have nothing to do with how the brain actually functions.
 +
 
 +
We do not know why deep networks turn out to work quite well in practice. Some people claim that they mimic the human brain, but this is unfounded. As a result of these kinds of claims it is important to keep the right perspective on what this field of study is trying to accomplish. For example, the goal of machine learning may be to mimic the 'learning' function of the brain, but not necessarily the processes that the brain uses to learn.
 +
 
 +
As for the algorithm, as discussed above, since it does not have a convex form, it still faces the problem of getting trapped in local minima, although people have devised techniques to help it avoid this problem.
 +
 
 +
In sum, Neural Networks lack a strong learning theory to back up their "success", so it is hard to apply and tune them wisely. Having said that, they are still an active research area in machine learning, and they still have wide applications in engineering fields such as control.
 +
 
 +
===Business Applications of Neural Networks===
 +
 
 +
Neural networks are increasingly being used in real-world business applications and, in some cases, such as fraud detection, they have already become the method of choice. Their use for risk assessment is also growing and they have been employed to visualize complex databases for marketing segmentation. This method covers a wide range of business interests — from finance management, through forecasting, to production. The combination of statistical, neural and fuzzy methods now enables direct quantitative studies to be carried out without the need for rocket-science expertise.
 +
 
 +
* On the Use of Neural Networks for Analysis Travel Preference Data
 +
* Extracting Rules Concerning Market Segmentation from Artificial Neural Networks
 +
* Characterization and Segmenting the Business-to-Consumer E-Commerce Market Using Neural Networks
 +
* A Neurofuzzy Model for Predicting Business Bankruptcy
 +
* Neural Networks for Analysis of Financial Statements
 +
* Developments in Accurate Consumer Risk Assessment Technology
 +
* Strategies for Exploiting Neural Networks in Retail Finance
 +
* Novel Techniques for Profiling and Fraud Detection in Mobile Telecommunications
 +
* Detecting Payment Card Fraud with Neural Networks
 +
* Money Laundering Detection with a Neural-Network
 +
* Utilizing Fuzzy Logic and Neurofuzzy for Business Advantage
 +
 
 +
=== Further readings ===
 +
Bishop,C. "Neural Networks for Pattern Recognition"
 +
 
 +
Haykin, Simon. "Neural Networks. A Comprehensive Foundation" Available [http://www.esnips.com/doc/83becbe7-0fa6-4f90-a7c4-34697b63a8cb/Neural-Networks---A-Comprehensive-Foundation---Simon-Haykin here]
 +
 
 +
Nilsson,N. "Introduction to Machine Learning", Chapter 4: Neural Networks. Available [http://robotics.stanford.edu/people/nilsson/mlbook.html here]
 +
 
 +
Brian D. Ripley "Pattern Recognition and Neural Networks" Available [http://books.google.com/books?id=m12UR8QmLqoC&printsec=frontcover&dq=Neural+Networks+for+Pattern+Recognition&hl=en&ei=r3YCTbOlDMiYnAfh_JXmDQ&sa=X&oi=book_result&ct=result&resnum=3&ved=0CDYQ6AEwAg#v=onepage&q&f=false here]
 +
 
 +
G. Dreyfus "Neural networks: methodology and applications" Available [http://books.google.com/books?id=g2J4J2bLgRQC&printsec=frontcover&dq=Neural+Networks&hl=en&ei=WncCTaimM86lngeg-OzlDQ&sa=X&oi=book_result&ct=result&resnum=3&ved=0CD4Q6AEwAg#v=onepage&q&f=false here]
 +
 
 +
===References===
 +
 
 +
1. On fuzzy modeling using fuzzy neural networks with the back-propagation algorithm
 +
[http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=159069]
 +
 
 +
2. Thirty years of adaptive neural networks: perceptron, madaline and backpropagation
 +
[http://onlinelibrary.wiley.com/doi/10.1002/9780470231616.app7/pdf]
 +
 
 +
==Complexity Control  - October 26, 2010==
 +
 
 +
=== Lecture Summary ===
 +
Selecting a model structure with an appropriate complexity is a standard problem in pattern recognition and machine learning. Systems with the optimal complexity have good [http://www.csc.kth.se/~orre/snns-manual/UserManual/node16.html generalization] to as-yet-unobserved data.
 +
 
 +
A wide range of techniques may be used which alter the system complexity. In this lecture, we present the concepts of over-fitting & under-fitting, and an example to illustrate how we choose a good classifier and how to avoid over-fitting.
 +
 
 +
Moreover, [http://en.wikipedia.org/wiki/Cross-validation_%28statistics%29 cross-validation] was introduced during the lecture; it is a method for estimating generalization error based on "re-sampling" (Weiss and Kulikowski 1991; Plutowski, Sakata, and White 1994; Shao and Tu 1995)[1],[2],[3]. The resulting estimates of generalization error are often used for choosing among various models: the model with the smallest estimated generalization error is selected. Finally, the common types of cross-validation were addressed.
 +
 
 +
Before starting the next section, a short description of model complexity is necessary. As the name suggests, model complexity somehow describes how complicated our model is. Suppose we have a feed forward neural network -- if we increase the number of hidden layers or the number of nodes in a specific layer, it makes sense that our model is becoming more complex. Or, suppose we want to fit a polynomial function on some data points -- if we add to the degree of this polynomial it seems that we are choosing a more complex model. Intuitively, it seems that fitting a more complex model would be better, since we have more degrees of freedom and can get a more exact answer. The next section will explain why this is not the case, and why there is a trade-off between model complexity and optimal result. This makes it necessary to find methods for controlling complexity in model selection. We will see this procedure in an example.
 +
 
 +
=== Over-fitting and Under-fitting ===
 +
[[File:overfitting-model.png|500px|thumb|right|Figure 1. The overfitting model that uses kernel regression and smoothing splines passes through all of the points of the training set, but has poor predictive power for new data points that are not in the training set.
 +
 
 +
On the other hand, the line model makes more errors on the training points but it is better at extracting the main characteristic of the training points, i.e. the underlying function. Consequently, it has better predictive power for new data points that are not in the training set.]]
 +
There are [http://academicearth.org/lectures/underfitting-and-overfitting two issues] that we have to avoid in Machine Learning:
 +
#[http://en.wikipedia.org/wiki/Overfitting Overfitting]
 +
#Underfitting
 +
 
 +
If there were no noise in the training data, we would face no problem with over-fitting, because in this case every training data point would lie on the underlying function, and the only goal would be to build a model that is as complex as needed to pass through every training data point.
 +
 
 +
However, in the real world, the training data are [http://en.wikipedia.org/wiki/Statistical_noise noisy], i.e. they tend not to lie exactly on the underlying function; instead they may be shifted to unpredictable locations by random noise. If the model is more complex than it needs to be in order to accurately fit the underlying function, then it would end up fitting the noise in most or all of the training data. Consequently, it would be a poor approximation of the underlying function and have poor prediction ability on new, unseen data.
 +
 
 +
The danger of overfitting is that the model becomes susceptible to predicting values far outside the range of the training data. It can cause wild predictions in multilayer perceptrons, even with noise-free data. A common way to reduce overfitting is to use lots of training data. Unfortunately, that is not always sufficient: increasing the training data alone does not guarantee that over-fitting will be avoided. The best strategy is to use a large enough training set and to control the complexity of the model. The training set should have a sufficient number of data points which are sampled appropriately, so that it is representative of the whole data space.
 +
 
 +
In a Neural Network, if the number of hidden layers or nodes is too high, the network will have many degrees of freedom and will learn every characteristic of the training data set. That means it will fit the training set very precisely, but will not be able to generalize the commonality of the training set to predict the outcome of new cases.
 +
 
 +
Underfitting occurs when the model we picked to describe the data is not complex enough, and has a high error rate on the training set.
 +
There is always a trade-off. If our model is too simple, underfitting could occur and if it is too complex, overfitting can occur.
 +
 
 +
'''Example'''
 +
#Consider the example shown in the figure. We have a training set and want to find a model which fits it best. We can find a polynomial of high degree which passes through almost all points in the training set. But in reality, the training set comes from a linear model. Although the complex model has little error on the training set, it diverges from the line in other ranges in which we have no training points. As a result, the high degree polynomial has very poor prediction power on the test cases. This is an example of an overfitted model.
 +
#Now consider a training set which comes from a polynomial model of degree two. If we model this training set with a polynomial of degree one, our model will have a high error rate on the training set, and is not complex enough to describe the problem. This is an example of an underfitted model.
 +
#Consider a simple classification example: if our classification rule takes as input only the colour of a fruit and concludes that it is a banana, then it is not a good classifier. The reason is that just because a fruit is yellow does not mean that it is a banana. We can add complexity to our model to make it a better classifier by considering more features, such as size and shape. If we continue to make our model more and more complex in order to improve our classifier, we will eventually reach a point where the quality of our classifier no longer improves, i.e. we have overfit the data.  This occurs when we have considered so many features that we have perfectly described the existing bananas that we trained on, but if presented with a new banana of a slightly different shape, for example, it may not be detected.  This is the tradeoff: what is the right level of complexity?
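The first example can be reproduced numerically with a small Python sketch. It assumes an underlying linear model <math>\,y = x</math>, six training points with made-up alternating noise, a least-squares line as the simple model, and the degree-5 interpolating polynomial as the complex model; none of these specific values come from the lecture.

```python
def fit_line(xs, ys):
    """Least-squares line through the points (closed-form slope and intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def interpolate(xs, ys):
    """Lagrange polynomial of degree len(xs) - 1 that passes through every training point."""
    def p(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return p

def mse(model, xs, ys):
    """Mean squared error of the model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
noise = [0.5, -0.5, 0.5, -0.5, 0.5, -0.5]  # made-up noise values
train_y = [x + e for x, e in zip(train_x, noise)]
test_x = [0.5, 1.5, 2.5, 3.5, 4.5]
test_y = list(test_x)  # noise-free values of the assumed underlying function y = x

line = fit_line(train_x, train_y)
overfit = interpolate(train_x, train_y)
```

The interpolating polynomial has zero error on the training set, while the line does not; on the test points the ordering is reversed, and the polynomial's error is many times larger than the line's.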
 +
 
 +
Overfitting occurs when the model is too complex and underfitting occurs when it is not complex enough; neither is desirable.  To control complexity, it is necessary to make assumptions about the model before fitting the data. For example, we can assume that the model is a polynomial or a neural network. There are other choices as well.
 +
 
 +
[[File:Family_of_polynomials.jpg|200px|thumb|right|Figure 2: An example of a model with a family of polynomials]]
 +
We do not want a model to get too complex, so we control it by making an assumption on the model. With complexity control, we want a model or a classifier with a low error rate. The lecture will explain the [http://academicearth.org/lectures/bias-variance-tradeoff tradeoff between Bias and variance] for model complexity control.
 +
 
 +
'''Overfitted model and Underfitted model:'''
 +
 
 +
[[File:extrem_model.jpg|400px|thumb|right|Figure 3]]
 +
After the structure of the model is determined, the next step is to do the model selection. The problem encountered is how to estimate the parameters effectively, especially when we use iterative methods to do the estimation. In an iterative method, the key point is to determine the best time to stop updating the parameters.
 +
Let us see a very simple example; assume the dotted line on the graph can be expressed as a function <math>\,h(x)</math>, and the data points (the circles) are generated by the function with added noise.
 +
 
 +
 
 +
'''Model 1'''(as shown on the left of Figure 3)
 +
A line <math>\,g(x)</math> can be used to describe the data points, where two parameters are needed to construct the estimated function.  However, it is clear that it performs badly. This model is a typical example of an underfitted model. In this case, the model will have a low variance in prediction, but a large bias.
 +
 
 +
'''Model 2''' (as shown on the right of Figure 3)
 +
In this model, lots of parameters are used to fit the data. Although it looks like a fairly good fit, the prediction performance could be very bad. This means that this model will generate a large variance when we use it on points not part of the training data.
 +
The models above are the extreme cases in model selection, and we do not want to choose either of them in our classification task. The key is to stop our training process at the optimal time, such that a balance of bias and variance is obtained, that is, at the time <math>\,t</math> in the following graph.
 +
 
 +
[[File:optimal_time.jpg|300px|thumb|right|Figure 4]]
 +
 
 +
To achieve this goal, one approach we can use is to divide our data points into two groups: one (the training set) is used in the training process to obtain the parameters, and the other (the validation set) is used for determining the optimal model. After every update of the parameters, the model is tested on the validation set and the error curve is plotted to find the optimal point <math>\,t</math>. Here, the validation error is a good measure of generalization. Remember not to update the parameters during the validation test. If another, independent test is needed to follow validation, three independent groups should be determined at the beginning. In addition, this approach is suitable when many data points are available, since the effect of noise can then be reduced.
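Choosing the stopping point from the validation curve can be sketched as follows. The error values are made up for illustration, and <code>best_stopping_time</code> is a hypothetical helper; in practice the two lists would be recorded after each parameter update.

```python
def best_stopping_time(validation_errors):
    """Return the update index t with the smallest validation error (the first one on ties)."""
    return min(range(len(validation_errors)), key=validation_errors.__getitem__)

# Made-up error curves recorded after each parameter update.
train_errors = [0.90, 0.60, 0.40, 0.30, 0.22, 0.15, 0.10, 0.06]
validation_errors = [0.95, 0.70, 0.52, 0.45, 0.43, 0.47, 0.55, 0.66]

t = best_stopping_time(validation_errors)
```

In this made-up run the training error keeps falling, but the validation error is smallest at <code>t = 4</code>, so the parameters obtained after that update would be kept.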
 +
 
 +
So far, we have learned two of the most popular ways to estimate the expected level of fit of a model to a test data set that is independent of the data used to train the model:
 +
:1. Cross validation
 +
:2. Regularization: refers to a series of techniques we can use to suppress overfitting, that is, to keep our function from becoming so curved that it performs badly in prediction. Specifically, a penalty term is added to the error function, which prevents the weights from growing too much when they are updated at each iteration.
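The effect of a penalty term can be illustrated with a small sketch. It assumes a linear model with a sum-of-squares error and a squared-weight penalty <math>\lambda \sum_j w_j^2</math>, fitted by plain gradient descent; the function names, learning rate and step count are illustrative choices, not part of the lecture.

```python
def penalized_gradient_step(weights, xs, ys, lam, rate):
    """One gradient step on sum_i (w.x_i - y_i)^2 + lam * sum_j w_j^2."""
    grads = [2.0 * lam * w for w in weights]          # gradient of the penalty
    for x, y in zip(xs, ys):
        residual = sum(w * xi for w, xi in zip(weights, x)) - y
        for j, xj in enumerate(x):
            grads[j] += 2.0 * residual * xj           # gradient of the squared error
    return [w - rate * g for w, g in zip(weights, grads)]


def fit_linear(xs, ys, lam, rate=0.01, steps=200):
    """Run gradient descent from zero weights (assumed settings)."""
    weights = [0.0] * len(xs[0])
    for _ in range(steps):
        weights = penalized_gradient_step(weights, xs, ys, lam, rate)
    return weights


xs = [[1.0], [2.0]]
ys = [1.0, 2.0]
print(fit_linear(xs, ys, lam=0.0))   # approaches the unpenalized weight 1
print(fit_linear(xs, ys, lam=5.0))   # the penalty shrinks the weight toward 0
```

For this one-dimensional example the penalized minimizer is <math>\sum x_i y_i / (\sum x_i^2 + \lambda)</math>, so a larger <math>\lambda</math> gives a smaller weight.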
 +
 
 +
Indeed, there are many other techniques that could be used, such as:
 +
:1.[http://en.wikipedia.org/wiki/Akaike_information_criterion Akaike information criterion]
 +
:2.[http://en.wikipedia.org/wiki/Bayesian_information_criterion Bayesian information criterion]
 +
:3.[http://en.wikipedia.org/wiki/Mallows'_Cp Mallows' Cp]
 +
 
 +
===='''Note'''====
 +
When the model is linear, the true error from the AIC approach is identical to that from the Cp approach; when the model is nonlinear, they are different.
 +
 
 +
=== '''How do we choose a good classifier?''' ===
 +
 
 +
Our goal is to find a classifier that minimizes the true error rate <math>\ L(h)</math>.
 +
 
 +
<math>\ L(h)=Pr\{h(x)\neq y\}</math>
 +
 
 +
Recall the empirical error rate
 +
 
 +
<math>\ \hat L(h)= \frac{1}{n} \sum_{i=1}^{n} I(h(x_{i}) \neq y_{i})</math>
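The empirical error rate is simply the fraction of misclassified points. A minimal sketch follows; the threshold classifier and the toy data are hypothetical and serve only to exercise the formula.

```python
def empirical_error_rate(h, xs, ys):
    """Fraction of points (x_i, y_i) for which h(x_i) != y_i."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    mistakes = sum(1 for x, y in zip(xs, ys) if h(x) != y)
    return mistakes / len(xs)


def threshold_rule(x):
    """Hypothetical classifier on one-dimensional inputs."""
    return 1 if x > 0.5 else 0


xs = [0.1, 0.4, 0.6, 0.9]
ys = [0, 1, 1, 1]
print(empirical_error_rate(threshold_rule, xs, ys))  # one mistake out of four
```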
 +
 
 +
<span id="prediction-error">[[File:Prediction_Error.jpg|200px|thumb|right|Figure 3]]</span>
 +
The training error estimate is biased downward: it typically underestimates the true error rate.
 +
 
 +
As the model complexity changes from low to high, the training (empirical) error rate always decreases. When we apply the model to the test data, the error rate decreases up to a point and then increases, because an overly complex model also fits the noise in the training data, which does not carry over to test points it has not seen before. This results in a convex test error curve as a function of model complexity. The training error keeps decreasing as we fit increasingly complex models, but as we have seen, a model that is too complex will not generalize well, resulting in a large test error.
 +
 
 +
We use our test data (from the test sample line shown in Figure 2) to estimate our true error rate.
 +
The right complexity is defined as the point where the true error rate (the error rate associated with the test data) is at its minimum; this is one idea behind complexity control.
 +
 
 +
[[File:Bias.jpg|200px|thumb|left|Figure 4]]
 +
 
 +
We assume that we have samples <math>\,x_1, . . . ,x_n</math> that follow some (possibly unknown) distribution. We want to estimate a parameter <math>\,f</math> of the unknown distribution. This parameter may be the mean <math>\,E(x_i)</math>, the variance <math>\,var(x_i)</math> or some other quantity.
 +
 
 +
The unknown parameter <math>\,f</math> is a fixed real number <math>f\in R</math>. To estimate it, we use an estimator which is a
 +
function of our observations, <math>\hat{f}(x_1,...,x_n)</math>.
 +
 
 +
<math>Bias (\hat{f}) = E(\hat{f}) - f</math>
 +
 
 +
<math>MSE (\hat{f}) = E[(\hat{f} - f)^2]=Variance (\hat f)+Bias^2(\hat f )</math>
 +
 
 +
<math>Variance (\hat{f}) = E[(\hat{f} - E(\hat{f}))^2]</math>
 +
 
 +
One desired property of the estimator is that it is correct on average, that is, it is unbiased. <math>Bias (\hat{f}) = E(\hat{f}) - f=0</math>.
 +
However, there is a more important property for an estimator than just being unbiased: low mean squared error. In statistics, there are problems for which it may be better to use an estimator with a small bias. In some cases, an estimator with a small bias may have a lower mean squared error or be median-unbiased (rather than mean-unbiased, the standard unbiasedness property). The property of median-unbiasedness is invariant under transformations, while the property of mean-unbiasedness may be lost under nonlinear transformations. For example, if we use an unbiased estimator with a large mean squared error to estimate the parameter, we risk a big error. In contrast, a biased estimator with a small mean squared error will improve the precision of our predictions.
 +
 
 +
Hence, our goal is to minimize <math>MSE (\hat{f})</math>.
 +
 
 +
From Figure 4, we can see that the relationship among the three quantities is:
 +
<math>MSE (\hat{f})=Variance (\hat{f})+Bias ^2(\hat{f}) </math>. Thus, for a fixed Mean Squared Error (MSE), a lower bias implies a higher variance and vice versa.
 +
 
 +
'''Algebraic Proof''':
 +
 
 +
<math>MSE (\hat{f}) = E[(\hat{f} - f)^2] = E[(\hat{f} - E(\hat{f}) + E(\hat{f}) - f)^2]</math>
 +
 
 +
<math>= E[(\hat{f} - E(\hat{f}))^2+(E(\hat{f}) - f)^2 + 2(\hat{f} - E(\hat{f}))(E(\hat{f}) - f)]</math>
 +
 
 +
<math>= E(\hat{f} - E(\hat{f}))^2 + E(E(\hat{f}) - f)^2 + E(2(\hat{f} - E(\hat{f}))(E(\hat{f}) - f))</math>
 +
 
 +
By definition,
 +
 
 +
<math>E(\hat{f} - E(\hat{f}))^2 = Var(\hat{f})</math> 
 +
 
 +
<math>E(E(\hat{f}) - f)^2 = (E(\hat{f}) - f)^2 = Bias^2(\hat{f})</math>, since <math>E(\hat{f}) - f</math> is a constant
 +
 
 +
So we must show that:
 +
 
 +
<math>E(2(\hat{f} - E(\hat{f}))(E(\hat{f}) - f)) = 0</math>
 +
 
 +
<math>E(2(\hat{f} - E(\hat{f}))(E(\hat{f}) - f)) = 2E(\hat{f}E(\hat{f}) - \hat{f}f - E(\hat{f})E(\hat{f}) + E(\hat{f})f)</math>
 +
 
 +
<math>= 2(E(\hat{f})E(\hat{f}) - E(\hat{f})f - E(\hat{f})E(\hat{f}) + E(\hat{f})f) = 0</math>
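The decomposition can also be checked numerically. The sketch below is an illustration under assumptions that are not in the lecture: the parameter is the mean of a normal distribution, and the estimator is a sample mean deliberately shrunk toward zero so that it is biased. Expectations are replaced by averages over repeated samples, for which the identity holds exactly up to rounding.

```python
import random


def mse_decomposition(estimates, true_value):
    """Return (mse, variance, bias) computed over a list of estimates."""
    n = len(estimates)
    mean_estimate = sum(estimates) / n
    bias = mean_estimate - true_value
    variance = sum((e - mean_estimate) ** 2 for e in estimates) / n
    mse = sum((e - true_value) ** 2 for e in estimates) / n
    return mse, variance, bias


def shrunk_mean(sample, shrink=0.8):
    """A deliberately biased estimator: the sample mean scaled toward zero."""
    return shrink * sum(sample) / len(sample)


rng = random.Random(0)            # fixed seed so the run is repeatable
true_mean = 2.0
estimates = [
    shrunk_mean([rng.gauss(true_mean, 1.0) for _ in range(10)])
    for _ in range(2000)
]
mse, variance, bias = mse_decomposition(estimates, true_mean)
print(mse, variance + bias ** 2)  # the two numbers agree up to rounding
```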
 +
 
 +
 
 +
The test error is a good estimate of the MSE. We want a balance between bias and variance (neither of them too high), even though this means accepting some bias.
 +
 
 +
=== References ===
 +
 
 +
1. A Comparison of Prediction Accuracy, Complexity, and Training Time of Thirty-Three Old and New Classification Algorithms
 +
[http://www.springerlink.com/content/u751321011502645.pdf]
 +
 
 +
2. Model complexity control and statistical learning theory
 +
[http://www.springerlink.com/content/wh40jlnrbr6cnh9x/]
 +
 
 +
3. On Dimensionality, Sample Size, Classification Error, and Complexity of Classification Algorithm in Pattern Recognition
 +
[http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4767011]
 +
 
 +
4. Overfitting, Underfitting and Model Complexity
 +
[http://www.chemometrie.com/phd/2_8_1.html]
 +
 
 +
=== Avoid Overfitting ===
 +
 
 +
There are 2 main approaches to avoid overfitting:
 +
 
 +
1. Estimating error rate
 +
 
 +
<math>\hookrightarrow</math> Empirical training error is not a good estimate
 +
 
 +
<math>\hookrightarrow</math> Empirical test error is a better estimate
 +
 
 +
<math>\hookrightarrow</math> Cross-Validation is fast
 +
 
 +
<math>\hookrightarrow</math> Computing error bound (analytically) using some probability inequality.
 +
 
 +
We will not discuss computing the error bound in class; however, a popular method for doing this computation is called VC Dimension (short for Vapnik–Chervonenkis Dimension). Information can be found from [http://www.autonlab.org/tutorials/vcdim.html Andrew Moore] and [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.10.7171&rep=rep1&type=pdf Steve Gunn].
 +
 
 +
2. Regularization
 +
 
 +
<math>\hookrightarrow</math> Use of shrinkage method
 +
 
 +
<math>\hookrightarrow</math> Decrease the chance of overfitting by controlling the weights
 +
 
 +
<math>\hookrightarrow</math> Weight Decay: bound the complexity and non-linearity of the output by a new regularized cost function.
 +
 
 +
=== Cross-Validation ===
 +
 
 +
'''Cross-validation''', sometimes called '''rotation estimation''', is a technique for assessing how the results of a statistical analysis will generalize to an independent data set.  It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice.  One round of cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (called the ''training set''), and validating the analysis on the other subset (called the ''validation set'' or ''testing set'').  To reduce variability, multiple rounds of cross-validation are performed using different partitions, and the validation results are averaged over the rounds.
 +
 
 +
[[File:Cv.jpg|200px|thumb|right|Figure 1: Illustration of Cross-Validation]]
 +
[http://en.wikipedia.org/wiki/Cross-validation_%28statistics%29 Cross-Validation] is the simplest and most widely used method to estimate the true error.
 +
 
 +
Here is a general description of cross-validation:
 +
 
 +
Given a set of collected data for which we know the proper labels,
 +
 
 +
:1) Randomly divide the data into two parts, Training data (T) and Validation data (V)
 +
 
 +
:2) Train the classifier using only data in T
 +
 
 +
:3) Estimate the true error rate, <math>\begin{align}\hat L(h)\end{align}</math>, using only data in V
 +
 
 +
:<math>\hat L(h) = \frac{1}{|\mathrm{V}|}\sum_{x_i \in \mathrm{V}}I(h(x_i) \neq y_i)</math>, where <math>\begin{align}\,|\mathrm{V}|\end{align}</math> is the cardinality of the validation set and
 +
:<math>\, I(h(x_i) \neq y_i)= \left\{\begin{matrix}
 +
1 &  h(x_i) \neq y_i  \\
 +
0 &  \mathrm{otherwise}  \end{matrix}\right.</math>
 +
 
 +
Note that the validation set is completely unknown to the trained model, but the proper labels of all elements in this set are known. Therefore, it is easy to count the number of misclassified points in V.
 +
 
 +
The best classifier is the model with the minimum estimated true error, <math>\begin{align}\hat L(h)\end{align}</math>.
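Steps 1) to 3) can be sketched as follows. The nearest-class-mean classifier, the 30% validation fraction and the toy data are assumptions made for illustration; any training procedure that returns a classification rule could be passed in instead.

```python
import random


def train_nearest_mean(xs, ys):
    """Hypothetical 1-D classifier: predict the label with the closest class mean."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    means = {label: sums[label] / counts[label] for label in sums}
    return lambda x: min(means, key=lambda label: abs(x - means[label]))


def holdout_error(xs, ys, train, validation_fraction=0.3, seed=0):
    """Split randomly into T and V, train on T only, return the error rate on V."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    n_val = max(1, int(len(xs) * validation_fraction))
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    h = train([xs[i] for i in train_idx], [ys[i] for i in train_idx])
    mistakes = sum(1 for i in val_idx if h(xs[i]) != ys[i])
    return mistakes / n_val


# Two well-separated classes, ten points each.
xs = [i / 10 for i in range(10)] + [10 + i / 10 for i in range(10)]
ys = [0] * 10 + [1] * 10
print(holdout_error(xs, ys, train_nearest_mean))
```

Changing `seed` changes the split, which is exactly the sensitivity to the choice of T and V that a single split suffers from.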
 +
 
 +
=== K-Fold Cross-Validation ===
 +
[[File:k-fold.png|350px|thumb|right|Figure 2: K-fold cross-validation]]
 +
The results from the method above may differ significantly based on the initial choice of T and V. Therefore, we improve simple cross-validation by introducing K-fold cross-validation.
 +
The advantage of K-fold cross-validation is that all the values in the data set are eventually used for both training and testing. When using K-fold cross-validation, the number of folds must be chosen with care. With a large data set, fewer folds can be used, because a smaller portion of the total data is enough to train the classifier; this leaves more test data in each fold and therefore a better estimate of the test error. Unfortunately, the more folds one uses, the longer the cross-validation will run. With a small data set, more (and therefore smaller) folds are needed so that each training set is large enough to properly train the classifier.
 +
 
 +
In this case, the algorithm is:
 +
 
 +
Given a set of collected data for which we know the proper labels,
 +
 
 +
: 1) Randomly divide the data into K parts with approximately equal size
 +
 
 +
: 2) For k = 1,...,K
 +
 
 +
: 3) Remove part k and train the classifier using data from all parts except part k
 +
 
 +
: 4) Compute the error rate, <math>\begin{align}\hat L_k(h)\end{align}</math>, using only data in part k
 +
 
 +
: <math>\hat L_k(h) = \frac{1}{m} \sum_{i=1}^{m} I(h(x_{i}) \neq y_{i})</math>, where <math>m</math> is the number of data points in part k
 +
 
 +
: 5) End loop
 +
 
 +
: 6) Compute the average error <math>\hat L(h) = \frac{1}{K} \sum_{k=1}^{K} \hat L_k(h)</math>
 +
 
 +
Once again, the best classifier is the model with minimum average error, <math>\begin{align}\hat L(h)\end{align}</math>.
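The K-fold procedure above can be sketched as follows. The fold construction, the nearest-class-mean trainer and the toy data are illustrative assumptions; the sketch also assumes there are at least K data points so that no fold is empty.

```python
import random


def train_nearest_mean(xs, ys):
    """Hypothetical 1-D classifier: predict the label with the closest class mean."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    means = {label: sums[label] / counts[label] for label in sums}
    return lambda x: min(means, key=lambda label: abs(x - means[label]))


def make_folds(n, K, seed=0):
    """Step 1: randomly divide indices 0..n-1 into K parts of nearly equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[k::K] for k in range(K)]


def k_fold_error(xs, ys, train, K=5, seed=0):
    """Steps 2-6: hold out each part in turn and average the K error rates."""
    folds = make_folds(len(xs), K, seed)
    fold_errors = []
    for k in range(K):
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        h = train([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        mistakes = sum(1 for i in folds[k] if h(xs[i]) != ys[i])
        fold_errors.append(mistakes / len(folds[k]))
    return sum(fold_errors) / K


xs = [i / 10 for i in range(10)] + [10 + i / 10 for i in range(10)]
ys = [0] * 10 + [1] * 10
print(k_fold_error(xs, ys, train_nearest_mean, K=5))
```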
 +
 
 +
In class we mentioned that <math>\begin{align}\hat L(h)\end{align}</math> is a high variance estimator of the error rate, but it is unbiased.
 +
 
 +
Figure 2 is an illustration of data that is divided into four roughly equal parts.
 +
 
 +
=== Leave-One-Out Cross-Validation - October 28, 2010 ===
 +
 
 +
Leave-one-out cross validation is used to determine how accurately a learning algorithm will be able to predict data that it was not trained on. When using the leave-one-out method, the learning algorithm is trained multiple times, using all but one of the training set data points. The form of the algorithm is as follows:
 +
 
 +
For k = 1 to n (where n is the number of points in our dataset)
 +
 
 +
* Temporarily remove the kth data point.
 +
 
 +
* Train the learning algorithm on the remaining n - 1 points.
 +
 
 +
* Test the removed data point and note your error.
 +
 
 +
Calculate the mean error over all n data points.
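The loop above can be sketched generically. Here `fit` and `loss` are hypothetical placeholders for any learning algorithm and error measure; the example uses the sample mean as the "model" and squared error as the loss.

```python
def leave_one_out_error(points, fit, loss):
    """Average loss when each point is predicted by a model fit to the others."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two data points")
    total = 0.0
    for k in range(n):
        held_out = points[k]                       # temporarily remove point k
        remaining = points[:k] + points[k + 1:]    # the other n - 1 points
        model = fit(remaining)                     # train on the remaining points
        total += loss(model, held_out)             # test on the removed point
    return total / n


def fit_mean(values):
    """Hypothetical model: predict the mean of the training values."""
    return sum(values) / len(values)


def squared_error(prediction, value):
    return (prediction - value) ** 2


print(leave_one_out_error([1.0, 2.0, 3.0, 6.0], fit_mean, squared_error))
```

For the four values shown, the held-out squared errors are 64/9, 16/9, 0 and 16, so the mean is 56/9. The model is refit n times, which is the computational cost discussed in the text.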
 +
 
 +
Leave-one-out cross validation is useful because it does not waste data. When training, all but one of the points are used, so the resulting regression or classification rules are essentially the same as if they had been trained on all the data points. The main drawback to the leave-one-out method is that it is expensive - the computation must be repeated as many times as there are training set data points.
 +
 
 +
 
 +
Leave-one-out cross-validation is the special case of k-fold cross-validation in which <math>\,k = n</math>, so that each partition consists of a single data point. Each partition is used for testing exactly once, whereas the remaining partitions are used for training. For example, to compare <math>\,m</math> linear models, we fit each model to the <math>\,n-1</math> remaining data points for each of the <math>\,n</math> partitions, and compare the average error rates of the <math>\,m</math> linear models. The leave-one-out error is the average error over all partitions.<br />
 +
 
 +
 
 +
In the above example, we can see that k-fold cross-validation can be computationally expensive: for every possible value of the parameter, we must train the model <math>\,K</math> times. This deficiency is even more obvious in leave-one-out cross-validation, where we must train the model <math>\,n</math> times, where <math>\,n</math> is the number of data points in the data set.<br />
 +
 
 +
But an expensive computational load does not tell the whole story. Why do we need this validation? The key factor is not having enough data points! In some real-world problems, gathering data points can be very expensive or time-consuming. Suppose we want to study the effect of a new drug on the human body. To do this, we must test the drug on some patients. However, it is very hard to convince people to take part in this procedure, since testing a new drug carries risks and possible side effects. As well, a long-term study needs to be done to observe any long-term effects. In a similar manner, we lack data points or observations in some problems. If we use K-fold cross-validation and divide the data points into training and test sets, then we may not have enough data to train the neural network or fit any other model, and underfitting may occur. To avoid this, the best option is leave-one-out cross-validation. In this way we take full advantage of the data points we have and are still able to test the model.
 +
 
 +
Leave-one-out cross-validation often works well for estimating generalization error for continuous error functions such as the mean squared error, but it may perform poorly for discontinuous error functions such as
 +
the number of misclassified cases. In the latter case, k-fold cross-validation is preferred. But if k gets too small, the error estimate is pessimistically biased because of the difference in training-set size between the full-sample analysis and the cross-validation analyses.
 +
 
 +
However, in the linear model, we can save complexity analytically. A model is ''correct'' if the mean response is the linear combination of subsets of a vector and the columns of <math>X_n</math>. Let <math>A_n</math> be a finite set of proposed models. Let <math>a_n^L</math> be the model minimizing average squared error, then the selection procedure is ''consistent'' if the probability of the model selected being <math>a_n^L</math> approaches 1. Leave-one-out is correct, can be inconsistent, and given
 +
 
 +
* <math>\max_{i \le n} x_i^t (X_n^tX_n)^{-1} x_i \to 0</math>
 +
 
 +
is asymptotically equivalent to AIC, which performs slightly worse than k-fold <ref>Shao, J. ''An asymptotic theory for linear model selection,'' Statistica Sinica, 7, 221-264 (1997).</ref>. AIC has an asymptotic probability of one of choosing a "good" subset, but less than one of choosing the "best" subset. Many simulation studies have also found that AIC overfits badly in small samples. Hence, these results suggest that leave-one-out
 +
cross-validation should overfit in small samples.
 +
<br />
 +
 
 +
Leave-one-out cross-validation can perform poorly in comparison to k-fold validation. A paper by Breiman compares k-fold (leave-many-out) cross-validation to leave-one-out cross-validation, noting that average prediction loss and downward bias increase from k-fold to leave-one-out <ref>Breiman, L. ''Heuristics of instability and stabilization in model selection,'' Annals of Statistics, 24, 2350-2383 (1996).</ref>. This can be explained by the lower bias of leave-one-out validation, which comes at the cost of an increase in variance. The bias depends on the size of the training set relative to the full sample [http://en.wikipedia.org/wiki/Cross-validation_%28statistics%29#Leave-one-out_cross-validation]. As such, as k becomes larger, the error estimate becomes less biased and has more variance. Similarly, larger data sets will direct the bias toward zero.<br /><br />
 +
 
 +
====k × 2 cross-validation====
 +
This is a variation on k-fold cross-validation. For each fold, we randomly assign data points to two sets d0 and d1, so that both sets are of equal size (this is usually implemented by shuffling the data array and then splitting it in two). We then train on d0 and test on d1, followed by training on d1 and testing on d0.
 +
This has the advantage that our training and test sets are both large, and each data point is used for both training and validation on each fold. In general, k = 5 (resulting in 10 training/validation operations) is a commonly recommended value of k for this type of cross-validation.
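A minimal sketch of the k × 2 procedure follows, assuming the shuffle-and-split implementation mentioned above. The nearest-class-mean trainer and the function names are illustrative assumptions; the function returns all 2k error rates so that they can be averaged or compared.

```python
import random


def train_nearest_mean(xs, ys):
    """Hypothetical 1-D classifier: predict the label with the closest class mean."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    means = {label: sums[label] / counts[label] for label in sums}
    return lambda x: min(means, key=lambda label: abs(x - means[label]))


def k_by_2_errors(xs, ys, train, k=5, seed=0):
    """Return the 2k error rates from k random half/half splits."""
    rng = random.Random(seed)
    errors = []
    for _ in range(k):
        indices = list(range(len(xs)))
        rng.shuffle(indices)                     # shuffle, then split in two
        half = len(indices) // 2
        d0, d1 = indices[:half], indices[half:]
        # Train on d0 and test on d1, then swap the roles of the two halves.
        for train_idx, test_idx in ((d0, d1), (d1, d0)):
            h = train([xs[i] for i in train_idx], [ys[i] for i in train_idx])
            mistakes = sum(1 for i in test_idx if h(xs[i]) != ys[i])
            errors.append(mistakes / len(test_idx))
    return errors


xs = [i / 10 for i in range(10)] + [10 + i / 10 for i in range(10)]
ys = [0] * 10 + [1] * 10
errors = k_by_2_errors(xs, ys, train_nearest_mean, k=5)
print(len(errors), sum(errors) / len(errors))
```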
 +
 
 +
* One-item-out: [http://biomet.oxfordjournals.org/content/64/1/29.abstract Asymptotics for and against cross-validation]
 +
* [http://www.springerlink.com/content/tfvyva1cqvtqacvy/fulltext.pdf Leave-one-out style crossvalidation bound for Kernel methods applied to some classification and regression problems]
 +
 
 +
=== Matlab Code for Cross Validation ===
 +
1. Generate cross-validation indices using the MATLAB built-in function 'crossvalind.m'. Click [http://www.mathworks.com/help/toolbox/bioinfo/ref/crossvalind.html here] for details.
 +
 
 +
2. Use 'cvpartition.m' to partition data. Click [http://www.mathworks.com/help/toolbox/stats/cvpartition.html here].
 +
 
 +
=== Further Reading ===
 +
1. Two useful pdf's introducing concepts of cross validation. [http://www.autonlab.org/tutorials/overfit10.pdf] [http://www.autonlab.org/tutorials/overfit10.pdf]
 +
 
 +
=== References ===
 +
1. Sholom M. Weiss and Casimir A. Kulikowski, Computer Systems That Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning and Expert Systems.
 +
Morgan Kaufmann, 1991.
 +
 
 +
2. M. Plutowski, S. Sakata and H. White: "Cross-Validation Estimates Integrated Mean Squared Error," in J. Cowan, G. Tesauro, and J. Alspector, eds., Advances in Neural Information Processing Systems 6. San Francisco: Morgan Kaufmann, 391-398 (1994).
 +
 
 +
3. Shao, J. and Tu D. (1995). The Jackknife and Bootstrap. Springer, New York.
 +
 
 +
4. http://en.wikipedia.org/wiki/Cross-validation_(statistics)
 +
 
 +
== Radial Basis Function (RBF) Network  - October 28, 2010==
 +
 
 +
[[File:Rbf_net.png|350px|thumb|right|Figure 1: Radial Basis Function Network]]
 +
 
 +
=== Introduction ===
 +
 
 +
A [http://en.wikipedia.org/wiki/Radial_basis_function_network Radial Basis Function] (RBF) network is a type of artificial neural network with:
 +
 
 +
* an output layer,
 +
* a single hidden layer,
 +
* weights from the hidden layer to the output layer,
 +
* and no weights from the first layer to the hidden layer.
 +
 
 +
An RBF network can be trained without back propagation since it has a closed-form solution. The neurons in the hidden layer contain basis functions. A common basis function for an RBF network is a Gaussian function without the scaling factor.
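As a minimal sketch of the forward pass, the snippet below assumes Gaussian basis functions of the form <math>\exp(-\|x-\mu_j\|^2 / (2\sigma^2))</math> with a single shared width <math>\sigma</math>. The shared width, the function names and the example values are assumptions made for illustration; the hidden-to-output weights are taken as given here rather than computed from the closed-form solution.

```python
import math


def gaussian_basis(x, center, width):
    """Gaussian without the scaling factor: exp(-||x - center||^2 / (2 width^2))."""
    sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-sq_dist / (2.0 * width ** 2))


def rbf_output(x, centers, width, weights):
    """Weighted sum of the hidden-layer basis function values."""
    hidden = [gaussian_basis(x, c, width) for c in centers]
    return sum(w * h for w, h in zip(weights, hidden))


centers = [[0.0], [3.0]]     # hypothetical centers mu_1, mu_2
weights = [2.0, -1.0]        # hypothetical hidden-to-output weights
print(rbf_output([0.0], centers, 1.0, weights))
```

Because the output is linear in the weights once the centers and width are fixed, the weights can be found by least squares, which is why no back propagation is needed.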
 +
 
 +
* Note: [http://ibiblio.org/e-notes/Splines/Intro.htm Spline], RBF, [http://www.aaai.org/Papers/Workshops/1999/WS-99-04/WS99-04-008.pdf Fourier], and similar methods differ only in the basis function.<br />
 +
 
 +
RBF networks were first used in solving multivariate interpolation problems and in numerical analysis. They play a similar role in neural network applications, where the training and query targets are continuous. RBF networks are artificial neural networks, and they can be applied to regression, classification and time series prediction.
 +
 
 +
For example, consider <math>\,n</math> data points along a one-dimensional line and <math>\,m</math> clusters. An RBF network with radial basis (Gaussian) functions will cluster points around the <math>\,m</math> means, <math>\displaystyle\mu_{j}</math> for <math>j= 1, ..., m</math>. The other data points will be distributed normally around these centers.
 +
 
 +
* Note: The hidden layer can have a variable number of basis functions (the optimal number of basis functions can be determined using the complexity control techniques discussed in the previous section). As usual, the more basis functions there are in the hidden layer, the higher the model complexity will be.<br />
 +
 
 +
RBF networks, K-means clustering, Probabilistic Neural Networks (PNN) and General Regression Neural Networks (GRNN) are closely related. The main difference is that PNN/GRNN networks have one neuron for each point in the training file, whereas the number of neurons (basis functions) in an RBF network is not fixed and is usually much smaller than the number of training points. When the training set is not very large, PNN and GRNN perform well. For large data sets, however, RBF networks are more useful, since PNN/GRNN become impractical.
 +
 
 +
====A brief introduction to the K-means algorithm====
 +
K-means is a commonly applied clustering technique, which aims to divide <math>\,n</math> observations into <math>\,k</math> groups by computing the distance from each individual observation to the <math>\,k</math> cluster centers. A typical K-means algorithm can be described as follows:
 +
 
 +
Step1: Select <math>\,k</math> as the number of clusters
 +
 
 +
Step2: Randomly select <math>\,k</math> observations from the <math>\,n</math> observations, to be used as <math>\,k</math> initial centers.
 +
 
 +
Step3: For each data point from the rest of observations, compute the distance to each of the <math>\,k</math> initial centers and classify it into the cluster with the minimum distance.
 +
 
 +
Step4: Obtain updated <math>\,k</math> cluster